CN115035538A

CN115035538A - Training method of text recognition model, and text recognition method and device

Info

Publication number: CN115035538A
Application number: CN202210685043.2A
Authority: CN
Inventors: 章成全; 庾悦晨; 李煜林; 曹健健; 钦夏孟; 姚锟; 韩钧宇; 刘经拓; 丁二锐; 王井东
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-03-22
Filing date: 2022-03-22
Publication date: 2022-09-09
Anticipated expiration: 2042-03-22
Also published as: JP2022177242A; CN115035538B; KR20220122566A; CN114399769B; CN114399769A

Abstract

The disclosure provides a training method of a text recognition model, a text recognition method and a text recognition device, relates to the technical field of artificial intelligence, particularly to the technical field of deep learning and computer vision, and can be applied to scenes such as optical character recognition. The scheme is as follows: the method comprises the steps of conducting mask prediction on partial images in an acquired first sample image to obtain a complete prediction image corresponding to the first sample image, conducting mask prediction on partial texts in an acquired second sample image to obtain predicted text contents corresponding to the partial texts, training according to the complete prediction image and the predicted text contents to obtain a pre-training model, and generating a text recognition model according to the pre-training model, wherein the text recognition model is used for conducting text recognition on images to be recognized, so that the pre-training model learns strong image visual reasoning capability and text semantic reasoning capability, and accuracy and reliability of text recognition are improved when the text recognition model generated based on the pre-training model is used for conducting text recognition.

Description

Training method of text recognition model, and text recognition method and device

Technical Field

The present disclosure relates to the technical field of Artificial Intelligence (AI), and in particular, to the technical field of deep learning and computer vision, and may be applied to scenes such as Optical Character Recognition (OCR), and in particular, to a training method for a text Recognition model, a text Recognition method, and an apparatus thereof.

Background

OCR technology has gained wide attention and application in various industries such as education, finance, medical treatment, transportation and insurance.

In the related art, a text recognition model can be constructed in combination with OCR technology and deep learning to perform text recognition on an image based on the text recognition model.

However, the text recognition model generally relies on visual information to discriminate text content in an image based on the visual information, and there is a disadvantage that the accuracy of recognition is low.

Disclosure of Invention

The disclosure provides a training method of a text recognition model, a text recognition method and a text recognition device for improving the reliability of text recognition.

According to a first aspect of the present disclosure, there is provided a training method of a text recognition model, including:

performing mask prediction on a partial image in an acquired first sample image to obtain a complete prediction image corresponding to the first sample image;

performing the mask prediction on a part of the obtained text in the second sample image to obtain predicted text content corresponding to the part of the text;

and training according to the predicted complete image and the predicted text content to obtain a pre-training model, and generating a text recognition model according to the pre-training model, wherein the text recognition model is used for performing text recognition on the image to be recognized.

According to a second aspect of the present disclosure, there is provided a text recognition method including:

acquiring an image to be recognized, wherein the image to be recognized comprises a text;

performing text recognition on the image to be recognized based on a pre-trained text recognition model to obtain text content in the image to be recognized;

wherein the text recognition model is obtained based on the method according to the first aspect.

According to a third aspect of the present disclosure, there is provided a training apparatus for a text recognition model, comprising:

the prediction unit is used for performing mask prediction on the partial image in the acquired first sample image to obtain a complete prediction image corresponding to the first sample image;

the prediction unit is further configured to perform the mask prediction on a part of the acquired text in the second sample image to obtain a predicted text content corresponding to the part of the text;

the training unit is used for training according to the predicted complete image and the predicted text content to obtain a pre-training model;

and the generating unit is used for generating a text recognition model according to the pre-training model, wherein the text recognition model is used for performing text recognition on the image to be recognized.

According to a fourth aspect of the present disclosure, there is provided a text recognition apparatus comprising:

the device comprises an acquisition unit, a recognition unit and a recognition unit, wherein the acquisition unit is used for acquiring an image to be recognized, and the image to be recognized comprises a text;

the recognition unit is used for carrying out text recognition on the image to be recognized based on a pre-trained text recognition model to obtain text contents in the image to be recognized;

According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first or second aspect.

According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to the first or second aspect.

According to a seventh aspect of the present disclosure, there is provided a computer program product comprising: a computer program, stored in a readable storage medium, from which at least one processor of an electronic device can read the computer program, execution of the computer program by the at least one processor causing the electronic device to perform the method of the first or second aspect.

According to the technical scheme, the pre-training model is made to learn stronger image visual reasoning capability and text semantic reasoning capability by the technical scheme that the pre-training model is used for generating the text recognition model based on the pre-training model, so that the accuracy and the reliability of text recognition are improved when the text recognition model based on the pre-training model is used for recognizing the text.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;

FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;

FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;

FIG. 4 is a schematic illustration of a fourth embodiment according to the present disclosure;

FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;

FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;

FIG. 7 is a schematic diagram according to a seventh embodiment of the present disclosure;

FIG. 8 is a schematic diagram according to an eighth embodiment of the present disclosure;

FIG. 9 is a schematic illustration according to a ninth embodiment of the present disclosure;

fig. 10 is a block diagram of an electronic device for implementing a text recognition model training method and a text recognition method according to an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

When the text recognition model is constructed by combining the OCR technology and deep learning, the method can be realized by adopting a 'module separation' mode, and also can adopt an 'end-to-end model' mode.

The method of "module separation" is to construct a text detection module, an information extraction module and a text recognition module, so as to combine the three modules to construct a text recognition model.

If a 'module separation' mode is adopted, modules need to be constructed in advance and combined, the process is relatively complicated, the efficiency is relatively low, and the accuracy can be accumulated and superposed, so that the defect that the identification accuracy of the text identification model constructed based on the mode is relatively low is caused.

For example, the "end-to-end model" means that a predicted result is obtained from the input end to the output end, such as inputting an image at the input end and obtaining a predicted text content of the image at the output end.

However, the adoption of the end-to-end model requires data labeling, such as labeling of real text content of an image, and the data used for providing training is relatively effective, which results in a disadvantage that the reliability of the trained text recognition model is relatively low.

The text recognition model trained based on any one of the above methods usually only makes two types of judgments, and when different vertical classes have different types of field requirements, the text recognition model needs to be redesigned, especially the number of channels for classification, and the text recognition model needs to be retrained and cannot be reused.

For example, an image character detection model (EAST), a segmented character detection model (DB), a text detector (LOMO), and the like in the OCR technology can be generally used only for two types of judgment such as a document (text) class and a non-document class (non-text). If the identification requirement of fields in which the user is interested under a specific vertical category needs to be solved, the number of classification categories needs to be increased.

In some embodiments, a new text recognition model may be obtained by training in a manner of detecting the augmented category, for example, field classification may be performed by adding an additional language model on the basis of the original text recognition model.

For example, if the Text recognition model is an end-to-end Text detection and recognition (FOTS) and a Text detection plus recognition model (Mask Text spot) in the OCR technology, an additional language model such as Bidirectional Encoder Representation (BERT) is required to be added to obtain a new Text recognition model, and additional training is required to be added due to the addition of the additional language model, which results in disadvantages of higher training cost, lower efficiency, and the like.

To avoid at least one of the above technical problems, the inventors of the present disclosure have made creative efforts to obtain the inventive concept of the present disclosure: and training in an end-to-end model mode to obtain a pre-training model, namely performing end-to-end pre-training on a model base, and performing pre-training by combining visual dimensions and semantic dimensions to generate a text recognition model based on the base obtained by pre-training.

Based on the invention concept, the invention discloses a training method of a text recognition model, a text recognition method and a text recognition device, relates to the technical field of artificial intelligence, in particular to the technical field of deep learning and computer vision, and can be applied to scenes such as OCR (optical character recognition) and the like so as to improve the reliability of the text recognition model for text recognition.

Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure, and as shown in fig. 1, the method for training a text recognition model provided in this embodiment includes:

s101: and performing mask prediction on the partial image in the acquired first sample image to obtain a complete prediction image corresponding to the first sample image.

For example, the execution subject of this embodiment may be a training device of a text recognition model (hereinafter, simply referred to as a training device), and the training device may be a server (such as a cloud server, or a local server, or a server cluster), or may be a terminal device, or may be a computer, or may be a processor, or may be a chip, and the embodiment is not limited.

The mask prediction is to perform mask processing (or referred to as masking processing) on a partial image, text, or the like, and restore a complete image, text, or the like before the mask processing, that is, before the masking processing.

Accordingly, this step can be understood as: the method comprises the steps of obtaining a first sample image comprising a text, performing mask processing on a partial image of the first sample image, and predicting a complete first sample image (namely predicting a complete image) based on the image after the mask processing.

That is, this step may be understood as an image reconstruction task (mask image modifying) that performs image reconstruction on the first sample image in a manner that incorporates mask prediction.

S102: and performing mask prediction on part of the text in the acquired second sample image to obtain predicted text content corresponding to the part of the text.

In connection with the above analysis, this step can be understood as: acquiring a second sample image comprising a text, performing mask processing on a part of the text in the second sample image, and predicting text content of the part of the text processed by the mask (namely predicting the text content) based on the text processed by the mask.

That is, this step can be understood as a text reconstruction task (mask OCR modeling) to reconstruct the text of the second sample image in combination with the mask prediction, and specifically to reconstruct a part of the text in the second sample image.

It should be noted that the first sample image and the second sample image may be the same image or different images, and this embodiment is not limited thereto.

S103: and training according to the predicted complete image and the predicted text content to obtain a pre-training model, and generating a text recognition model according to the pre-training model.

The text recognition model is used for performing text recognition on the image to be recognized.

The pre-trained model may be understood as a base of the text recognition model or may be understood as a hidden layer of the text recognition model.

By combining the analysis, the pre-training model is obtained by training based on image reconstruction and text reconstruction, so that the pre-training model learns stronger image visual reasoning capability and text semantic reasoning capability, and the text recognition model generated based on the pre-training model has stronger accuracy and reliability.

In this embodiment, end-to-end model training may be implemented, that is, respective corresponding prediction results may be output directly based on the first sample image and the second sample image, for example, the prediction result corresponding to the first sample image is a predicted complete image, the prediction result corresponding to the second sample image is predicted text content, and there is no need to add other links, for example, a link of performing text detection on the second sample image based on manual operation or an OCR technology to obtain a text, so as to improve training efficiency and save training resources and cost.

Based on the above analysis, an embodiment of the present disclosure provides a training method for a text recognition model, including: performing mask prediction on a partial image in an acquired first sample image to obtain a predicted complete image corresponding to the first sample image, performing mask prediction on a partial text in an acquired second sample image to obtain predicted text content corresponding to the partial text, training according to the predicted complete image and the predicted text content to obtain a pre-training model, and generating a text recognition model according to the pre-training model, wherein the text recognition model is used for performing text recognition on an image to be recognized, in the embodiment, the predicted complete image corresponding to the first sample image is obtained through prediction based on the mask, the predicted text content of the partial text in the second sample image is obtained through prediction based on the mask, and the pre-training model is generated by combining the predicted complete image and the predicted text content to generate the technical characteristics of the text recognition model based on the pre-training model, the pre-training model learns stronger image visual reasoning capability and text semantic reasoning capability, so that the accuracy and reliability of text recognition are improved when the text recognition is carried out on the text recognition model generated based on the pre-training model.

Fig. 2 is a schematic diagram of a second embodiment of the present disclosure, and as shown in fig. 2, the method for training a text recognition model provided in this embodiment includes:

s201: and acquiring the target object.

Wherein the target object includes a first sample image and a second sample image.

It should be understood that, in order to avoid tedious statements, the technical features of the present embodiment that are the same as those of the above embodiments are not described in detail in this embodiment.

S202: and randomly covering partial objects in the target object, and predicting the covered partial objects in the target object according to the uncovered objects in the target object to obtain a prediction result.

If the target object is the first sample image, part of the target object is a partial image, and the prediction result is a predicted complete image.

And if the target object is the second sample image, part of the target object is a part of text, and the prediction result is the predicted text content.

In some embodiments, predicting a part of objects covered in the target object according to objects not covered in the target object, and obtaining a prediction result includes the following steps:

the first step is as follows: and extracting object features corresponding to objects which are not covered in the target object to obtain first object features.

The second step is as follows: and predicting the covered part of the target object according to the first object characteristic to obtain a prediction result.

And if the target object is the first sample image, the first object feature is the first visual feature. And if the target object is the second sample image, the first object feature is the first semantic feature.

S203: and training according to the predicted complete image and the predicted text content to obtain a pre-training model, and generating a text recognition model according to the pre-training model.

For the reader to more deeply understand the implementation principle of the present disclosure, the above-mentioned embodiment (the embodiment shown in fig. 1 and 2) is now explained in detail with reference to fig. 3.

Fig. 3 is a schematic diagram according to a third embodiment of the present disclosure, and as shown in fig. 3, the method for training a text recognition model provided in this embodiment includes:

s301: a first sample image is acquired.

Similarly, in order to avoid the tedious statements, the technical features of the present embodiment that are the same as those of the above embodiments are not described in detail in this embodiment.

S302: randomly masking a portion of the image in the first sample image.

It should be understood that training of the network model is generally an iterative training process, and in this embodiment, each iterative training is to randomly mask a partial image of the first sample image, so the number of the first sample images may be one, and of course, the number of the first sample images may also be multiple, and this embodiment is not limited.

S303: and predicting the covered partial image in the first sample image according to the uncovered image in the first sample image to obtain a predicted complete image.

For example, after randomly masking the first sample image, the partial image in the first sample image is masked, and the other partial image is not masked, the complete first sample image may be determined based on the unmasked image (i.e., the predicted complete image).

In the embodiment, the method determines the predicted complete image by combining a random covering and prediction mode, so that the uncertainty in the training process can be increased, and the reliability of restoring the complete image by the pre-training model obtained by training is improved.

Wherein S302-S303 may be implemented based on a mask auto-encoder (MAE). That is, the first sample image may be input to a mask self-encoder, outputting a predicted full image.

In some embodiments, S303 may include the steps of:

the first step is as follows: and extracting the visual features corresponding to the uncovered image in the first sample image to obtain the first visual features.

The visual features include texture features, contour features, color features, shape features, and the like, which are not listed here.

Accordingly, the first visual feature refers to the corresponding texture feature, contour feature, color feature, shape feature, and the like of the uncovered image in the first sample image.

The second step is as follows: and according to the first visual characteristic, predicting the covered partial image in the first sample image to obtain a predicted complete image.

In this embodiment, a predicted complete image is obtained by combining visual features such as texture features, contour features, color features, shape features and the like corresponding to an uncovered image, which is equivalent to obtaining a predicted complete image based on a visual context, so as to train and obtain a pre-training model capable of completing learning of context knowledge of visual cues.

In some embodiments, the second step may comprise the sub-steps of:

the first sub-step: and according to the first visual characteristic, predicting the visual characteristic corresponding to the covered partial image in the first sample image to obtain a second visual characteristic.

Exemplarily, in connection with the above analysis, this sub-step may be understood as: and predicting visual features such as texture features, contour features, color features and shape features corresponding to the shielded partial images according to the visual features such as texture features, contour features, color features and shape features corresponding to the non-shielded images.

The second substep: and determining the covered partial image in the first sample image according to the second visual characteristic.

Illustratively, after obtaining the visual features such as texture features, contour features, color features, and shape features corresponding to the covered partial image, the covered partial image may be supplemented and repaired based on the visual features.

The third substep: and generating a predicted complete image according to the uncovered image in the first sample image and the determined covered partial image in the first sample image.

By combining the analysis, after the covered partial image is supplemented and repaired, the covered partial image is restored, and the uncovered partial image and the restored covered partial image are spliced to obtain a predicted complete image, namely, a first sample image is restored, so that the predicted complete image is highly attached to the first sample image, and the accuracy and reliability of the predicted complete image are improved.

S304: a second sample image is acquired.

As can be seen from the above analysis, the first sample image and the second sample image may be the same image, and accordingly, if the first sample image and the second sample image are the same image, this step may be omitted.

S305: randomly masking portions of the text in the second sample image.

Similarly, the training of the network model is usually an iterative training process, and in this embodiment, each iterative training is a partial text that randomly covers the second sample image, so the number of the second sample images may be one, and certainly, the number of the second sample images may also be multiple, which is not limited in this embodiment.

For example, a partial word, a partial sentence, or the like in the second sample image may be randomly masked.

S306: and predicting the covered part of the text in the second sample image according to the uncovered text in the second sample image to obtain the predicted text content.

For example, after randomly masking the second sample image, a portion of the text in the second sample image is masked, and another portion of the text is not masked, the text content of the masked portion of the text may be determined based on the non-masked text (i.e., the predicted text content).

In the embodiment, by determining the text content in combination with a "stochastic masking + prediction" manner, the uncertainty in the training process can be increased, thereby improving the reliability of the pre-training model obtained by training to restore the complete image.

S305-S306 may be implemented based on Mask Language Model (MLM). That is, the second sample image may be input to the mask language model, outputting the predicted text content.

In some embodiments, S306 may include the steps of:

the first step is as follows: and extracting semantic features corresponding to the uncovered text in the second sample image to obtain the first semantic features.

The semantic features refer to features of logical relations between character strings. Accordingly, the first semantic feature may be understood as a feature of a logical relationship between character strings included in the text that is not covered, and may also be understood as a feature of an association relationship between characters (words and/or phrases) in the text that is not covered.

The second step is as follows: and predicting the covered part of the text in the second sample image according to the first semantic features to obtain predicted text content.

In this embodiment, by combining visual features such as logical relationships between character strings corresponding to an uncovered text, predicted text content is obtained, which is equivalent to obtaining predicted text content based on semantic context, so as to train and obtain a pre-training model capable of completing context knowledge learning of semantic clues.

In some embodiments, the second step may comprise the sub-steps of:

the first substep: and predicting semantic features corresponding to the covered partial text in the second sample image according to the first semantic features to obtain second semantic features.

Exemplarily, in connection with the above analysis, this sub-step may be understood as: and predicting semantic features such as the features of the logical relationship between the character strings and the like corresponding to the covered partial text according to the semantic features such as the features of the logical relationship between the character strings and the like corresponding to the uncovered text.

The second substep: and generating the predicted text content according to the second semantic features.

For example, after obtaining semantic features such as features of logical relationships between character strings corresponding to the uncovered text, the semantic features of the covered partial text may be supplemented and repaired based on the semantic features.

By combining the analysis, after the semantic features of the covered partial text are supplemented and repaired, the semantic features of the covered partial text are restored, and the text content (namely, the predicted text content) corresponding to the semantic features can be determined, so that the predicted text content is highly attached to the text content of the covered partial text, and the accuracy and reliability of the predicted text content are improved.

S307: and training according to the predicted complete image and the predicted text content to obtain a pre-training model, and generating a text recognition model according to the pre-training model.

Fig. 4 is a schematic diagram of a fourth embodiment of the present disclosure, and as shown in fig. 4, the method for training a text recognition model provided in this embodiment includes:

s401: and performing mask prediction on the partial image in the acquired first sample image to obtain a complete prediction image corresponding to the first sample image.

S402: and performing mask prediction on part of the text in the acquired second sample image to obtain predicted text content corresponding to the part of the text.

S403: and training according to the predicted complete image and the predicted text content to obtain a pre-training model.

For example, the underlying network model may be trained based on the predicted full image and the predicted text content to obtain a pre-trained model.

For example, model parameters of the underlying network model may be adjusted based on the predicted full image and the predicted text content to arrive at a pre-trained model.

The basic network model may be a Vision converter (ViT), a neural network model (backhaul), such as a convolutional neural network model (CNN), or another network model, and this embodiment is not limited.

S404: and acquiring a task to be identified and a training image.

Wherein the training image includes text.

The task to be recognized may be determined based on the recognition requirement of the text recognition model, for example, the task to be recognized may be a text detection task, a text recognition task, a field classification task, or other recognition tasks, which are not listed one by one here.

S405: and training the pre-training model according to the task to be recognized and the training image to obtain a text recognition model.

By combining the analysis, the pre-training model has a model for finishing the context knowledge learning of the visual clues and a model for the context knowledge learning of the semantic clues, namely the pre-training model is a multi-mode feature extraction base, so that the text recognition model obtained by combining the pre-training model training has the context knowledge recognition capability based on the visual clues and the context knowledge recognition capability based on the semantic clues.

And the pre-training model is trained by combining the task to be recognized, so that the text recognition model corresponding to the pre-training model can be trained based on different recognition requirements, the flexibility and diversity of the text recognition model obtained by training are improved, and the method can be widely applied to various recognition scenes and meet different recognition requirements.

In some embodiments, the pre-trained model (i.e., the multi-modal feature extraction base) may be loaded to a Text detection network model (EAST), a segmentation-based Text detection network (DB), a Text detection network (LOMO) and the like to implement the Text detection task of the Text recognition model; for another example, the pre-training model may be loaded to a Convolutional Recurrent Neural Network (CRNN), where the Convolutional Recurrent Neural Network may use a connection dominant time Classification (CTC) decoding manner, an Attention mechanism (Attention) decoding manner, a converter (transformer) decoding method, or the like, to implement a text recognition task of the text recognition model; as another example, the pre-trained model may be loaded to a Fully Connected network model (FC), or a Convolutional Neural network model (CNN), to perform the field classification task of the text recognition model.

In some embodiments, S405 may include the steps of:

the first step is as follows: and inputting the training images into a pre-training model to obtain Multi-modal Feature Maps (Multi-modal Feature Maps) corresponding to the training images.

In conjunction with the above analysis, a multi-modal feature map is used to characterize the features in multiple dimensions of the training image, such as features in visual dimensions and features in semantic dimensions. For example, a multi-modal feature map can be used to characterize corresponding image features and semantic features of the training image.

In some embodiments, the multi-modal profile may be represented as (d x h w), where d represents the number of signature channels, and h and w represent the height and width of the multi-modal profile.

The second step is as follows: and generating a text recognition model according to the task to be recognized and the multi-modal feature map.

In this embodiment, the multi-modal feature map can represent the features of the training image from multiple dimensions, so that the visual features of the training image can be represented, the semantic features of the training image can be represented, and the represented visual features and semantic features have strong reliability and comprehensiveness, so that the text recognition model generated by combining the multi-modal feature map has strong reliability and accuracy.

In some embodiments, the second step may comprise the sub-steps of:

the first sub-step: and predicting a prediction recognition result of the training image under the task to be recognized according to the multi-modal feature map.

For example, the multi-modal feature map may be input to a convolutional recurrent neural network to obtain a predictive recognition result (e.g., a predictive text result).

The second substep: and constructing a text recognition model according to a real recognition result and a predicted recognition result preset by the training image.

The real recognition result may be obtained by labeling the training image in advance, and the labeling mode is not limited in this embodiment, for example, the real recognition result may be a manual labeling mode or an automatic labeling mode.

For example, a loss value between the real recognition result and the predicted recognition result may be calculated, and if the loss value is greater than (or equal to) a preset loss threshold, the iteration is performed for training, otherwise, if the loss value is less than the preset loss threshold, the text recognition model is built, or if the iteration number reaches a preset iteration number, the text recognition model is built.

For example, if a text recognition model for text recognition of a train ticket needs to be trained, the training image is a train ticket image, the train ticket image is input into a pre-training model, a multi-mode feature map of the train ticket image is output, the multi-mode feature map is input into a convolution cyclic neural network, a prediction recognition result such as "date, train number, seat number" in the train ticket image is output, the prediction recognition result is compared with a pre-labeled "date, train number, seat number" (namely, a real recognition result) to obtain a text recognition model through training, and the text recognition model obtained through training can be used for recognizing text contents of "date, train number, seat number" in the train ticket image to be recognized.

Fig. 5 is a schematic diagram of a fifth embodiment according to the present disclosure, and as shown in fig. 5, the text recognition method provided in this embodiment includes:

s501: and acquiring an image to be identified.

Wherein the image to be recognized comprises a text.

For example, the executing subject of this embodiment may be a text recognition device, and the text recognition device may be the same device as the training device or a device different from the training device, and this embodiment is not limited.

S502: and performing text recognition on the image to be recognized based on a pre-trained text recognition model to obtain text content in the image to be recognized.

The text recognition model is obtained based on the training method of the text recognition model according to any one of the embodiments.

In some embodiments, S502 may include the steps of:

the first step is as follows: and determining a multi-modal feature map of the image to be recognized according to the text recognition model.

The second step is as follows: and determining text content in the image to be recognized according to the multi-modal feature map.

Wherein the multimodal feature map of the image to be recognized is used to characterize: and visual characteristics and semantic characteristics of the image to be recognized.

For example, in combination with the above analysis, the text recognition model includes a pre-trained model, and if the text recognition model is obtained by loading the pre-trained model to the convolutional recurrent neural network for training, that is, the text recognition model further includes the convolutional recurrent neural network, this embodiment may be understood as:

and inputting the image to be recognized into the pre-training model, outputting a multi-modal characteristic diagram, inputting the multi-modal characteristic diagram into the convolution circulation neural network, and outputting text contents in the image to be recognized.

Fig. 6 is a schematic diagram of a sixth embodiment according to the present disclosure, and as shown in fig. 6, the training apparatus 600 for text recognition model provided in this embodiment includes:

the prediction unit 601 is configured to perform mask prediction on the partial image in the acquired first sample image to obtain a complete prediction image corresponding to the first sample image.

The prediction unit 601 is further configured to perform mask prediction on a portion of the acquired second sample image to obtain a predicted text content corresponding to the portion of the text.

And the training unit 602 is configured to train to obtain a pre-training model according to the predicted complete image and the predicted text content.

The generating unit 603 is configured to generate a text recognition model according to the pre-training model, where the text recognition model is used to perform text recognition on the image to be recognized.

Fig. 7 is a schematic diagram of a seventh embodiment according to the present disclosure, and as shown in fig. 7, the training apparatus 700 for a text recognition model provided in this embodiment includes:

a prediction unit 701, configured to perform mask prediction on a partial image in the acquired first sample image to obtain a complete prediction image corresponding to the first sample image.

The prediction unit 701 is further configured to perform mask prediction on a part of the acquired text in the second sample image, so as to obtain predicted text content corresponding to the part of the text.

In some embodiments, in conjunction with fig. 7, the prediction unit 701 includes:

and a cover masking unit 7011 for randomly masking a part of the target object.

And a predicting subunit 7012, configured to predict, according to an object that is not covered in the target object, a part of the covered object in the target object, so as to obtain a prediction result.

If the target object is a first sample image, part of the target object is a partial image, and the prediction result is a prediction complete image; and if the target object is the second sample image, part of the target object is a part of text, and the prediction result is the predicted text content.

In some embodiments, predictor 7012 includes:

and the extraction module is used for extracting the object features corresponding to the objects which are not covered in the target object to obtain the first object features.

And the prediction module is used for predicting the covered part of the target object according to the first object characteristic to obtain a prediction result.

If the target object is a first sample image, the first object feature is a first visual feature; if the target object is the second sample image, the first object feature is the first semantic feature.

In some embodiments, the target object is a first sample image, the first object feature is a first visual feature; a prediction module comprising:

and the first prediction sub-module is used for predicting the visual characteristics corresponding to the covered partial images in the first sample image according to the first visual characteristics to obtain second visual characteristics.

And the first determining sub-module is used for determining the covered partial image in the first sample image according to the second visual characteristic.

And the first generation submodule is used for generating a complete prediction image according to the uncovered image in the first sample image and the determined covered partial image in the first sample image.

In some embodiments, the target object is a second sample image, the first object feature is a first semantic feature; a prediction module comprising:

and the second prediction submodule is used for predicting semantic features corresponding to the covered part of the text in the second sample image according to the first semantic features to obtain second semantic features.

And the second generation submodule is used for generating the predicted text content according to the second semantic features.

And the training unit 702 is configured to train to obtain a pre-training model according to the predicted complete image and the predicted text content.

The generating unit 703 is configured to generate a text recognition model according to the pre-training model, where the text recognition model is used to perform text recognition on the image to be recognized.

In conjunction with fig. 7, in some embodiments, the generating unit 703 includes:

the obtaining subunit 7031 is configured to obtain a task to be identified and a training image, where the training image includes a text.

And the training subunit 7032 is configured to train the pre-training model according to the task to be recognized and the training image, so as to obtain a text recognition model.

In some embodiments, training subunit 7032 includes:

and the input module is used for inputting the training images into the pre-training model to obtain the multi-modal characteristic diagram corresponding to the training images.

And the generating module is used for generating a text recognition model according to the task to be recognized and the multi-modal characteristic diagram.

In some embodiments, a module is generated comprising:

and the third prediction sub-module is used for predicting the prediction recognition result of the training image under the task to be recognized according to the multi-modal feature map.

And the construction submodule is used for constructing the recognition model according to the real recognition result and the predicted recognition result preset by the training image.

Fig. 8 is a schematic diagram of an eighth embodiment of the present disclosure, and as shown in fig. 8, the present embodiment provides a text recognition apparatus 800, including:

an obtaining unit 801, configured to obtain an image to be recognized, where the image to be recognized includes a text.

The recognition unit 802 is configured to perform text recognition on the image to be recognized based on a pre-trained text recognition model, so as to obtain text content in the image to be recognized.

The text recognition model is obtained based on the training method of the text recognition model as described in any of the above embodiments.

As can be seen in fig. 8, in some embodiments, the identifying unit 802 includes:

the first determining unit 8021 is configured to determine a multi-modal feature map of the image to be recognized according to the text recognition model.

A second determining unit 8022, configured to determine text content in the image to be recognized according to the multi-modal feature map.

The multi-modal feature map of the image to be recognized is used for representing: and visual characteristics and semantic characteristics of the image to be recognized.

Fig. 9 is a schematic diagram according to a ninth embodiment of the present disclosure, and as shown in fig. 9, an electronic device 900 in the present disclosure may include: a processor 901 and a memory 902.

A memory 902 for storing programs; the Memory 902 may include a volatile Memory (RAM), such as a Static Random Access Memory (SRAM), a Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), and the like; the memory may also comprise a non-volatile memory, such as a flash memory. The memory 902 is used to store computer programs (e.g., applications, functional modules, etc. that implement the methods described above), computer instructions, etc., which may be stored in partitions in the one or more memories 902. And the above-described computer programs, computer instructions, data, and the like can be called by the processor 901.

The computer programs, computer instructions, etc. described above may be stored in one or more memories 902 in partitions. And the above-mentioned computer program, computer instruction, etc. can be called by the processor 901.

A processor 901 for executing the computer program stored in the memory 902 to implement the steps of the method according to the above embodiments.

Reference may be made in particular to the description relating to the previous method embodiments.

The processor 901 and the memory 902 may be separate structures or may be an integrated structure integrated together. When the processor 901 and the memory 902 are separate structures, the memory 902, the processor 901 may be coupled through a bus 903.

The electronic device of this embodiment may execute the technical solution in the method, and the specific implementation process and the technical principle are the same, which are not described herein again.

In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

According to an embodiment of the present disclosure, the present disclosure also provides a computer program product comprising: a computer program, stored in a readable storage medium, from which at least one processor of the electronic device can read the computer program, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any of the embodiments described above.

FIG. 10 shows a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 10, the apparatus 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM1003, various programs and data necessary for the operation of the device 1000 can be stored. The calculation unit 1001, ROM 1002, and RAM1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.

A number of components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

Computing unit 1001 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 1001 executes the respective methods and processes described above, such as a training method of a text recognition model, a text recognition method. For example, in some embodiments, the training method of the text recognition model, the text recognition method, may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1000 via ROM 1002 and/or communications unit 1009. When the computer program is loaded into the RAM1003 and executed by the computing unit 1001, one or more steps of the training method of the text recognition model, the text recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured by any other suitable means (e.g. by means of firmware) to perform a training method of the text recognition model, a text recognition method.

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server can be a cloud Server, also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service ("Virtual Private Server", or simply "VPS"). The server may also be a server of a distributed system, or a server incorporating a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A training method of a text recognition model is characterized by comprising the following steps:

performing the mask prediction on a part of text in an acquired second sample image to obtain predicted text content corresponding to the part of text, wherein the first sample image and the second sample image are different images;

2. The method of claim 1, wherein the image in the first sample image and the text in the second sample image are characterized by different content.

3. The method of claim 1, wherein the generating a text recognition model from the pre-trained model comprises:

acquiring a task to be recognized and a training image, wherein the training image comprises a text;

and training the pre-training model according to the task to be recognized and the training image to obtain the text recognition model.

4. The method of claim 3, wherein the training the pre-training model according to the task to be recognized and the training image to obtain the text recognition model comprises:

inputting the training images into the pre-training model to obtain a multi-modal feature map corresponding to the training images, wherein the multi-modal feature map is used for representing image features and semantic features corresponding to the training images;

and generating the text recognition model according to the task to be recognized and the multi-modal feature map.

5. The method of claim 4, wherein the generating the text recognition model from the task to be recognized and the multi-modal feature map comprises:

predicting a prediction recognition result of the training image under the task to be recognized according to the multi-modal feature map;

and constructing the text recognition model according to a real recognition result preset by the training image and the prediction recognition result.

6. The method of any of claims 1-5, wherein the mask prediction comprises:

randomly covering a part of the target objects;

predicting a part of objects which are covered in the target object according to objects which are not covered in the target object to obtain a prediction result;

if the target object is a first sample image, part of the target object is a partial image, and the prediction result is the predicted complete image; if the target object is a second sample image, part of the target object is a part of text, and the prediction result is the content of the predicted text.

7. The method of claim 6, wherein predicting the covered part of the target objects according to the uncovered part of the target objects to obtain a prediction result comprises:

extracting object features corresponding to uncovered objects in the target object to obtain first object features;

predicting a part of objects covered in the target object according to the first object characteristics to obtain a prediction result;

if the target object is a first sample image, the first object feature is a first visual feature; and if the target object is a second sample image, the first object feature is a first semantic feature.

8. The method of claim 7, wherein the target object is a first sample image, the first object feature is a first visual feature; predicting the covered part of the target object according to the first object characteristic to obtain a prediction result, wherein the prediction result comprises

According to the first visual feature, predicting the visual feature corresponding to the covered partial image in the first sample image to obtain a second visual feature;

determining a masked partial image in the first sample image according to the second visual characteristic;

and generating the predicted complete image according to the uncovered image in the first sample image and the determined covered partial image in the first sample image.

9. The method of claim 7 or 8, wherein the target object is a second sample image, the first object feature is a first semantic feature; the predicting the covered part of the target object according to the first object feature to obtain the prediction result comprises the following steps:

according to the first semantic features, predicting semantic features corresponding to the covered partial text in the second sample image to obtain second semantic features;

and generating the predicted text content according to the second semantic features.

10. A text recognition method, comprising:

wherein the text recognition model is obtained based on the method according to any one of claims 1 to 9.

11. The method of claim 10, wherein performing text recognition on the image to be recognized based on a pre-trained text recognition model to obtain text content in the image to be recognized comprises:

determining a multi-modal feature map of the image to be recognized according to the text recognition model, and determining text content in the image to be recognized according to the multi-modal feature map;

wherein the multi-modal feature map of the image to be recognized is used for characterizing: and the visual characteristic and the semantic characteristic of the image to be recognized.

12. An apparatus for training a text recognition model, comprising:

the prediction unit is used for performing mask prediction on a partial image in the acquired first sample image to obtain a complete prediction image corresponding to the first sample image;

the prediction unit is further configured to perform the mask prediction on a part of the acquired text in a second sample image to obtain predicted text content corresponding to the part of the text, where the first sample image and the second sample image are different images;

13. The apparatus of claim 12, wherein the image in the first sample image and the text in the second sample image are characterized by different content.

14. The apparatus of claim 12, wherein the generating unit comprises:

the device comprises an acquisition subunit, a recognition unit and a recognition unit, wherein the acquisition subunit is used for acquiring a task to be recognized and a training image, and the training image comprises a text;

and the training subunit is used for training the pre-training model according to the task to be recognized and the training image to obtain the text recognition model.

15. The apparatus of claim 14, wherein the training subunit comprises:

the input module is used for inputting the training images into the pre-training model to obtain a multi-modal feature map corresponding to the training images, wherein the multi-modal feature map is used for representing image features and semantic features corresponding to the training images;

and the generating module is used for generating the text recognition model according to the task to be recognized and the multi-modal feature map.

16. The apparatus of claim 15, wherein the means for generating comprises:

the third prediction sub-module is used for predicting a prediction recognition result of the training image under the task to be recognized according to the multi-modal feature map;

and the construction sub-module is used for constructing the text recognition model according to a preset real recognition result of the training image and the predicted recognition result.

17. The apparatus according to any one of claims 12-16, wherein the prediction unit comprises:

a covering subunit, configured to cover part of the target objects randomly;

the predicting subunit is used for predicting a part of objects covered in the target object according to objects not covered in the target object to obtain a prediction result;

18. The apparatus of claim 17, wherein the predictor subunit comprises:

the extraction module is used for extracting object features corresponding to objects which are not covered in the target object to obtain first object features;

the prediction module is used for predicting a part of objects covered in the target object according to the first object characteristics to obtain a prediction result;

19. The apparatus of claim 18, wherein the target object is a first sample image, the first object feature is a first visual feature; the prediction module comprises:

the first prediction sub-module is used for predicting visual characteristic second visual characteristics corresponding to covered partial images in the first sample image according to the first visual characteristics;

a first determining sub-module, configured to determine a masked partial image in the first sample image according to the second visual feature;

and the first generation submodule is used for generating the predicted complete image according to the uncovered image in the first sample image and the determined covered partial image in the first sample image.

20. The apparatus of claim 18 or 19, wherein the target object is a second sample image, the first object feature is a first semantic feature; the prediction module comprises:

the second prediction sub-module is used for predicting semantic features corresponding to the covered part of text in the second sample image according to the first semantic features to obtain second semantic features;

21. A text recognition apparatus, comprising:

the device comprises an acquisition unit, a recognition unit and a processing unit, wherein the acquisition unit is used for acquiring an image to be recognized, and the image to be recognized comprises a text;

wherein the text recognition model is derived based on the method according to any one of claims 1-9.

22. The apparatus of claim 21, wherein the identifying unit comprises:

the first determining unit is used for determining a multi-modal feature map of the image to be recognized according to the text recognition model;

the second determining unit is used for determining text content in the image to be recognized according to the multi-modal feature map;

23. An electronic device, comprising:

at least one processor; and

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9; or to enable the at least one processor to perform the method of claim 10 or 11.

24. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9; alternatively, the computer instructions are for causing the computer to perform the method of claim 10 or 11.