CN114596566B - Text recognition method and related device

Text recognition method and related device

Info

Publication number
CN114596566B
Authority
CN
China
Prior art keywords
text
image
model
target
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210402933.8A
Other languages
Chinese (zh)
Other versions
CN114596566A (en)
Inventor
姜媚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210402933.8A priority Critical patent/CN114596566B/en
Publication of CN114596566A publication Critical patent/CN114596566A/en
Application granted granted Critical
Publication of CN114596566B publication Critical patent/CN114596566B/en
Priority to PCT/CN2023/076411 priority patent/WO2023202197A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/263 Language identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Abstract

The application relates to the field of computer technology, and in particular to the field of artificial intelligence, and provides a text recognition method and a related device for improving text recognition accuracy. The method comprises the following steps: inputting an image to be recognized into a target classification model to obtain language distribution information and an original text presentation direction; performing image correction on the image to be recognized based on the original text presentation direction to obtain a target recognition image; determining, from the target recognition image, the text region image sets corresponding to the respective languages; and finally, based on each text region image set, obtaining the text recognition result by using the target text recognition model associated with the corresponding language. By accurately predicting the language distribution information and the text presentation direction, text recognition accuracy is improved.

Description

Text recognition method and related device
Technical Field
The application relates to the technical field of computers, and provides a text recognition method and a related device.
Background
With the continuous development of computer technology, image-based text recognition technology is widely applied, and image-based text recognition technology refers to recognition of text information contained in an image to be recognized.
In the related art, a target language corresponding to an image to be recognized is determined, and then text information included in the image to be recognized is determined according to a recognition model corresponding to the target language.
However, with the above text recognition method, text recognition cannot be performed on an image containing characters of a plurality of languages. In addition, once the target language is recognized incorrectly, the text recognition result is directly affected, resulting in low recognition accuracy.
Disclosure of Invention
The embodiment of the application provides a text recognition method and a related device, which are used for improving the accuracy of text recognition.
In a first aspect, an embodiment of the present application provides a text recognition method, including:
inputting an image to be recognized containing a text into a target classification model, and obtaining corresponding language distribution information and an original text presenting direction, wherein the language distribution information contains a plurality of languages corresponding to the text and text position information corresponding to the languages;
based on the original text presenting direction and a preset target character presenting direction, carrying out image correction on the image to be recognized, and taking the corrected image to be recognized as a target recognition image;
determining a text region image set corresponding to each of the plurality of languages from the target recognition image based on the obtained text position information;
and respectively adopting target text recognition models associated with corresponding languages based on the obtained text region image sets to obtain text recognition results corresponding to the images to be recognized.
In a second aspect, an embodiment of the present application provides a text recognition apparatus, including:
the image classification unit is used for inputting an image to be recognized containing a text into a target classification model to obtain corresponding language distribution information and an original text presentation direction, wherein the language distribution information contains a plurality of languages corresponding to the text and text position information corresponding to the languages;
the image correction unit is used for correcting the image to be recognized based on the original text presenting direction and a preset target character presenting direction, and taking the corrected image to be recognized as a target recognition image;
a text positioning unit, configured to determine, from the target recognition image, a text region image set corresponding to each of the plurality of languages based on the obtained text position information;
and the image recognition unit is used for obtaining a text recognition result corresponding to the image to be recognized by respectively adopting the target text recognition models associated with the corresponding languages based on the obtained text region image sets.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor and a memory, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to execute the steps of the text recognition method.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which includes a computer program and is configured to, when the computer program runs on an electronic device, cause the electronic device to execute the steps of the text recognition method.
In a fifth aspect, the present application provides a computer program product, where the computer program product includes a computer program, where the computer program is stored in a computer-readable storage medium, and a processor of an electronic device reads and executes the computer program from the computer-readable storage medium, so that the electronic device executes the steps of the text recognition method.
In the embodiment of the application, an image to be recognized is input into a target classification model, corresponding language distribution information and an original text presenting direction are obtained, then, based on the original text presenting direction and a preset target character presenting direction, the image to be recognized is subjected to image correction, the corrected image to be recognized is used as a target recognition image, then, based on obtained text position information, a text region image set corresponding to each of a plurality of languages is determined from the target recognition image, and finally, based on each text region image set, a target text recognition model associated with the corresponding language is adopted respectively, and a text recognition result corresponding to the image to be recognized is obtained.
Therefore, on one hand, the languages contained in the text to be recognized can be located through the language distribution information, which alleviates the problem of multi-language mixed typesetting in the image to a certain extent; on the other hand, performing image correction in combination with the text presentation direction improves both text recognition efficiency and recognition accuracy. In addition, by accurately predicting the language distribution information and the text presentation direction, the images of the text regions can be correctly dispatched to the recognition models of the corresponding languages, which further improves text recognition accuracy.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a text recognition method provided in an embodiment of the present application;
FIG. 3A is a diagram illustrating various languages provided in an embodiment of the present application;
FIG. 3B is a diagram illustrating directions of text presentation provided in an embodiment of the present application;
fig. 4 is a schematic diagram of obtaining language distribution information and an original text presentation direction provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of an image rectification process provided in an embodiment of the present application;
FIG. 6 is a schematic illustration of a target recognition image provided in an embodiment of the present application;
FIG. 7 is a logic diagram of a text recognition method provided in an embodiment of the present application;
fig. 8A is a schematic structural diagram of a target text line detection model provided in an embodiment of the present application;
FIG. 8B is a schematic illustration of a shape correction process provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a target classification model provided in an embodiment of the present application;
FIG. 10A is a flowchart illustrating a language identification submodel training method provided in an embodiment of the present application;
FIG. 10B is a schematic diagram of the logic for determining the first model loss provided in an embodiment of the present application;
FIG. 11A is a schematic flowchart of a direction recognition submodel training method provided in an embodiment of the present application;
FIG. 11B is a schematic diagram of the logic for determining the second model loss provided in an embodiment of the present application;
FIG. 11C is a schematic illustration of contrast loss and cross-entropy loss provided in an embodiment of the present application;
FIG. 12 is a diagram illustrating two text recognition results provided in an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a pre-training language identification model provided in an embodiment of the present application;
fig. 14 is a schematic structural diagram of a graph-text distance recognition model provided in an embodiment of the present application;
FIG. 15 is a schematic flowchart of a text recognition model training method provided in an embodiment of the present application;
FIG. 16 is a schematic illustration of the spatial distribution characteristics of Thai provided in an embodiment of the present application;
fig. 17A is a schematic diagram of a SAR and CTC based text recognition model provided in an embodiment of the present application;
FIG. 17B is a diagram illustrating several text recognition results provided in the example of the present application;
FIG. 18A is a logic diagram of a data synthesis method provided in an embodiment of the present application;
FIG. 18B is a schematic illustration of a synthesized annotated sample provided in an embodiment of the present application;
FIG. 19 is a logical illustration of a data text style migration provided in an embodiment of the present application;
FIG. 20 is a sample diagram of a text style migration sample provided in an embodiment of the present application;
fig. 21 is a schematic structural diagram of a text recognition apparatus provided in an embodiment of the present application;
fig. 22 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments described in the present application are within the scope of the protection of the present application.
Some concepts related to the embodiments of the present application are described below.
Text detection: the location of the text in the image is located.
Text recognition: and converting the image obtained by text detection into a text recognition result.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how a computer simulates or implements human learning behaviour so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
The solution provided in the embodiments of the present application relates to the machine learning technology of artificial intelligence. The embodiments of the present application mainly involve the training of the text detection model, the classification model and the text recognition model, and the corresponding model application processes. Specifically, after the image to be recognized is obtained through the text detection model, the image to be recognized is input into the target classification model to obtain the corresponding language distribution information and the original text presentation direction; then, image correction is performed on the image to be recognized based on the original text presentation direction and a preset target character presentation direction; then, based on the obtained text position information, the text region image sets corresponding to the plurality of languages are determined from the target recognition image; and finally, based on each text region image set, the target text recognition model associated with the corresponding language is adopted to obtain the text recognition result corresponding to the image to be recognized. The detailed model training process is described below and is not repeated here.
It should be noted that, in the embodiment of the present application, the model training process may adopt offline training or online training, which is not limited to this.
With the continuous development of computer technology, image-based text recognition technology is widely applied, and image-based text recognition technology refers to recognition of text information contained in an image to be recognized.
In the related art, a target language corresponding to an image to be recognized is determined, and then text information included in the image to be recognized is determined according to a recognition model corresponding to the target language.
However, with the above text recognition method, text recognition cannot be performed on an image containing characters of a plurality of languages. In addition, once the target language is recognized incorrectly, the text recognition result is directly affected, resulting in low recognition accuracy.
In the embodiment of the application, an image to be recognized is input into a target classification model, corresponding language distribution information and an original text presenting direction are obtained, then, based on the original text presenting direction and a preset target character presenting direction, the image to be recognized is subjected to image correction, the corrected image to be recognized is used as a target recognition image, then, based on obtained text position information, a text region image set corresponding to each of a plurality of languages is determined from the target recognition image, and finally, based on each text region image set, a target text recognition model associated with the corresponding language is adopted respectively, and a text recognition result corresponding to the image to be recognized is obtained.
Therefore, on one hand, the languages contained in the text to be recognized can be located through the language distribution information, which alleviates the problem of multi-language mixed typesetting in the image to a certain extent; on the other hand, performing image correction in combination with the text presentation direction improves both text recognition efficiency and recognition accuracy. In addition, by accurately predicting the language distribution information and the text presentation direction, the images of the text regions can be correctly dispatched to the recognition models of the corresponding languages, which further improves text recognition accuracy.
The preferred embodiments of the present application will be described below in conjunction with the accompanying drawings. It should be understood that the preferred embodiments described herein are only used to illustrate and explain the present application and are not intended to limit it, and the embodiments and the features of the embodiments may be combined with each other without conflict.
Fig. 1 is a schematic diagram of an application scenario provided in the embodiment of the present application. The application scenario includes at least a terminal device 110 and a server 120. The number of the terminal devices 110 may be one or more, the number of the servers 120 may also be one or more, and the number of the terminal devices 110 and the number of the servers 120 are not particularly limited in the present application. In the embodiment of the present application, a client related to text recognition may be installed on the terminal device 110, and the server 120 may be a server related to data processing. In addition, the client in the present application may be software, or may also be a web page, an applet, and the like, and the server is a background server corresponding to the software, or the web page, the applet, and the like, or a server specially used for data processing and the like, which is not limited in this application.
In this embodiment of the application, the terminal device 110 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, an internet of things device, a smart home appliance, a vehicle-mounted terminal, and the like, but is not limited thereto.
The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The terminal device 110 and the server 120 may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
It should be noted that the text recognition method in the embodiment of the present application may be executed by the server or the terminal device alone, or may be executed by both the server and the terminal device.
For example, a terminal device inputs an image to be recognized into a target classification model to obtain corresponding language distribution information and an original text presenting direction, then performs image correction on the image to be recognized based on the original text presenting direction and a preset target character presenting direction, uses the corrected image to be recognized as a target recognition image, then determines a text region image set corresponding to each language from the target recognition image based on obtained text position information, and finally obtains a text recognition result corresponding to the image to be recognized by respectively adopting target text recognition models associated with the corresponding languages based on the text region image sets. Alternatively, the text recognition process described above is performed by the server.
Alternatively, the terminal device obtains the image to be recognized in response to a text recognition operation for the image, and the server then inputs the image to be recognized into the target classification model to obtain the corresponding language distribution information and the original text presentation direction; performs image correction on the image to be recognized based on the original text presentation direction and the preset target character presentation direction, and takes the corrected image as the target recognition image; determines, from the target recognition image, the text region image sets corresponding to the plurality of languages based on the obtained text position information; and finally, based on each text region image set, adopts the target text recognition model associated with the corresponding language to obtain the text recognition result corresponding to the image to be recognized. The present application is not specifically limited in this respect.
It should be noted that, in the embodiment of the present application, the text recognition method may be applied to any scene where characters in an image need to be extracted, for example, but not limited to, picture character extraction, scan translation, picture translation, reading, retrieval of literature, sorting of letters and packages, editing and collating of manuscripts, summarizing and analyzing of a large number of statistical reports and cards, statistical summarization of commodity invoices, recognition of commodity codes, management of commodity warehouses, and the like.
The text recognition method provided by the exemplary embodiment of the present application is described below with reference to the accompanying drawings in conjunction with the application scenarios described above, it should be noted that the application scenarios described above are only shown for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect.
Referring to fig. 2, a schematic flowchart of a possible text recognition method provided in an embodiment of the present application is shown, where the method may be applied to an electronic device, where the electronic device may be a terminal device or a server, and the specific flow is as follows:
s201, inputting an image to be recognized containing a text into a target classification model, and obtaining corresponding language distribution information and an original text presenting direction, wherein the language distribution information contains a plurality of languages corresponding to the text and text position information corresponding to the plurality of languages.
In the embodiment of the present application, the target classification model may also be referred to as a language/direction classification model (LOPN) of a multitask architecture. The target classification model can predict language distribution and predict text presentation direction of the image to be recognized.
The languages may include several of the following: Chinese, Japanese, Korean, Latin, Thai, Arabic, Hindi, and symbol identifiers, where a symbol identifier includes one or more of a number or a symbol, but is not limited thereto.
Referring to fig. 3A, a schematic diagram of various languages provided in this embodiment, the semantics of the texts corresponding to Chinese, Japanese, Korean, Latin, Thai, Arabic, and Hindi are all "hello", and the symbol identifier represents 11 o'clock (11:00) to 12 o'clock (12:00).
The text rendering direction is used to represent the typesetting direction of the text, and exemplary text rendering directions include, but are not limited to, 0 °, 90 °, 180 °, and 270 °.
Taking Chinese as an example, referring to fig. 3B, when the text presentation direction is 0° or 180°, the characters are typeset horizontally, and when the text presentation direction is 90° or 270°, the characters are typeset vertically.
It should be noted that, in this embodiment of the application, the image to be recognized may be an original image, or may also be a partial image including a text extracted from the original image through text detection, where a specific text detection manner is referred to below.
Taking the image 1 to be recognized as an example, referring to fig. 4, the image 1 to be recognized includes the Chinese text "stepping" and the English text "spring". The image 1 to be recognized is input into the target classification model to obtain the language distribution information and the original text presentation direction corresponding to the image 1 to be recognized, where the original text presentation direction is 180°, and the language distribution information includes the plurality of languages corresponding to the text, namely Chinese and English, as well as the corresponding text position information: the text position information for Chinese indicates the position of the text "stepping", and the text position information for English indicates the position of the text "spring".
S202, based on the original text presenting direction and the preset target character presenting direction, image correction is carried out on the image to be recognized, and the image to be recognized obtained after correction is used as a target recognition image.
Still taking the image 1 to be recognized as an example, as shown in fig. 5, assuming that the preset target character presenting direction is 0 °, the original text presenting direction is 180 °, based on the original text presenting direction and the preset target character presenting direction, image correction is performed on the image 1 to be recognized, and the image to be recognized obtained after correction is taken as the target recognition image 1, wherein the character presenting direction of the target recognition image 1 is 0 °.
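As a minimal sketch, this correction step can be implemented as a rotation of the image by the difference between the two directions (assuming the directions are multiples of 90° and that OpenCV is available; the mapping between direction labels and rotation constants is an assumed convention, not one stated above):

```python
import cv2
import numpy as np

def correct_image(image: np.ndarray, original_direction: int, target_direction: int = 0) -> np.ndarray:
    """Rotate the image so that its text presentation direction matches the
    preset target character presentation direction (e.g. 0 degrees)."""
    angle = (original_direction - target_direction) % 360
    rotations = {
        0: lambda img: img,
        # Which rotation undoes a given direction label depends on how the
        # labels are defined; the assignment below is an assumed convention.
        90: lambda img: cv2.rotate(img, cv2.ROTATE_90_COUNTERCLOCKWISE),
        180: lambda img: cv2.rotate(img, cv2.ROTATE_180),
        270: lambda img: cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE),
    }
    return rotations[angle](image)

# Example matching the text: an image whose original text presentation direction is
# 180 degrees is rotated by 180 degrees so that the corrected image reads at 0 degrees.
```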
And S203, determining a text region image set corresponding to a plurality of languages from the target recognition image based on the obtained text position information.
Taking the target recognition image 2 as an example, referring to fig. 6, the target recognition image 2 is the corrected image 2 to be recognized and includes Chinese, Japanese and English. Fig. 6 includes a dashed box 61, a dashed box 62, a dashed box 63 and a dashed box 64, where the dashed box 61 and the dashed box 64 both represent text region images corresponding to Japanese, the dashed box 62 represents a text region image corresponding to Chinese, and the dashed box 63 represents a text region image corresponding to English.
And S204, respectively adopting target text recognition models associated with corresponding languages based on the obtained text region image sets to obtain text recognition results corresponding to the images to be recognized.
In order to improve text recognition efficiency, according to the characteristics and scales of the character sets corresponding to different languages, five target text recognition models may be adopted, corresponding respectively to Chinese, Japanese, Korean, English and mixed Latin, where mixed Latin includes, but is not limited to, Latin, Thai, Vietnamese, Russian, Arabic and Hindi; the character set scales corresponding to Chinese, Japanese, Korean, Thai and mixed Latin are approximately 10,000+, 9,000+, 8,000+, 200+ and 1,000+ respectively. Hereinafter, these five target text recognition models are taken as an example for description.
Specifically, when S204 is executed, the following operations may be adopted, but are not limited to:
respectively inputting the obtained image sets of the text regions into target text recognition models associated with the corresponding languages, and obtaining text recognition sub-results corresponding to the image sets of the text regions;
and obtaining a text recognition result corresponding to the image to be recognized based on the obtained text recognition sub-results.
Still taking the target recognition image 2 as an example, referring to fig. 7, the dashed box 61 and the dashed box 64 both represent text region images corresponding to Japanese, the dashed box 62 represents a text region image corresponding to Chinese, and the dashed box 63 represents a text region image corresponding to English. The text region image represented by the dashed box 61 is input into the target text recognition model associated with Japanese to obtain the corresponding text recognition sub-result 61; the text region image represented by the dashed box 62 is input into the target text recognition model associated with Chinese to obtain the corresponding text recognition sub-result 62; the text region image represented by the dashed box 63 is input into the target text recognition model associated with mixed Latin to obtain the corresponding text recognition sub-result 63; and the text region image represented by the dashed box 64 is input into the target text recognition model associated with Japanese to obtain the corresponding text recognition sub-result 64. Further, based on the text recognition sub-results 61, 62, 63 and 64, the text recognition result corresponding to the image 2 to be recognized is obtained.
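A minimal sketch of the dispatch logic of S201 to S204 follows. The classifier/recognizer interfaces, the box format and the crop helper are illustrative assumptions, and correct_image refers to the rotation sketch above:

```python
from collections import defaultdict

def crop(image, box):
    left, top, right, bottom = box          # assumed box format: (left, top, right, bottom)
    return image[top:bottom, left:right]

def recognize_text(image, classifier, recognizers, target_direction=0):
    """classifier: the target classification model, assumed to return
    ([(language, box), ...], original_direction); recognizers: a dict mapping
    each language to its target text recognition model."""
    lang_info, original_direction = classifier.predict(image)                 # S201
    rectified = correct_image(image, original_direction, target_direction)    # S202
    # Note: in practice the predicted boxes would also need to be transformed
    # along with the rotation applied to the image.

    regions_by_lang = defaultdict(list)                                       # S203
    for language, box in lang_info:
        regions_by_lang[language].append(crop(rectified, box))

    results = []                                                              # S204
    for language, regions in regions_by_lang.items():
        model = recognizers[language]
        results.extend(model.recognize(region) for region in regions)
    return results
```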
Since the character set sizes of different languages differ greatly, this implementation helps avoid the situation where the recognition result is biased towards languages with large character sets (such as Chinese, Japanese and Korean) while languages with small character sets (such as Latin, Arabic and Thai) are ignored. This improves character recognition accuracy, and each text recognition model can be adaptively optimized according to the character characteristics of its language, enabling flexible model updates.
In some embodiments, the image to be recognized may be obtained by text detection, which may specifically adopt, but is not limited to, the following ways:
mode 1: an original image is acquired and at least one sub-image containing text is extracted from the original image.
Specifically, the original image may be input into the target text line detection model, and at least one sub-image including text may be obtained. The target text line detection model may be implemented by using a Differential Binarization (DB) algorithm, but is not limited thereto.
Referring to fig. 8A, the backbone of the target text line detection model may adopt a Fully Convolutional Network (FCN) architecture based on a lightweight network, and the multi-stream branches of the head are used to determine whether a pixel in the image belongs to a character and to learn the binarization threshold. The head of the target text line detection model may be composed of one 3 × 3 convolution operator and two deconvolution operators with a stride of 2, where 1/2, 1/4, …, and 1/32 represent the scale relative to the input original image. The lightweight network architecture may adopt, but is not limited to, mobilenetv2, mobilenetv3, shufflenet, and the like.
The input original image passes through a feature pyramid network (FPN) with a resnet50-vd backbone (the detailed structure is described in Section 2.2.4); the outputs of the feature pyramid are converted to the same size by upsampling and concatenated (cascaded) to generate a feature map. Based on the feature map, a probability map and a threshold map can then be predicted, where the probability map represents the probability that each pixel belongs to text and the threshold map represents the threshold corresponding to each pixel. Further, an approximate binary map can be obtained based on the probability map and the threshold map, and finally the corresponding sub-images are obtained based on the binary map. The sub-images include sub-image 1, sub-image 2 and sub-image 3, where sub-image 1 contains the text "Harvesting", sub-image 2 contains the text "great", and sub-image 3 contains the text "skills".
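A sketch of how the approximate binary map can be derived from the probability map and the threshold map, following the general differentiable binarization formulation (the amplification factor k is an assumption, not a value given above):

```python
import numpy as np

def approximate_binary_map(prob_map: np.ndarray, thresh_map: np.ndarray, k: float = 50.0) -> np.ndarray:
    """Differentiable binarization: each pixel is pushed towards 0 or 1 depending on
    whether its text probability exceeds the learned per-pixel threshold."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

# Text line regions can then be extracted by thresholding this map and taking
# connected components or contours (e.g. with cv2.findContours).
```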
Mode 2:
the method comprises the steps of obtaining an original image, extracting at least one sub-image containing a text from the original image, respectively carrying out shape correction processing on each extracted sub-image based on a preset image shape, and taking any one of the sub-images obtained after the correction processing as an image to be recognized.
The preset image shape may be set as a regular pattern such as a rectangle, but is not limited thereto. In practical applications, the preset image shape is usually set to be rectangular for facilitating subsequent image processing operations, that is, the image shape of the sub-image is corrected to be rectangular. Since the process of extracting the sub-image in the mode 2 is the same as the process of extracting the sub-image in the mode 1, the description is omitted here.
For example, as shown in fig. 8B, assuming that the preset image shape is a rectangle, the sub-image 1 includes a text "Harvesting", and the image of the sub-image 1 is a curved shape, based on the preset image shape, the extracted sub-image 1 is subjected to shape correction processing to obtain a corrected sub-image 1, and the corrected sub-image 1 is a rectangular image including a text "Harvesting".
Obviously, in the embodiment of the present application, by performing polygon fitting on the extracted sub-images, the text regions are corrected to be rectangular text regions for the curved text regions, and then subsequent text recognition is performed based on the rectangular text regions, so that the text regions in any shape can be detected and recognized, and meanwhile, the text recognition accuracy can be further improved.
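For the simple case where the fitted polygon is a quadrilateral, the shape correction can be sketched as a perspective warp (an OpenCV-based illustration; the output height and the corner ordering are assumptions):

```python
import cv2
import numpy as np

def rectify_quad(image: np.ndarray, quad: np.ndarray, out_h: int = 48) -> np.ndarray:
    """Warp a quadrilateral text region (4 x 2 corner points, ordered clockwise from
    the top-left) into an axis-aligned rectangle of fixed height."""
    w = max(np.linalg.norm(quad[1] - quad[0]), np.linalg.norm(quad[2] - quad[3]))
    h = max(np.linalg.norm(quad[3] - quad[0]), np.linalg.norm(quad[2] - quad[1]))
    out_w = max(1, int(round(w * out_h / max(h, 1.0))))
    dst = np.array([[0, 0], [out_w - 1, 0], [out_w - 1, out_h - 1], [0, out_h - 1]],
                   dtype=np.float32)
    M = cv2.getPerspectiveTransform(quad.astype(np.float32), dst)
    return cv2.warpPerspective(image, M, (out_w, out_h))

# A curved region can be split into several quadrilaterals along the fitted polygon
# and the warped pieces concatenated horizontally to form the rectangular text region.
```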
Next, a target classification model and a target text recognition model related in the embodiments of the present application are introduced:
First, target classification model
Referring to fig. 9, the target classification model may include a target feature extraction network, a target language identification submodel, and a target direction identification submodel. The target feature extraction network may use, but is not limited to, a Convolutional Neural Network (CNN), and includes S1, S2, S3, and S4 stages, where the S1 stage includes operations such as depthwise (DW) convolution, normal convolution, an activation function, matrix multiplication, and pointwise (PW) convolution; normal convolution may also be referred to simply as convolution.
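A rough structural sketch of such a two-head classification model, written in PyTorch; the layer sizes, channel counts, and class counts are illustrative assumptions rather than the configuration disclosed above:

```python
import torch
import torch.nn as nn

class LanguageOrientationClassifier(nn.Module):
    """A shared lightweight CNN backbone with one head predicting the language
    distribution and one head predicting the text presentation direction."""
    def __init__(self, num_languages: int = 8, num_directions: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(                              # stand-in for the S1-S4 stages
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1, padding=1, groups=32),   # depthwise (DW) convolution
            nn.Conv2d(32, 64, 1), nn.ReLU(),                        # pointwise (PW) convolution
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.language_head = nn.Linear(64, num_languages)    # language distribution (soft target)
        self.direction_head = nn.Linear(64, num_directions)  # text presentation direction (0/90/180/270)

    def forward(self, x):
        feat = self.backbone(x)
        return self.language_head(feat), self.direction_head(feat)
```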
First, an acquisition process of the target language identification submodel is introduced.
In some embodiments, the target classification model includes a target language identification submodel, and the target language identification submodel may be trained by:
based on the obtained first training data set, model training is performed on an initial language identification submodel included in the initial identification model, and a target language identification submodel is output, where as shown in fig. 10A, in an iteration process, the following operations are performed:
s1001, inputting the training data x into the initial language identification submodel to obtain predicted language distribution information corresponding to the training data x. The training data x may be any one of the training data included in the first training data set.
Taking the training data as an image a as an example, referring to fig. 10B, the image a includes the text "x records 20 years of the food company". The image a is input into the initial language identification submodel to obtain the predicted language distribution information corresponding to the image a, which includes Chinese characters and symbol identifiers together with their corresponding text position information, where the text position information for the Chinese characters indicates the text "x records 0 years of the food company" and the text position information for the symbol identifier indicates the text "2".
S1002, determining the loss of the first model based on the predicted language distribution information and the real language distribution information corresponding to the training data x.
Still taking the training data as the image a as an example, referring to fig. 10B, the real language distribution information corresponding to the image a includes Chinese characters and symbol identifiers together with their corresponding text position information, where the text position information for the Chinese characters indicates the Chinese text "x records years of the food company" and the text position information for the symbol identifiers indicates the text "20".
S1003, based on the first model loss, model parameter adjustment is carried out on the initial language identification submodel.
Through the implementation mode, the model can be trained based on the predicted language distribution information and the real language distribution information corresponding to the training data, so that the language classification accuracy of the model is improved, and the accuracy of text recognition is further improved.
In practical applications, the text contained in an image may not belong entirely to a single category; for example, Chinese characters often appear in Japanese and Korean text, and Latin characters and symbol identifiers often appear mixed with characters of any language. In order to effectively describe such mixed character distributions, in the embodiment of the present application the predicted language distribution information is fitted to a soft target by optimizing the model parameters, where the soft target refers to the probability of each category of characters appearing in the text character string. Specifically, referring to fig. 10A, when S1002 is executed, the following operations may be adopted, but are not limited to:
s10021, determining a prediction distribution probability based on the prediction language distribution information, wherein the prediction distribution probability comprises prediction probabilities corresponding to the languages, and each prediction probability is used for representing the text length ratio of the corresponding language in each language.
Taking training data as an image a as an example, determining a prediction distribution probability based on the predicted language distribution information corresponding to the image a, wherein in the prediction distribution probability, the prediction probability of Chinese is 90%, and the prediction probability of symbol identification is 10%.
S10022, determining real distribution probability based on the real language distribution information, wherein the real distribution probability comprises respective corresponding real probabilities of the languages, and each real probability is used for representing the text length ratio of the corresponding language in each language.
Still taking training data as an image a as an example, based on the real language distribution information, the real distribution probability is determined, and in the real distribution probability, the real probability of the Chinese is 80%, and the real probability of the symbolic sign is 20%.
S10023, determining a first model loss based on the prediction distribution probability and the real distribution probability.
Through the implementation mode, the distribution condition of the mixed characters can be effectively described, so that the prediction accuracy of language distribution is improved, and the identification accuracy of the multilingual text is improved.
In some embodiments, when S10023 is executed, the first model loss may be a cross-entropy loss (CE loss) or a KL divergence (Kullback-Leibler divergence) loss.
In order to describe the similarity between the predicted distribution probability and the real distribution probability, in the embodiment of the present application, the KL divergence loss may be used as the target loss for the network's language prediction, and the KL divergence loss is calculated as shown in formula (1):
D_KL(P ‖ Q) = Σ_x P(x) · log( P(x) / Q(x) )        (1)
where P represents the predicted distribution probability, Q represents the real distribution probability, P(x) represents the predicted probability corresponding to language x, Q(x) represents the real probability corresponding to language x, and x is one of the languages.
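The soft-target construction and the KL divergence loss of formula (1) can be sketched as follows (a minimal PyTorch illustration; the tensor shapes, the language index mapping, and the epsilon smoothing are assumptions):

```python
import torch

def soft_target(char_languages, language_ids):
    """Soft target: the fraction of characters of each language in the text string,
    e.g. 8 Chinese characters + 2 symbol characters -> 0.8 / 0.2."""
    counts = torch.zeros(len(language_ids))
    for lang in char_languages:
        counts[language_ids[lang]] += 1
    return counts / counts.sum()

def kl_language_loss(pred_probs, true_probs, eps=1e-8):
    """Formula (1): D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), with P the predicted
    distribution and Q the real (soft-target) distribution, averaged over the batch."""
    p = pred_probs.clamp_min(eps)
    q = true_probs.clamp_min(eps)
    return (p * (p / q).log()).sum(dim=-1).mean()

# Example matching the image a above: prediction (Chinese 0.9, symbols 0.1)
# against the real distribution (Chinese 0.8, symbols 0.2).
loss = kl_language_loss(torch.tensor([[0.9, 0.1]]), torch.tensor([[0.8, 0.2]]))
```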
Next, an acquisition process of the target direction recognition submodel is described.
As a possible implementation manner, in this embodiment of the application, in a one-time iteration process, the training data x may be input into the initial direction recognition submodel to obtain a predicted text presentation direction of the training data x, and then the second model loss is determined based on the predicted text presentation direction and the real text presentation direction of the training data x, and further the model parameter adjustment is performed on the initial direction recognition submodel based on the second model loss.
As another possible implementation manner, in order to enable the model to learn the difference between images in different text presentation directions well, in this embodiment of the application, images in different text presentation directions may be acquired in an input layer, and the difference is increased by maximizing the distance between feature layers of the images in different text presentation directions in model training, so that the model learns the understanding of the text presentation directions better. Specifically, the target direction recognition submodel is obtained by:
and performing model training on the initial direction recognition submodel contained in the initial recognition model based on the acquired first training data set, and outputting a target direction recognition submodel.
In the following, still taking the training data x as an example, referring to fig. 11A, in the course of one iteration, the following operations are performed:
s1101, acquiring training data x, and rotating the training data x according to a preset image rotation angle to obtain comparison data y.
In the embodiment of the application, in order to enable the model to better learn the differences of the images in opposite character directions, namely the text presentation directions of 0 ° and 180 ° and the text presentation directions of 90 ° and 270 °, the preset image rotation angle may be set to 180 °. It should be noted that the image rotation angle may be set according to the actual application scene, and is not limited to 180 °.
Taking the training data x as the image b as an example, referring to fig. 11B, the image b includes the Korean text "hello"; after the image b is acquired, the training data x is rotated by 180° according to the preset image rotation angle to obtain the comparison data y.
And S1102, respectively inputting the training data x and the comparison data y into the initial direction recognition submodel to obtain the predicted text presenting directions corresponding to the training data x and the comparison data y respectively.
Taking training data x as an image b as an example, inputting the training data x into the initial direction identification submodel to obtain a predicted text presenting direction corresponding to the training data x, wherein the predicted text presenting direction corresponding to the training data x is 0 °, and inputting comparison data y into the initial direction identification submodel to obtain a predicted text presenting direction corresponding to the comparison data y, wherein the predicted text presenting direction corresponding to the comparison data y is 180 °.
S1103, determining second model loss based on the obtained presentation directions of the predicted texts, and adjusting model parameters of the initial direction recognition submodel based on the second model loss.
By the implementation mode, the model can learn the difference between the images in different text presenting directions, so that the recognition accuracy of the text in different text presenting directions is improved.
In some embodiments, the second model loss may be a model predicted loss or a contrast loss (contrast loss), or may be a weighted result of the model predicted loss and the contrast loss. The model prediction loss may be cross entropy loss or focal loss (focal loss), but is not limited thereto, and the cross entropy loss is only used as an example for description.
And if the second model loss adopts cross entropy loss, calculating cross entropy loss based on the obtained presentation directions of the predicted texts and the corresponding presentation directions of the real texts, and taking the calculated cross entropy loss as the second model loss.
And if the second model loss adopts the contrast loss, calculating the contrast loss between the presentation directions of the predicted texts based on the obtained presentation directions of the predicted texts, and taking the calculated contrast loss as the second model loss.
If the second model loss is weighted by the cross-entropy loss and the contrast loss, the second model loss can be determined in the following manner:
determining contrast loss based on the image characteristics corresponding to the training data x and the contrast data y respectively;
determining cross entropy losses corresponding to the training data x and the comparison data y respectively based on the real text presenting directions corresponding to the training data x and the comparison data y respectively and the obtained predicted text presenting directions;
and determining the second model loss based on the obtained cross entropy losses and the contrast loss, and based on the cross entropy loss weight and the contrast loss weight.
Wherein, the calculation formula of the contrast loss is shown in formula (2):
C_loss = max(margin - d, 0)        (2)
where C_loss represents the contrast loss, d represents the Euclidean distance between the image features corresponding to the training data x and the comparison data y, margin is a preset threshold, and the max() function takes the maximum value.
The image features corresponding to the training data x and the comparison data y are obtained by performing image feature extraction on the respective images, and may also be referred to as image embeddings.
When determining the second model loss based on the obtained cross-entropy losses and the contrast loss, together with the cross-entropy loss weight and the contrast loss weight, the cross-entropy term may be taken as either the sum or the average of the individual cross-entropy losses, but is not limited thereto.
Still taking the training data x as the image b as an example, referring to fig. 11C, it is determined that the contrast loss is 0.2 based on the image features corresponding to the training data x and the contrast data y, then it is determined that the cross entropy loss corresponding to the training data x is 0.1 based on the real text presentation direction and the predicted text presentation direction corresponding to the training data x, it is determined that the cross entropy loss corresponding to the contrast data y is 0.1 based on the real text presentation direction and the predicted text presentation direction corresponding to the contrast data y, then, it is assumed that the weights corresponding to the cross entropy loss and the contrast loss are both 0.5, and it is determined that the second model loss is 0.2 based on the sum of the cross entropy losses by using the cross entropy loss weight and the contrast loss weight.
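The weighted combination described above can be sketched as follows (an illustrative PyTorch implementation; the margin value, the 0.5 weights and the function signature are assumptions consistent with the worked example):

```python
import torch
import torch.nn.functional as F

def second_model_loss(feat_x, feat_y, logits_x, logits_y, label_x, label_y,
                      margin=1.0, w_ce=0.5, w_contrast=0.5):
    """feat_*: image features (embeddings) of the training data x and its 180-degree
    rotated comparison data y; logits_*: predicted direction logits; label_*: true
    direction class indices."""
    # Formula (2): the contrast loss pushes the two embeddings at least `margin` apart.
    d = torch.norm(feat_x - feat_y, p=2, dim=-1)
    contrast = torch.clamp(margin - d, min=0.0).mean()

    # Cross-entropy losses of x and y against their real presentation directions (summed).
    ce = F.cross_entropy(logits_x, label_x) + F.cross_entropy(logits_y, label_y)

    return w_ce * ce + w_contrast * contrast
```

The comparison data y can be generated on the fly by rotating each training image by 180°, for example with torch.rot90(img, 2, dims=(-2, -1)).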
Obviously, in the embodiment of the application, the difference can be increased by maximizing the distance between the feature layers of the images in different text presenting directions in model training, so that the model can learn the understanding of the text presenting directions better, the recognition accuracy of the text presenting directions of the model is improved, and the text recognition accuracy is improved.
It should be noted that, in this embodiment of the present application, the target direction recognition submodel and the target language recognition submodel may be included in the same target classification model, and the target direction recognition submodel and the target language recognition submodel may also be configured separately to implement the function of the target classification model, which is not described in detail herein.
And finally, introducing the acquisition process of the target feature extraction network.
For an image containing text, judging the language category can rely on the apparent features of the image, but judging the text presentation direction involves recognizing and understanding the text content, and apparent image features alone cannot handle some special pictures. For example, referring to fig. 12, the image c and the image d both include the text "codep"; the text presentation direction of the image c is 180° and that of the image d is 0°, yet the text recognition result for the image c is "codep" while that for the image d is "dapos", which is obviously incorrect.
In order to assist the model in learning the text presentation direction and to improve the precision of text presentation direction classification, in the embodiment of the present application, a multi-stream recognition model that uses the same backbone network as the target classification model is introduced as a pre-training task. The multi-stream recognition model is used to recognize the content of multilingual text, and after its training is completed, the trained backbone network of the multi-stream recognition model is taken as the target feature extraction network. Further, based on the target feature extraction network, the initial language identification submodel and the initial direction recognition submodel are trained to obtain the target language identification submodel and the target direction recognition submodel.
Specifically, the target feature extraction network is obtained by training the following operations:
constructing a pre-training language identification model based on the initial feature extraction network;
and performing iterative training on the pre-training language identification model based on the obtained second training data set to obtain the target feature extraction network.
It should be noted that, in the embodiment of the present application, the pre-trained language recognition model may also be referred to as a multi-stream recognition model.
Referring to fig. 13, the pre-training language identification model includes an input layer, a backbone network, a time sequence model, a multi-stream decoder and an output layer. The backbone network has the same structure as the backbone network in the target classification model and is used for learning image appearance features, the time sequence model is used for learning context information of the text, and the multi-stream decoder includes, but is not limited to, decoders corresponding to Chinese, Japanese, Korean, Latin, Thai, Arabic, Hindi and symbols.
The time sequence model may adopt a Long Short-Term Memory network (LSTM), where x_t denotes the input value at time t, y_t denotes the output value at time t, and σ is the gate activation function. In the LSTM, information useful for the calculation at later time steps is passed on by forgetting old information and memorizing new information in the cell state, while useless information is discarded; combining a forward LSTM and a backward LSTM forms a bidirectional long short-term memory network (BiLSTM).
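A minimal PyTorch sketch of this multi-stream structure is given below; the backbone module, the feature and hidden dimensions, the language set and the character-set sizes are assumptions, and each language decoder is reduced to a single linear CTC head purely for illustration:

import torch
import torch.nn as nn

class MultiStreamRecognizer(nn.Module):
    """Backbone + bidirectional LSTM + one decoder head per language (illustrative sketch)."""

    def __init__(self, backbone, charset_sizes, feat_dim=512, hidden=256):
        super().__init__()
        self.backbone = backbone                        # same structure as the classification backbone
        self.bilstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        # One lightweight decoder per language stream; a linear CTC head stands in
        # for the per-language decoders (Chinese, Japanese, Korean, Latin, Thai, ...).
        self.decoders = nn.ModuleDict({
            lang: nn.Linear(2 * hidden, n_chars + 1)    # +1 for the CTC blank symbol
            for lang, n_chars in charset_sizes.items()
        })

    def forward(self, images, lang):
        feats = self.backbone(images)                   # assumed output shape (B, C, 1, W)
        seq = feats.flatten(2).permute(0, 2, 1)         # column features as a sequence (B, W, C)
        ctx, _ = self.bilstm(seq)                       # context information of the text
        return self.decoders[lang](ctx)                 # per-timestep logits for the chosen stream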
Through this implementation, the model convergence speed can be remarkably improved, and the classification precision of the text presentation direction and the language distribution can also be improved, thereby improving the text recognition precision.
In some embodiments, the target classification model introduces additional overhead into the overall text recognition process, and this time overhead is multiplied once the number of text lines in the image is large. In order to reduce the time cost of the target classification model as much as possible, in the embodiment of the present application, model compression and pruning may be performed on the target classification model. Specifically, the target feature extraction network may adopt a lightweight network framework, which may be, but is not limited to, MobileNetV2, MobileNetV3, ShuffleNet, and the like. In addition, in order to enhance feature attention, an SE (Squeeze-and-Excitation) layer may be added to the target feature extraction network, and a dimension reduction layer may be added to at least one of the target direction recognition submodel and the target language recognition submodel to achieve a further reduction in computation.
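As a sketch of the Squeeze-and-Excitation layer mentioned above (the reduction ratio is an assumed hyperparameter):

import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-Excitation: re-weight channels by globally pooled statistics."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # squeeze: global spatial average pooling
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                   # excitation: per-channel weight in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.gate(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                        # channel-wise feature re-weighting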
In order to further improve the online prediction speed of the model, in the embodiment of the application, PyTorch quantization-aware training (QAT) adapted to TensorRT int8 is used for 8-bit quantized model tuning, so that the online Graphics Processing Unit (GPU) prediction speed is improved while only a slight loss of classification precision is incurred.
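A minimal PyTorch quantization-aware training flow is sketched below; the fine-tuning function is a placeholder, and converting the resulting model to a TensorRT int8 engine happens outside this snippet:

import torch.quantization as tq

def quantization_aware_training(model, finetune_fn):
    # Insert fake-quantization observers so the model learns int8-friendly weights.
    model.train()
    model.qconfig = tq.get_default_qat_qconfig("fbgemm")
    tq.prepare_qat(model, inplace=True)

    finetune_fn(model)        # fine-tune for a few epochs with quantization noise (placeholder)

    model.eval()
    return tq.convert(model)  # fold observers into true int8 modules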
Second, target text recognition model
In the embodiment of the present application, the target text recognition model includes an image feature encoder, a time sequence model and a decoder, where the decoder may implement multi-task decoding by combining a Connectionist Temporal Classification (CTC) decoder with an attention mechanism that assists CTC learning. The image feature encoder may employ, but is not limited to, ResNet-50VD, and the time sequence model may employ, but is not limited to, a bidirectional LSTM; the time sequence model is used to strengthen the learning of textual context information in the image. Since the structure of the text recognition model is similar to that of the multi-stream recognition model, it is not described here again.
In addition, unlike Chinese and English recognition tasks, multi-language recognition annotation data for model training is extremely scarce: annotation is costly, the data is difficult to inspect, professional linguists are required to assist in the annotation, the demands on annotator ability are high, and a large amount of training data cannot be acquired in a short time.
In order to prevent the serious scarcity of multi-language recognition annotation data from degrading text recognition performance, in the embodiment of the application, on one hand, text recognition performance can be improved through Semi-supervised Learning (SSL), which is widely applied to image classification tasks and can alleviate the shortage of labeled training data; on the other hand, text recognition performance can be improved through data generation.
Next, the semi-supervised learning process is described.
Specifically, in the embodiment of the present application, the target text recognition model is obtained by training through the following operations:
acquiring a third training data set, wherein the third training data set comprises marked samples and unmarked samples;
training a first text recognition model that includes an image feature extraction network based on each labeled sample to obtain a second text recognition model, and constructing an image-text distance recognition model based on the image feature extraction network included in the second text recognition model;
and performing iterative training on the second text recognition model based on the marked samples, the unmarked samples and the image-text distance recognition model to obtain a target text recognition model.
It should be noted that, in the embodiment of the present application, the first training data set, the second training data set, and the third training data set may be the same or different, and are not limited thereto.
Each labeled sample contains a corresponding real text recognition result, and each unlabeled sample does not contain a corresponding real text recognition result.
The first text recognition model refers to an untrained text recognition model. The image feature extraction network can adopt, but is not limited to CNN, and the image-text distance recognition model adopts the image feature extraction network included in the second text recognition model as its own image feature extraction network.
The inputs to the image-text distance recognition model are the labeled samples and their corresponding sample labels, and the model loss may adopt a ranking loss, so that the image-text distance between a labeled sample and its own sample label is minimized while the image-text distance between a labeled sample and the labels of other samples is maximized.
For example, referring to fig. 14, for an annotation sample 1, the annotation sample 1 and a corresponding sample label are input into an image-text distance recognition model, so as to obtain an image feature corresponding to the annotation sample 1 and a text feature corresponding to the sample label, and an image-text distance between the annotation sample 1 and the corresponding sample label is minimized by optimizing model parameters.
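As an illustrative sketch of such a ranking loss (the margin value, the feature normalization and the hardest-mismatch mining strategy are assumptions rather than the exact loss used here):

import torch
import torch.nn.functional as F

def ranking_loss(image_feats, text_feats, margin=0.2):
    """Pull matched image/label pairs together, push mismatched pairs apart.

    image_feats, text_feats: (B, D) features of the labeled samples and of their
    sample labels, row i of each tensor forming a matched pair.
    """
    image_feats = F.normalize(image_feats, dim=1)
    text_feats = F.normalize(text_feats, dim=1)
    dist = torch.cdist(image_feats, text_feats)                 # (B, B) image-text distances
    matched = dist.diag()                                       # distances of matched pairs
    eye = torch.eye(dist.size(0), dtype=torch.bool, device=dist.device)
    hardest_mismatch = dist.masked_fill(eye, float("inf")).min(dim=1).values
    # Hinge: each matched pair should be at least `margin` closer than its hardest mismatched pair.
    return F.relu(matched - hardest_mismatch + margin).mean()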
Specifically, referring to fig. 15, when the second text recognition model is iteratively trained, the following operations may be iteratively performed for each labeled sample and each unlabeled sample:
s1501, obtaining N marked samples and M unmarked samples, and respectively inputting the M unmarked samples into the second text recognition model to obtain predicted text recognition results corresponding to the M unmarked samples.
In the embodiment of the present application, the values of N and M are positive integers, and N and M may or may not be equal. The acquired N labeled samples and M unlabeled samples may be referred to as one batch of image data (batch).
Taking the unlabeled sample 2 as an example, referring to fig. 14, the unlabeled sample 2 is input into the second text recognition model, so as to obtain the text recognition result 2 corresponding to the unlabeled sample 2, where the text recognition result 2 is "BA CHU LON CON".
S1502, inputting the M unlabeled samples and the corresponding predicted text recognition results into the image-text distance recognition model, obtaining the image-text distances corresponding to the M unlabeled samples, and determining the model sub-losses corresponding to the M unlabeled samples based on the obtained image-text distances.
Still taking the unlabeled sample 2 as an example, referring to fig. 14, the unlabeled sample 2 and the corresponding text recognition result 2 are input into the image-text distance recognition model, and the image-text distance D1 corresponding to the unlabeled sample 2 is obtained, where the value of D1 is 0.33.
Specifically, determining the model sub-losses corresponding to the at least one unlabeled sample based on the obtained image-text distances includes:
screening unmarked samples with the image text distance not greater than a preset distance threshold from the M unmarked samples based on the obtained image text distance, and taking the screened unmarked samples as samples to be enhanced;
acquiring target enhancement samples corresponding to the samples to be enhanced, and taking a prediction text recognition result corresponding to each sample to be enhanced as a sample label of the corresponding target enhancement sample;
and respectively inputting the obtained target enhancement samples into the second text recognition model to obtain model sub-losses corresponding to the target enhancement samples, and taking the model sub-losses corresponding to the target enhancement samples as the model sub-losses corresponding to the M unlabeled samples.
It should be noted that, in the embodiment of the present application, the target enhancement sample refers to sample data obtained after performing data enhancement on a corresponding sample to be enhanced through data enhancement modes such as rotation, flipping, scaling, contrast change, noise disturbance, and the like. The sample label of the target enhancement sample may also be referred to as a pseudo-label of the target enhancement sample.
Still taking the unlabeled sample 2 as an example, referring to fig. 14, it is assumed that the preset distance threshold is 0.5, and the image-text distance D1 corresponding to the unlabeled sample 2 is not greater than 0.5, so the unlabeled sample 2 is taken as a sample to be enhanced. Then, the target enhancement sample 2 corresponding to the unlabeled sample 2 is obtained, and the text recognition result 2, "BA CHU LON CON", is used as the pseudo label of the target enhancement sample 2. The obtained target enhancement sample 2 is then input into the second text recognition model to obtain the model sub-loss corresponding to the target enhancement sample 2. Similarly, the model sub-losses corresponding to the other target enhancement samples can be obtained, yielding the model sub-losses corresponding to the M unlabeled samples.
And S1503, inputting the N marked samples into the second text recognition model respectively to obtain model sub-losses corresponding to the N marked samples respectively.
It should be noted that the model sub-loss may adopt cross entropy loss, focal loss, and the like.
Taking the labeled sample 1 as an example, referring to fig. 14, the labeled sample 1 is input into the second text recognition model to obtain the model sub-loss corresponding to the labeled sample 1, denoted as model sub-loss 1.
S1504, determining a third model loss based on the obtained model sub-losses, and adjusting model parameters of the second text recognition model based on the third model loss.
In this embodiment of the present application, within one input batch, the sum of the model sub-losses corresponding to the N labeled samples and the M unlabeled samples may be used as the third model loss. Gradient back-propagation is then carried out based on the third model loss so as to optimize the model parameters.
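The iteration of S1501 to S1504 can be sketched as follows; the predict, augment and distance-model interfaces as well as the distance threshold are assumed placeholders, not the exact implementation:

import torch

def train_semi_supervised_batch(model, dist_model, labeled, unlabeled, augment,
                                optimizer, recognition_loss, dist_threshold=0.5):
    # S1501: pseudo-label the unlabeled samples with the current second text recognition model.
    with torch.no_grad():
        pseudo_labels = [model.predict(img) for img in unlabeled]   # assumed decoding helper

    sub_losses = []
    # S1502: keep only unlabeled samples whose image-text distance is small enough,
    # augment them, and compute their sub-losses against the pseudo labels.
    for img, pseudo in zip(unlabeled, pseudo_labels):
        if dist_model(img, pseudo) <= dist_threshold:
            aug_img = augment(img)                                  # rotation, flipping, noise, ...
            sub_losses.append(recognition_loss(model(aug_img), pseudo))

    # S1503: supervised sub-losses for the labeled samples.
    for img, label in labeled:
        sub_losses.append(recognition_loss(model(img), label))

    # S1504: third model loss = sum of all sub-losses, then back-propagate.
    if not sub_losses:
        return 0.0
    third_model_loss = torch.stack(sub_losses).sum()
    optimizer.zero_grad()
    third_model_loss.backward()
    optimizer.step()
    return third_model_loss.item()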
Because the scales of the real data and the synthetic data used in model training differ greatly (for example, the real data is on the order of tens of thousands of samples while the synthetic data is on the order of millions), this implementation, on the one hand, prevents the real data from being swamped by the synthetic data through dual-stream input collaborative training of real and synthetic data, and on the other hand improves text recognition performance through the semi-supervised learning approach.
In some special languages, characters usually carry vowel marks and tone marks that are placed above or below the base characters, so text recognition for such languages needs to consider not only the time sequence order but also two-dimensional spatial information. For example, referring to fig. 16, Thai characters include vowel marks and tone marks that are generally located above and below the base characters.
To improve the recognition accuracy of such languages, in some embodiments the decoder in the target text recognition model may employ the two-dimensional spatial attention decoder of an irregular text recognition method (SAR). With the two-dimensional spatial attention decoder, each decoding step attends not only to the time sequence information but also to the spatial image feature information, so the recognition performance on irregular, spatially distributed text is relatively better.
Because SAR computes 2D attention weights at every decoding step, its decoding time is about 15 times that of CTC, and its recognition accuracy is easily limited for long texts. Therefore, in some embodiments, the decoder in the target text recognition model may adopt a dual-stream decoding structure combining CTC and SAR: the two decoders share the time sequence features of the LSTM, only the CTC branch result is predicted during decoding, and the SAR branch is used to assist CTC learning.
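A minimal sketch of the dual-stream idea is given below: both branches are trained on the shared sequence features, and only the CTC branch is used at prediction time. The sar_decoder object and its loss method are assumed stand-ins for the SAR attention decoder, not an implementation of it:

import torch
import torch.nn as nn

class DualStreamDecoder(nn.Module):
    """CTC branch used for prediction; an attention (SAR-style) branch assists training."""

    def __init__(self, feat_dim, num_classes, sar_decoder):
        super().__init__()
        self.ctc_head = nn.Linear(feat_dim, num_classes + 1)          # +1 for the CTC blank
        self.sar_decoder = sar_decoder                                # assumed SAR-style attention decoder
        self.ctc_loss = nn.CTCLoss(blank=num_classes, zero_infinity=True)

    def forward(self, seq_feats, targets=None, target_lengths=None):
        ctc_logits = self.ctc_head(seq_feats)                         # (B, T, C + 1)
        if self.training:
            log_probs = ctc_logits.log_softmax(-1).permute(1, 0, 2)   # (T, B, C + 1) for CTCLoss
            input_lengths = torch.full((seq_feats.size(0),), seq_feats.size(1), dtype=torch.long)
            loss = self.ctc_loss(log_probs, targets, input_lengths, target_lengths)
            # The SAR branch shares seq_feats and only contributes an auxiliary loss term.
            loss = loss + self.sar_decoder.loss(seq_feats, targets, target_lengths)
            return loss
        return ctc_logits.argmax(-1)                                  # only the CTC branch at inference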
For example, referring to fig. 17A, the CNN module may use a 31-layer ResNet to obtain a feature map; the feature map then passes through an LSTM-based encoder-decoder framework with a 2D attention module connected to the decoder, and finally outputs the text recognized by the SAR branch. In parallel, the feature map may be passed through the CTC decoder to output the text recognized by the CTC branch.
With this implementation, the image is down-sampled by the CNN to 1/8 of its original height, so that more spatial information is retained, the recognition accuracy of the text recognition model is improved, and no additional forward time overhead is introduced.
In SAR-based decoding, the two-dimensional attention module is introduced into the Thai recognition model, and experimental data show that the SAR recognition precision is nearly 7% higher than that of CTC. Referring to fig. 17B, SAR and CTC are used as decoders in the Thai recognition model respectively; the Thai text in image e means hello and the Thai text in image f means summer, and the CTC result contains incorrectly recognized characters, so the recognition accuracy of SAR is clearly higher than that of CTC.
Next, a data generation process will be described.
Specifically, at least one of the following operations may be performed:
operation A: and (6) synthesizing data.
Specifically, each text corpus, each font format and each background image are obtained, and each labeled sample is synthesized based on each text corpus, each font format and each background image.
In the embodiment of the application, data synthesis can be performed based on the textrender architecture, taking text corpora, font formats and background images of any language as input and outputting synthesized labeled samples.
For example, referring to fig. 18A, text corpus 1 is "rberinMann", font format 1 is robot.ttf, and background image 1 is a colored background; based on text corpus 1, font format 1 and background image 1, the synthesized labeled sample is as shown in fig. 18A.
In order to further increase the number of samples, referring to fig. 18B, in the embodiment of the present application, information such as the font size, color, spacing and thickness of the text corpus may be configured during the synthesis of a labeled sample, and the text may be rendered horizontally or vertically. For the background image, after processing operations such as cropping and image enhancement transformations, the processed background image and the text corpus are superimposed. As one example, the superimposed image can be used directly as the synthesized labeled sample. As another example, one or more of the following operations are performed on the superimposed image: Poisson fusion, perspective transformation, alpha-channel image superposition, image highlight enhancement, image printing enhancement, image enhancement, interference, and image size transformation, where the interference includes but is not limited to blurring, noise and superimposed horizontal-line interference.
Through this implementation, images containing multiple languages such as Chinese, Japanese, Korean, Thai, Vietnamese, Russian, Arabic, Latin and Hindi can be generated according to the characteristics of multi-language data, and the generation of both horizontal and vertical text images is supported, as well as text images in special languages such as Arabic or Hindi (written from right to left, with special glyph deformations). In addition, operations such as highlight, printing interference and data collage are introduced into the data synthesis, so that the synthesized pictures are closer to real data.
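The superimposition step described above can be illustrated with the following sketch based on Pillow rather than on the textrender architecture itself; the font path, corpus text, background path and the single blur operation are placeholders standing in for the full augmentation chain (Poisson fusion, perspective transformation, highlight, noise, and so on):

import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def synthesize_sample(corpus_text, font_path, background_path,
                      font_size=32, color=(20, 20, 20)):
    """Render a text corpus onto a cropped background to produce one labeled sample."""
    background = Image.open(background_path).convert("RGB")
    font = ImageFont.truetype(font_path, font_size)
    left, top, right, bottom = ImageDraw.Draw(background).textbbox((0, 0), corpus_text, font=font)
    # Crop a region slightly larger than the rendered text and draw the corpus onto it.
    sample = background.crop((0, 0, (right - left) + 20, (bottom - top) + 20))
    ImageDraw.Draw(sample).text((10, 5), corpus_text, font=font, fill=color)
    # One of the interference operations mentioned above; the others are omitted here.
    if random.random() < 0.5:
        sample = sample.filter(ImageFilter.GaussianBlur(radius=1))
    return sample, corpus_text          # the image and its label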
And operation B: and migrating data styles.
Specifically, each text corpus is obtained and is respectively input into the target data style migration model to obtain each labeled sample.
Referring to fig. 19, the target data style migration model includes a text transfer module, a background extraction module and a text fusion module, and adopts a Generative Adversarial Network (GAN). After a text corpus and an image with the target text style are input into the target data style migration model, the text transfer module outputs the text corpus Osk rendered in the target text font of the target text style, the background extraction module outputs the image background Ob contained in the target-style image, and the text fusion module outputs the text corpus Ot rendered in the target text font on the original image background, thereby generating the corresponding labeled sample. FIG. 19 also contains L_T, L_B and L_F, where L_T represents the model loss of the text transfer module, L_B represents the model loss of the background extraction module, and L_F represents the model loss of the text fusion module.
Referring to fig. 20, when the corpus is "requires" and the target text style is shown as an image g, inputting the corpus "requires" and the target text style into the target data style migration model to obtain a labeled sample "requires", when the corpus is "crisis" and the target text style is shown as an image h, inputting the corpus "crisis" and the target text style into the target data style migration model to obtain a labeled sample "crisis", and when the corpus is "beyond" and the target text style is shown as an image i, inputting the corpus "beyond" and the target text style into the target data style migration model to obtain a labeled sample "beyond".
Because GAN training is not stable enough, on the one hand, in the embodiment of the application, the original adversarial loss can be replaced by the hinge loss used in SN-GAN to stabilize the training process and avoid large-amplitude oscillation of the gradient. On the other hand, when calculating the L1 loss between the generated labeled sample and the target-style image, the L1 loss weighted by the text mask region may be adopted, which reduces the model's over-learning of the background and strengthens the constraint on the text-region pixels. In the testing stage, the trained data style migration model performs style migration on the input text corpus and the target text style image, generating recognition data that is closer to the real data.
Through this implementation, the font style of the real data is learned via data style migration; on the one hand, this further reduces the domain gap between the synthesized data and the real data, and on the other hand, the real data can subsequently be used to perform style migration on the data synthesized through operation A, or on other real data, thereby increasing sample diversity.
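The two stabilization measures just described can be sketched as follows; the text-region weighting factor is an assumed hyperparameter:

import torch.nn.functional as F

def hinge_d_loss(real_scores, fake_scores):
    # Discriminator hinge loss (the SN-GAN form): penalize real scores below +1 and fake scores above -1.
    return F.relu(1.0 - real_scores).mean() + F.relu(1.0 + fake_scores).mean()

def hinge_g_loss(fake_scores):
    # Generator hinge loss: push the discriminator scores of generated images upwards.
    return -fake_scores.mean()

def masked_l1_loss(generated, target, text_mask, text_weight=5.0):
    # L1 loss with extra weight on text-region pixels (text_mask in {0, 1}),
    # reducing over-learning of the background and tightening the text-region constraint.
    weight = 1.0 + (text_weight - 1.0) * text_mask
    return (weight * (generated - target).abs()).mean()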
It should be noted that, in the embodiment of the present application, the semi-supervised text recognition model may also be trained with reinforcement learning using a deep Q-network (DQN), where the text recognition model serves as the agent in the DQN, the image and the predicted text serve as the environment, and the reward may be expressed through the image-text distance and an edit-distance reward.
The text language/orientation classification network (LOPN) provided in the embodiment of the present application can quickly and accurately perform language distribution prediction and direction judgment on a text line image; the experimental results on the constructed test set are shown in Table 1. As can be seen from Table 1, modeling the language distribution with soft target probabilities and a KL divergence loss brings a performance gain of nearly 7% in language classification accuracy. Meanwhile, the classification model pre-trained with the multi-stream text recognition model is greatly improved in both language classification and direction classification: introducing the recognition task enhances the model's understanding of the image text and overcomes the drawback of classifying only by apparent image features. Moreover, dual-stream data supervision added to direction classification helps the model better distinguish the directions of the image text, improving direction classification performance.
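The soft-target modelling of the language distribution with a KL divergence loss, referred to above, could take the following form; the probability vectors are the per-language text-length ratios described earlier:

import torch.nn.functional as F

def language_distribution_loss(pred_logits, target_ratios):
    """KL divergence between predicted and true language-distribution probabilities.

    pred_logits: (B, num_languages) raw classifier outputs.
    target_ratios: (B, num_languages) soft targets, e.g. per-language text-length ratios summing to 1.
    """
    log_pred = F.log_softmax(pred_logits, dim=1)
    return F.kl_div(log_pred, target_ratios, reduction="batchmean")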
TABLE 1 LOPN test set accuracy
In the embodiment of the application, TensorRT is adopted to deploy the classification network model, NVIDIA T4 GPUs are used online, and the comparison of the prediction speeds of the original model and the quantized model is shown in Table 2.
TABLE 2 LOPN TensorRT model prediction accuracy and speed comparison
As shown in Table 3, the text recognition model in the embodiment of the present application combines data synthesis optimization and semi-supervised training, obtains higher recognition performance on the constructed language test sets, and its recognition accuracy greatly exceeds that of existing open-source models.
Specifically, a Normalized Edit Distance (NED) and a sequence accuracy (SeqACC) may be used as the evaluation indexes of the recognition task. NED may be calculated by formula (3), and SeqACC may be calculated by formula (4):

\[ \mathrm{NED} = \frac{1}{N}\sum_{i=1}^{N}\frac{D\left(\hat{y}_i,\ y_i\right)}{\max\left(\left|\hat{y}_i\right|,\ \left|y_i\right|\right)} \tag{3} \]

\[ \mathrm{SeqACC} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i = y_i\right|_{+} \tag{4} \]

In formula (3) and formula (4), D represents the Levenshtein distance, \(\hat{y}_i\) represents the predicted text (i.e. the text recognition result), \(y_i\) represents the ground-truth text, N represents the total number of images to be recognized, and \(|x|_{+}\) represents a counting operation whose value is incremented by 1 when the condition x is true.
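A minimal sketch of how the two metrics can be computed, using a plain Levenshtein distance implementation; the per-image normalization by the longer string length follows the common convention and is an assumption here (some works report 1 − NED as an accuracy score instead):

def levenshtein(a: str, b: str) -> int:
    """Edit distance D used in formula (3)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ned(preds, truths):
    # Average normalized edit distance over the N images, as in formula (3).
    return sum(levenshtein(p, t) / max(len(p), len(t), 1)
               for p, t in zip(preds, truths)) / len(preds)

def seq_acc(preds, truths):
    # Fraction of images whose predicted text exactly matches the ground truth, as in formula (4).
    return sum(p == t for p, t in zip(preds, truths)) / len(preds)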
TABLE 3 multilingual text recognition model accuracy
As can be seen from Table 3, in the training on different language data sets, the multi-stream data training, multi-task decoding, semi-supervised text recognition training, data style migration and strengthened data augmentation adopted in the embodiment of the present application all bring effective performance gains to the text recognition task, and have high practicability and universality.
Table 4 compares the Thai recognition accuracy when the model uses a CTC decoder, a SAR decoder, and the CTC + SAR dual-stream decoder; the CTC + SAR dual-stream decoder clearly achieves the best performance on both evaluation indexes, i.e., recognition speed and NED.
TABLE 4 comparison of recognition accuracy of Thai
Based on the same inventive concept, the embodiment of the application provides a text recognition device. As shown in fig. 21, which is a schematic structural diagram of the text recognition apparatus 2100, the apparatus may include:
an image classification unit 2101, configured to input an image to be recognized including a text into a target classification model, to obtain corresponding language distribution information and a presentation direction of an original text, where the language distribution information includes a plurality of languages corresponding to the text and text position information corresponding to each of the languages;
an image correction unit 2102 configured to perform image correction on the image to be recognized based on the original text presentation direction and a preset target character presentation direction, and use the image to be recognized obtained after the image to be recognized is corrected as a target recognition image;
a text positioning unit 2103, configured to determine, from the target recognition image, a text region image set corresponding to each of the plurality of languages based on the obtained text position information;
an image recognition unit 2104, configured to obtain, based on the obtained text region image sets, text recognition results corresponding to the images to be recognized by respectively using the target text recognition models associated with the corresponding languages.
As a possible implementation manner, when the target text recognition model associated with the corresponding language is respectively adopted based on the obtained image set of each text region to obtain the text recognition result corresponding to the image to be recognized, the image recognition unit 2104 is specifically configured to:
respectively inputting the obtained image sets of the text regions into target text recognition models associated with corresponding languages, and obtaining text recognition sub-results corresponding to the image sets of the text regions;
and obtaining a text recognition result corresponding to the image to be recognized based on the obtained text recognition sub-results.
As a possible implementation manner, the text recognition apparatus 2100 further includes a model training unit 2105, where the target classification model includes a target language identification submodel, and the model training unit 2105 is configured to:
model training is carried out on an initial language identification submodel contained in an initial identification model based on an obtained first training data set, and the target language identification submodel is output, wherein in the process of one iteration, the following operations are executed:
inputting a training data contained in the first training data set into an initial language identification submodel to obtain predicted language distribution information corresponding to the training data;
and determining a first model loss based on the predicted language distribution information and the real language distribution information corresponding to the training data, and adjusting model parameters of the initial language identification submodel based on the first model loss.
As a possible implementation manner, when determining the first model loss based on the predicted language distribution information and the real language distribution information corresponding to the training data, the model training unit 2105 is specifically configured to:
determining a prediction distribution probability based on the prediction language distribution information, wherein the prediction distribution probability comprises prediction probabilities corresponding to the languages, and each prediction probability is used for representing the text length ratio of the corresponding language in each language;
determining a real distribution probability based on the real language distribution information, wherein the real distribution probability comprises respective corresponding real probabilities of the languages, and each real probability is used for representing the text length ratio of the corresponding language in each language;
determining the first model loss based on the predicted distribution probability and the true distribution probability.
As a possible implementation manner, the text recognition apparatus 2100 further includes a model training unit 2105, where the target classification model includes a target direction identifier model, and the model training unit 2105 is configured to:
model training is carried out on an initial direction recognition submodel contained in an initial recognition model based on an obtained first training data set, and the target direction recognition submodel is output, wherein in the process of one iteration, the following operations are executed:
acquiring training data contained in the first training data set, and rotating the training data according to a preset image rotation angle to obtain comparison data;
respectively inputting the training data and the comparison data into the initial direction recognition submodel to obtain a predicted text presentation direction corresponding to each of the training data and the comparison data;
and determining a second model loss based on the obtained predicted text presentation directions, and adjusting model parameters of the initial direction recognition submodel based on the second model loss.
As a possible implementation manner, when determining the second model loss based on the obtained predicted text presentation directions, the model training unit 2105 is specifically configured to:
determining a contrast loss based on the image features corresponding to the training data and the contrast data respectively;
determining model prediction losses corresponding to the training data and the comparison data respectively based on the real text presenting directions corresponding to the training data and the comparison data respectively and the obtained predicted text presenting directions;
and determining a second model loss based on the obtained model prediction losses and the contrast loss, using the model prediction loss weight and the contrast loss weight.
As a possible implementation manner, the target classification model further includes a target feature extraction network, and the model training unit 2105 is further configured to:
constructing a pre-training language identification model based on the initial feature extraction network;
and performing iterative training on the pre-training language identification model based on the obtained second training data set to obtain the target feature extraction network.
As a possible implementation manner, before the image to be recognized including the text is input into the target classification model and corresponding language distribution information and the original text presentation direction are obtained, the image classification unit 2101 is further configured to:
acquiring an original image, and extracting at least one sub-image containing a text from the original image;
and respectively carrying out shape correction processing on each extracted sub-image based on a preset image shape, and taking any one of the sub-images obtained after the correction processing as the image to be recognized.
As a possible implementation, the model training unit 2105 is further configured to:
obtaining a third training data set, wherein the third training data set comprises marked samples and unmarked samples;
training a first text recognition model that includes an image feature extraction network based on the labeled samples to obtain a second text recognition model, and constructing an image-text distance recognition model based on the image feature extraction network included in the second text recognition model;
and performing iterative training on the second text recognition model based on the marked samples, the unmarked samples and the image-text distance recognition model to obtain the target text recognition model.
As a possible implementation manner, when performing iterative training on the second text recognition model based on each labeled sample, each unlabeled sample, and the image-text distance recognition model to obtain the target text recognition model, the model training unit 2105 is specifically configured to:
iteratively performing the following operations for the labeled samples and the unlabeled samples:
obtaining at least one labeled sample and at least one unlabeled sample, and respectively inputting the at least one unlabeled sample into the second text recognition model to obtain a predicted text recognition result corresponding to each unlabeled sample;
inputting the at least one unmarked sample and a corresponding predicted text recognition result into the image-text distance recognition model to obtain image-text distances corresponding to the at least one unmarked sample, and determining each model sub-loss corresponding to the at least one unmarked sample based on the obtained image-text distances;
inputting the at least one labeled sample into the second text recognition model respectively to obtain a model sub-loss corresponding to the at least one labeled sample;
and determining a third model loss based on the obtained model sub-losses, and adjusting the model parameters of the second text recognition model based on the third model loss.
As a possible implementation manner, when determining the model sub-losses corresponding to the at least one unlabeled sample based on the obtained image-text distances, the model training unit 2105 is specifically configured to:
screening out unlabeled samples with the image text distance not greater than a preset distance threshold from the at least one unlabeled sample based on the obtained image text distance, and taking the screened unlabeled samples as samples to be enhanced;
obtaining target enhancement samples corresponding to the samples to be enhanced respectively, and taking the prediction text recognition results corresponding to the samples to be enhanced as sample labels of the corresponding target enhancement samples;
and respectively inputting the obtained target enhancement samples into the second text recognition model to obtain model sub-losses corresponding to the target enhancement samples, and taking the model sub-losses corresponding to the target enhancement samples as the model sub-losses corresponding to the at least one unmarked sample.
As a possible implementation manner, the model training unit 2105 is further configured to perform at least one of the following operations:
acquiring each text corpus, each font format and each background image, and synthesizing each annotation sample based on each text corpus, each font format and each background image;
and acquiring each text corpus, and respectively inputting each text corpus into a target data style migration model to obtain each labeling sample.
For convenience of description, the above parts are separately described as modules (or units) according to functional division. Of course, the functionality of the various modules (or units) may be implemented in the same one or more pieces of software or hardware when implementing the present application.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit executes the request has been described in detail in the embodiment related to the method, and will not be elaborated here.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."
Based on the same inventive concept, the embodiment of the application also provides the electronic equipment. In one embodiment, the electronic device may be a server or a terminal device. Referring to fig. 22, which is a schematic structural diagram of a possible electronic device provided in an embodiment of the present application, in fig. 22, an electronic device 2200 includes: a processor 2210, and a memory 2220.
The memory 2220 stores a computer program executable by the processor 2210, and the processor 2210 can execute the steps of the text recognition method by executing the instructions stored in the memory 2220.
Memory 2220 may be a volatile memory (volatile memory), such as a random-access memory (RAM); the Memory 2220 may also be a non-volatile Memory (non-volatile Memory), such as a Read-Only Memory (ROM), a flash Memory (flash Memory), a Hard Disk Drive (HDD) or a solid-state drive (SSD); or memory 2220 is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. Memory 2220 may also be a combination of the above.
Processor 2210 may include one or more Central Processing Units (CPUs), or be a digital processing unit or the like. Processor 2210, when executing the computer program stored in memory 2220, implements the text recognition method described above.
In some embodiments, processor 2210 and memory 2220 may be implemented on the same chip, or in some embodiments, they may be implemented separately on separate chips.
The specific connection medium between the processor 2210 and the memory 2220 is not limited in the embodiments of the present application. In the embodiment of the present application, the processor 2210 and the memory 2220 are connected by a bus, which is depicted by a thick line in fig. 22; the connection manner between other components is merely illustrative and is not limiting. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of description, only one thick line is depicted in fig. 22, but this does not mean that there is only one bus or only one type of bus.
Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium including a computer program for causing an electronic device to perform the steps of the text recognition method described above when the computer program runs on the electronic device. In some possible embodiments, the various aspects of the text recognition method provided by the present application may also be implemented in the form of a program product including a computer program for causing an electronic device to perform the steps of the text recognition method described above when the program product is run on the electronic device, for example, the electronic device may perform the steps as shown in fig. 2.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable Disk, a hard Disk, a RAM, a ROM, an erasable programmable Read-Only Memory (EPROM or flash Memory), an optical fiber, a portable Compact Disk Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of the embodiments of the present application may be a CD-ROM and include a computer program, and may be run on an electronic device. However, the program product of the present application is not so limited, and in this document, a readable storage medium may be any tangible medium that can contain, or store a computer program for use by or in connection with a command execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with a readable computer program embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a computer program for use by or in connection with a command execution system, apparatus, or device.
It is to be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or operation from another entity or operation without necessarily requiring or implying any actual such relationship or order between such entities or operations.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (12)

1. A method of text recognition, the method comprising:
inputting an image to be recognized containing a text into a target classification model, and obtaining corresponding language distribution information and an original text presenting direction, wherein the language distribution information contains a plurality of languages corresponding to the text and text position information corresponding to the languages;
based on the original text presenting direction and a preset target character presenting direction, carrying out image correction on the image to be recognized, and taking the corrected image to be recognized as a target recognition image;
determining a text region image set corresponding to each of the plurality of languages from the target recognition image based on the obtained text position information;
based on the obtained image set of each text region, respectively adopting a target text recognition model associated with the corresponding language to obtain a text recognition result corresponding to the image to be recognized;
the target text recognition model is obtained by training through the following operations:
acquiring a third training data set, wherein the third training data set comprises marked samples and unmarked samples;
training a first text recognition model that includes an image feature extraction network based on the labeled samples to obtain a second text recognition model, and constructing an image-text distance recognition model based on the image feature extraction network included in the second text recognition model;
iteratively performing the following operations for the labeled samples and the unlabeled samples:
obtaining at least one labeled sample and at least one unlabeled sample, and respectively inputting the at least one unlabeled sample into the second text recognition model to obtain a predicted text recognition result corresponding to each unlabeled sample;
inputting the at least one unmarked sample and a corresponding predicted text recognition result into the image-text distance recognition model to obtain an image-text distance corresponding to the at least one unmarked sample;
screening out unlabeled samples with the image text distance not greater than a preset distance threshold from the at least one unlabeled sample based on the obtained image text distance, and taking the screened unlabeled samples as samples to be enhanced;
acquiring target enhancement samples corresponding to the samples to be enhanced, and taking the prediction text recognition results corresponding to the samples to be enhanced as sample labels of the corresponding target enhancement samples;
respectively inputting the obtained target enhancement samples into the second text recognition model to obtain model sub-losses corresponding to the target enhancement samples, and taking the model sub-losses corresponding to the target enhancement samples as the model sub-losses corresponding to the at least one unmarked sample;
inputting the at least one labeled sample into the second text recognition model respectively to obtain a model sub-loss corresponding to the at least one labeled sample;
and determining a third model loss based on the obtained model sub-losses, and adjusting model parameters of the second text recognition model based on the third model loss.
2. The method according to claim 1, wherein the obtaining of the text recognition result corresponding to the image to be recognized by using the target text recognition model associated with the corresponding language based on the obtained image set of each text region comprises:
respectively inputting the obtained image sets of the text regions into target text recognition models associated with corresponding languages, and obtaining text recognition sub-results corresponding to the image sets of the text regions;
and obtaining a text recognition result corresponding to the image to be recognized based on the obtained text recognition sub-results.
3. The method of claim 1, wherein the target classification model comprises a target language identification submodel, and the target language identification submodel is obtained by:
model training is carried out on an initial language identification submodel contained in an initial identification model based on an obtained first training data set, and the target language identification submodel is output, wherein in the process of one iteration, the following operations are executed:
inputting a training data contained in the first training data set into an initial language identification submodel to obtain predicted language distribution information corresponding to the training data;
and determining a first model loss based on the predicted language distribution information and the real language distribution information corresponding to the training data, and adjusting model parameters of the initial language identification submodel based on the first model loss.
4. The method of claim 3, wherein said determining a first model loss based on said predicted language distribution information and said real language distribution information corresponding to said one training data comprises:
determining a prediction distribution probability based on the prediction language distribution information, wherein the prediction distribution probability comprises prediction probabilities corresponding to the languages, and each prediction probability is used for representing the text length ratio of the corresponding language in each language;
determining a real distribution probability based on the real language distribution information, wherein the real distribution probability comprises respective corresponding real probabilities of the languages, and each real probability is used for representing the text length ratio of the corresponding language in each language;
determining the first model loss based on the predicted distribution probability and the true distribution probability.
5. The method according to any of claims 1-4, wherein a target direction recognition submodel is included in the target classification model, and the target direction recognition submodel is obtained by:
model training is carried out on an initial direction recognition submodel contained in an initial recognition model based on an obtained first training data set, and the target direction recognition submodel is output, wherein in the process of one iteration, the following operations are executed:
acquiring training data contained in the first training data set, and rotating the training data according to a preset image rotation angle to obtain comparison data;
respectively inputting the training data and the comparison data into the initial direction recognition submodel to obtain a predicted text presentation direction corresponding to each of the training data and the comparison data;
and determining a second model loss based on the obtained predicted text presentation directions, and adjusting model parameters of the initial direction recognition submodel based on the second model loss.
6. The method of claim 5, wherein determining a second model loss based on the resulting predicted text rendering directions comprises:
determining a contrast loss based on the image features corresponding to the training data and the contrast data respectively;
determining model prediction losses corresponding to the training data and the comparison data respectively based on the real text presenting directions corresponding to the training data and the comparison data respectively and the obtained predicted text presenting directions;
and determining a second model loss based on the obtained model prediction losses and the contrast loss, using the model prediction loss weight and the contrast loss weight.
7. The method of claim 5, wherein the target classification model further comprises a target feature extraction network, the target feature extraction network being trained by:
constructing a pre-training language identification model based on the initial feature extraction network;
and performing iterative training on the pre-training language identification model based on the acquired second training data set to obtain the target feature extraction network.
8. The method according to any one of claims 1-4, wherein before inputting the image to be recognized containing the text into the target classification model and obtaining the corresponding language distribution information and the original text presentation direction, the method further comprises:
acquiring an original image, and extracting at least one sub-image containing a text from the original image;
and respectively carrying out shape correction processing on each extracted sub-image based on a preset image shape, and taking any one of the sub-images obtained after the correction processing as the image to be recognized.
9. The method of any one of claims 1 to 4, wherein each annotated sample is obtained using at least one of:
acquiring each text corpus, each font format and each background image, and synthesizing each annotation sample based on each text corpus, each font format and each background image;
and acquiring each text corpus, and respectively inputting each text corpus into a target data style migration model to obtain each labeling sample.
10. A text recognition apparatus, comprising:
the image classification unit is used for inputting an image to be recognized containing a text into a target classification model to obtain corresponding language distribution information and an original text presentation direction, wherein the language distribution information contains a plurality of languages corresponding to the text and text position information corresponding to the languages;
the image correction unit is used for correcting the image to be recognized based on the original text presenting direction and a preset target character presenting direction, and taking the corrected image to be recognized as a target recognition image;
a text positioning unit, configured to determine, from the target recognition image, a text region image set corresponding to each of the plurality of languages based on the obtained text position information;
an image recognition unit, configured to recognize each obtained text region image set using the target text recognition model associated with the corresponding language, to obtain a text recognition result corresponding to the image to be recognized;
a model training unit, configured to acquire a third training data set, the third training data set comprising labeled samples and unlabeled samples; to train a first text recognition model comprising an image feature extraction network based on the labeled samples to obtain a second text recognition model, and to construct an image-text distance recognition model based on the image feature extraction network included in the second text recognition model; and to iteratively perform the following operations for the labeled samples and the unlabeled samples:
obtaining at least one labeled sample and at least one unlabeled sample, and inputting each of the at least one unlabeled sample into the second text recognition model to obtain a predicted text recognition result corresponding to each unlabeled sample;
inputting the at least one unlabeled sample and the corresponding predicted text recognition result into the image-text distance recognition model to obtain an image-text distance corresponding to the at least one unlabeled sample;
screening out, from the at least one unlabeled sample and based on the obtained image-text distances, unlabeled samples whose image-text distance is not greater than a preset distance threshold, and taking the screened-out unlabeled samples as samples to be enhanced; acquiring target enhancement samples corresponding to the samples to be enhanced, and taking the predicted text recognition results corresponding to the samples to be enhanced as sample labels of the corresponding target enhancement samples; inputting each of the obtained target enhancement samples into the second text recognition model to obtain model sub-losses corresponding to the target enhancement samples, and taking the model sub-losses corresponding to the target enhancement samples as the model sub-losses corresponding to the at least one unlabeled sample;
inputting each of the at least one labeled sample into the second text recognition model to obtain a model sub-loss corresponding to the at least one labeled sample;
and determining a third model loss based on the obtained model sub-losses, and adjusting model parameters of the second text recognition model based on the third model loss.
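For illustration only, a sketch of one iteration performed by the model training unit: unlabeled samples are pseudo-labeled by the second text recognition model, filtered by the image-text distance recognition model, enhanced, and combined with labeled samples into the third model loss. All callables (`predict`, `loss`, `augment`, the distance model) are placeholders, not an API defined by this application:

def semi_supervised_iteration(labeled_batch, unlabeled_batch,
                              text_recognizer, image_text_distance_model,
                              augment, distance_threshold=0.3):
    """One iteration of the semi-supervised loop; `labeled_batch` holds (image, label) pairs."""
    # 1. Pseudo-label each unlabeled sample with the second text recognition model.
    predicted_texts = [text_recognizer.predict(image) for image in unlabeled_batch]

    # 2. Keep only unlabeled samples whose image-text distance does not exceed the preset threshold.
    samples_to_enhance = [(image, text) for image, text in zip(unlabeled_batch, predicted_texts)
                          if image_text_distance_model(image, text) <= distance_threshold]

    # 3. Build target enhancement samples; the predicted text becomes their sample label.
    enhanced = [(augment(image), text) for image, text in samples_to_enhance]

    # 4. Model sub-losses for the enhanced (pseudo-labeled) and the labeled samples.
    unlabeled_loss = sum(text_recognizer.loss(image, text) for image, text in enhanced)
    labeled_loss = sum(text_recognizer.loss(image, label) for image, label in labeled_batch)

    # 5. Third model loss, used to adjust the parameters of the second text recognition model.
    return labeled_loss + unlabeled_loss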
11. An electronic device, comprising a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, causes the processor to carry out the steps of the method of any of claims 1 to 9.
12. A computer-readable storage medium, characterized in that it comprises a computer program for causing an electronic device to carry out the steps of the method according to any one of claims 1 to 9, when said computer program is run on said electronic device.
CN202210402933.8A 2022-04-18 2022-04-18 Text recognition method and related device Active CN114596566B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210402933.8A CN114596566B (en) 2022-04-18 2022-04-18 Text recognition method and related device
PCT/CN2023/076411 WO2023202197A1 (en) 2022-04-18 2023-02-16 Text recognition method and related apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210402933.8A CN114596566B (en) 2022-04-18 2022-04-18 Text recognition method and related device

Publications (2)

Publication Number Publication Date
CN114596566A CN114596566A (en) 2022-06-07
CN114596566B true CN114596566B (en) 2022-08-02

Family

ID=81813293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210402933.8A Active CN114596566B (en) 2022-04-18 2022-04-18 Text recognition method and related device

Country Status (2)

Country Link
CN (1) CN114596566B (en)
WO (1) WO2023202197A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780276B (en) * 2021-09-06 2023-12-05 成都人人互娱科技有限公司 Text recognition method and system combined with text classification
CN114596566B (en) * 2022-04-18 2022-08-02 腾讯科技(深圳)有限公司 Text recognition method and related device
CN114998897B (en) * 2022-06-13 2023-08-29 北京百度网讯科技有限公司 Method for generating sample image and training method of character recognition model
CN114758339B (en) * 2022-06-15 2022-09-20 深圳思谋信息科技有限公司 Method and device for acquiring character recognition model, computer equipment and storage medium


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11495014B2 (en) * 2020-07-22 2022-11-08 Optum, Inc. Systems and methods for automated document image orientation correction
CN112508015A (en) * 2020-12-15 2021-03-16 山东大学 Nameplate identification method, computer equipment and storage medium
WO2021081562A2 (en) * 2021-01-20 2021-04-29 Innopeak Technology, Inc. Multi-head text recognition model for multi-lingual optical character recognition
CN113919330A (en) * 2021-10-14 2022-01-11 携程旅游信息技术(上海)有限公司 Language identification method, information distribution method, device and medium
CN114596566B (en) * 2022-04-18 2022-08-02 腾讯科技(深圳)有限公司 Text recognition method and related device

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569830A (en) * 2019-08-01 2019-12-13 平安科技(深圳)有限公司 Multi-language text recognition method and device, computer equipment and storage medium
CN111428717A (en) * 2020-03-26 2020-07-17 京东方科技集团股份有限公司 Text recognition method and device, electronic equipment and computer readable storage medium
CN111488826A (en) * 2020-04-10 2020-08-04 腾讯科技(深圳)有限公司 Text recognition method and device, electronic equipment and storage medium
CN111898696A (en) * 2020-08-10 2020-11-06 腾讯云计算(长沙)有限责任公司 Method, device, medium and equipment for generating pseudo label and label prediction model
CN112101367A (en) * 2020-09-15 2020-12-18 杭州睿琪软件有限公司 Text recognition method, image recognition and classification method and document recognition processing method
CN113537187A (en) * 2021-01-06 2021-10-22 腾讯科技(深圳)有限公司 Text recognition method and device, electronic equipment and readable storage medium
CN112926684A (en) * 2021-03-29 2021-06-08 中国科学院合肥物质科学研究院 Character recognition method based on semi-supervised learning
CN113780276A (en) * 2021-09-06 2021-12-10 成都人人互娱科技有限公司 Text detection and identification method and system combined with text classification
CN114330483A (en) * 2021-11-11 2022-04-12 腾讯科技(深圳)有限公司 Data processing method, model training method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Language identification technology for text images; Hou Yueyun et al.; 《计算机应用》 (Computer Applications); 2006-08-28; pp. 29-31 *

Also Published As

Publication number Publication date
WO2023202197A1 (en) 2023-10-26
CN114596566A (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
RU2691214C1 (en) Text recognition using artificial intelligence
CN114596566B (en) Text recognition method and related device
CN111476284B (en) Image recognition model training and image recognition method and device and electronic equipment
CN111488826B (en) Text recognition method and device, electronic equipment and storage medium
US20190385054A1 (en) Text field detection using neural networks
US11288324B2 (en) Chart question answering
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
CN112633431B (en) Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC
CN114638914A (en) Image generation method and device, computer equipment and storage medium
CN114973222A (en) Scene text recognition method based on explicit supervision mechanism
Inunganbi et al. Handwritten Meitei Mayek recognition using three‐channel convolution neural network of gradients and gray
Tymoshenko et al. Real-Time Ukrainian Text Recognition and Voicing.
Baek et al. COO: comic onomatopoeia dataset for recognizing arbitrary or truncated texts
CN111881900B (en) Corpus generation method, corpus translation model training method, corpus translation model translation method, corpus translation device, corpus translation equipment and corpus translation medium
CN112307749A (en) Text error detection method and device, computer equipment and storage medium
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
KR102026280B1 (en) Method and system for scene text detection using deep learning
CN113837157B (en) Topic type identification method, system and storage medium
CN115130437A (en) Intelligent document filling method and device and storage medium
CN111814496B (en) Text processing method, device, equipment and storage medium
CN113129399A (en) Pattern generation
CN114399782B (en) Text image processing method, apparatus, device, storage medium, and program product
Xie et al. Enhancing multimodal deep representation learning by fixed model reuse
CN116311275B (en) Text recognition method and system based on seq2seq language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40068124

Country of ref document: HK