CN113705313A - Text recognition method, device, equipment and medium

Text recognition method, device, equipment and medium

Info

Publication number
CN113705313A
Authority
CN
China
Prior art keywords: text, sample, text recognition, model, predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110374315.2A
Other languages
Chinese (zh)
Inventor
张慧
黄珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110374315.2A priority Critical patent/CN113705313A/en
Publication of CN113705313A publication Critical patent/CN113705313A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The application provides a text recognition method, device, equipment, and medium, which relate to the technical field of artificial intelligence and are used to improve the efficiency of text recognition.

Description

Text recognition method, device, equipment and medium
Technical Field
The application relates to the technical field of computers, and in particular to the technical field of artificial intelligence, and provides a text recognition method, device, equipment, and medium.
Background
Scene text recognition refers to recognizing text in an image against its background. Because image backgrounds are complex and varied and text in images takes many forms, scene text recognition is difficult.
At present, scene text recognition proceeds as follows: image features are extracted from a text region in the image, deep time-series encoding is performed on the image features, and the result is then decoded to obtain a text recognition result. However, in this approach, deep time-series encoding must encode each part of the image features one by one, which makes text recognition inefficient.
Disclosure of Invention
The embodiments of the application provide a text recognition method, device, equipment, and medium, which are used to improve the efficiency of text recognition.
In one aspect, a text recognition method is provided, including:
acquiring an image to be recognized;
extracting a text area to be recognized from the image to be recognized;
adopting a trained target text recognition model to obtain target visual features corresponding to the text region to be recognized, and performing decoding operation on the target visual features to obtain a text recognition result of the text region to be recognized;
the trained target text recognition model is obtained by performing joint training on a text recognition model to be trained and an attention coding and decoding model, wherein the input of the attention coding and decoding model to be trained is the output of a convolutional network in the text recognition model to be trained.
An embodiment of the present application provides a text recognition apparatus, including:
the image acquisition module is used for acquiring an image to be recognized;
the region extraction module is used for extracting a text region to be recognized from the image to be recognized;
the text recognition module is used for acquiring a target visual feature corresponding to the text region to be recognized by adopting a trained target text recognition model, and decoding the target visual feature to acquire a text recognition result of the text region to be recognized;
the trained target text recognition model is obtained by performing joint training on a text recognition model to be trained and an attention coding and decoding model, wherein the input of the attention coding and decoding model to be trained is the output of a convolutional network in the text recognition model to be trained.
In a possible embodiment, the apparatus further includes a model training module, wherein the model training module is configured to train to obtain the target text recognition model by:
extracting corresponding sample text regions from each sample image in the sample image set respectively to obtain a sample text region set;
performing multiple rounds of iterative training on the text recognition model to be trained based on the sample text region set until a model convergence condition is met, wherein the text recognition model further comprises a decoding network, and each round of iterative training comprises the following operations:
inputting each sample text region selected from the sample text region set into the convolutional network, and respectively extracting the respective sample visual feature of each sample text region;
respectively inputting the obtained sample visual features into the decoding network and the attention coding and decoding model, and respectively obtaining a first predicted text label distribution and a second predicted text label distribution corresponding to each sample text region;
determining a first training loss of the text recognition model to be trained based on the obtained first predicted text label distribution, and determining a second training loss of the attention coding and decoding model based on the obtained second predicted text label distribution;
determining a joint training loss based on the first training loss and the second training loss, and performing parameter adjustment on the text recognition model based on the joint training loss.
In a possible embodiment, the model training module is specifically configured to:
based on the decoding network, respectively performing the following operations on each column element contained in each sample visual feature:
performing a linear operation on one column of elements among the columns of elements to obtain a probability distribution of that column over preset label categories, wherein the preset label categories comprise a plurality of characters and an invalid output symbol;
determining the prediction label to which that column of elements belongs based on the probability distribution;
and obtaining a first predicted text label distribution corresponding to each sample visual feature based on the prediction labels corresponding to the columns of elements contained in that sample visual feature.
In a possible embodiment, the model training module is specifically configured to:
based on the attention coding and decoding model, respectively executing the following operations for the sample visual features:
performing bidirectional encoding on one sample visual feature among the sample visual features to obtain sample text semantic features corresponding to the sample text region; and
and decoding the semantic features of the sample text by using an attention mechanism to obtain a second predicted text label distribution corresponding to the sample text region.
In a possible embodiment, the model training module is further configured to:
before inputting each sample text region selected from the sample text region set into the convolutional network and respectively extracting the respective sample visual feature of each sample text region, selecting each sample text region from the sample text region set;
and respectively executing the following operations aiming at the selected sample text regions:
scaling one sample text region among the sample text regions in equal proportion to obtain a scaled sample text region, wherein a first size of the scaled sample text region in a first direction is a first preset size;
and padding the scaled sample text region along a second direction until a second size of the scaled sample text region in the second direction reaches a second preset size, wherein the first direction and the second direction are perpendicular to each other.
In a possible embodiment, the model training module is specifically configured to:
for each first predictive text label distribution, respectively performing the following operations:
determining a forward probability and a backward probability corresponding to each predicted text label in a first predicted text label distribution by using dynamic programming, wherein the first predicted text label distribution is one of the first predicted text label distributions;
obtaining an initial training loss corresponding to the distribution of the first predicted text label based on the forward probability and the backward probability corresponding to each predicted text label;
and determining a first training loss of the text recognition model to be trained based on the obtained initial training losses.
In a possible embodiment, the model training module is specifically configured to:
for each predicted text label in the first predicted text label distribution, performing the following operations:
determining the probability sum of each prefix sub-path passing through a predicted text label in the first predicted text label distribution at a preset time in each candidate path to obtain the forward probability corresponding to the predicted text label;
and determining the sum of the probabilities of the suffix sub-paths of the predicted text label passing through the predicted text label at the preset time in each candidate path to obtain the backward probability corresponding to the predicted text label.
In a possible embodiment, the model training module is specifically configured to:
based on a first weight, performing weighted summation on the first training loss and the second training loss to obtain a joint training loss;
when the parameters of the text recognition model are adjusted based on the joint training loss, the method further comprises:
and adjusting the value of the first weight.
In a possible embodiment, the text recognition module is specifically configured to:
decoding the target visual features to obtain a target text label distribution corresponding to the text region to be recognized;
and fusing the weight of the pre-trained language model and the target text label distribution to obtain the text recognition result.
In a possible embodiment, the text recognition module is specifically configured to:
and fusing the weights of the pre-trained language model and the target text label distribution based on a weighted finite-state transducer to obtain a text recognition result.
An embodiment of the present application provides a computer device, including:
at least one processor, and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the at least one processor implementing any of the text recognition methods as previously discussed by executing the instructions stored by the memory.
Embodiments of the present application provide a computer storage medium having stored thereon computer instructions that, when executed on a computer, cause the computer to perform any of the text recognition methods as discussed above.
Due to the adoption of the technical scheme, the embodiment of the application has at least the following technical effects:
In the embodiments of the application, when text in an image is recognized, the visual features of the text region in the image are extracted and decoded, so the text can be recognized without deep time-series encoding of the visual features; this relatively improves the efficiency of text recognition and reduces the resources it consumes. In addition, the text recognition model used for text recognition is obtained through joint training with the attention coding and decoding model, whose input is the output of the convolutional network in the text recognition model. The attention coding and decoding model therefore supervises the visual features extracted by the text recognition model, which improves the accuracy of those visual features and thus ensures the accuracy of text recognition.
Drawings
Fig. 1A is a schematic view of an application scenario of a text recognition method according to an embodiment of the present application;
fig. 1B is a schematic view of an application scenario of a text recognition method according to an embodiment of the present application;
fig. 2 is a flowchart of a text recognition method according to an embodiment of the present application;
FIG. 3 is a diagram illustrating an example of a text recognition process provided by an embodiment of the present application;
fig. 4 is a schematic structural diagram of a text recognition model according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a principle of decoding a network output in a text recognition model according to an embodiment of the present application;
FIG. 6 is an exemplary diagram of a WFST representation of a language model provided by an embodiment of the present application;
FIG. 7 is a flowchart of a method for training a text recognition model according to an embodiment of the present application;
FIG. 8A is a schematic diagram illustrating a training text recognition model according to an embodiment of the present disclosure;
FIG. 8B is an exemplary diagram of multiple paths that may be used to generate a first predictive text label distribution according to an embodiment of the present disclosure;
FIG. 9 is a flowchart of a method for training a text recognition model according to an embodiment of the present disclosure;
fig. 10 is a diagram of an interaction process between a terminal and a server according to an embodiment of the present application;
fig. 11 is a diagram illustrating a process of a text recognition method according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to better understand the technical solutions provided by the embodiments of the present application, the following detailed description is made with reference to the drawings and specific embodiments.
To facilitate better understanding of the technical solutions of the present application for those skilled in the art, the following terms related to the present application are introduced.
1. Image: an image in the present application refers to an image having a text region, that is, a text exists in the image, unless otherwise specified. For convenience of description, an image to be subjected to text recognition is referred to as an image to be recognized, and an image used for training a text recognition model is referred to as a sample image.
2. Visual features: also referred to as visual feature coding, these represent the characteristics of a text region in an image that are presented at the visual level, such as the shape, contour, and texture of the characters in the text region. Visual features are obtained by applying preset processing to the text region in the image, such as convolution and/or pooling. For convenience of description, the visual features corresponding to the image to be recognized are referred to as target visual features, and the visual features corresponding to a sample image are referred to as sample visual features.
3. Connectionist Temporal Classification (CTC) model: a model that can be used for scene text recognition; it comprises, in sequence, a convolutional network, a deep encoding network, and a decoding network, where the deep encoding network includes a Long Short-Term Memory (LSTM) network.
4. Attention coding and decoding model: a model that can be used for scene text recognition; it includes a convolutional network, a deep encoding network, and a decoding network based on an attention mechanism. Because the attention mechanism is introduced into the decoding network, the correlation between the decoding results and the output of the deep encoding network is improved, which can improve the accuracy of text recognition.
5. Weighted Finite-State Transducer (WFST): a type of Finite Automaton (FA). An FA consists of five elements (A, Q, E, I, F), where Q is the set of state nodes, A is the label set, i.e. the symbols on the arcs between nodes, and E is the set of transfer functions; an edge between two state nodes, together with the label and weight on that edge, forms a transition. By following transitions in the FA from a search start point to a search end point, a decoding process is realized. A WFST is a finite-state transducer that includes a plurality of transitions, each transition having an input symbol, an output symbol, and a weight. In decoding, the labels, the lexicon, and the language model can each be represented as separate WFSTs, which are efficiently fused into a single search graph using a highly optimized FST library, such as OpenFST, to implement label decoding.
6. Predicted text label distribution and real text label distribution: the predicted text label distribution refers to the characters the model predicts for each position in a text region, and the real text label distribution refers to the true characters at each position in the text region; for example, for a text region whose real text label distribution is "happy", the predicted text label distribution is the model's estimate of that text and may or may not match it exactly. For convenience of description, the predicted text label distribution output by the text recognition model is referred to as the first predicted text label distribution, and the predicted text label distribution output by the attention coding and decoding model is referred to as the second predicted text label distribution.
7. Cloud technology: a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like applied under the cloud computing business model; it allows resources to be pooled and used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical network systems require large amounts of computing and storage resources, for example video websites, picture websites, and many portal websites. With the development of the internet industry, each item may come to have its own identification mark that needs to be transmitted to a background system for logical processing; data at different levels are processed separately, and all kinds of industry data require strong backend system support, which can only be realized through cloud computing.
8. Cloud computing: a computing model that distributes computing tasks over a resource pool formed by a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". To users, the resources in the "cloud" appear infinitely expandable and can be obtained at any time, used on demand, expanded at any time, and paid for according to use.
9. Artificial Intelligence (AI): the theory, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision making.
Artificial intelligence is a comprehensive discipline covering a wide range of technologies at both the hardware and software levels. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
10. Computer Vision technology (Computer Vision, CV): computer vision is the science of how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to identify, track, and measure targets, and further processes the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
11. Machine Learning (ML): a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or realize human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
12. n-gram: n words occurring consecutively in a text. An n-gram based language model expresses the probability of a sentence by combining the occurrence probabilities of the words in the sentence.
In the current text recognition approach, deep time-series encoding must be performed on image features, so the efficiency of text recognition is relatively low. In view of this, embodiments of the present application provide a text recognition method, device, equipment, and medium. The design concept of the text recognition method in the embodiments of the present application is introduced below.
In the embodiments of the application, the visual features of the image are extracted by the text recognition model and then decoded to obtain a text recognition result, without performing deep time-series encoding on the visual features; this simplifies the text recognition process and improves text recognition efficiency. During training, the text recognition model is supervised by the attention coding and decoding model, so that it learns more effective visual features; this improves its recognition accuracy, including on difficult scene text such as text on complex backgrounds. Moreover, since deep encoding of the visual features is not required, hardware resource consumption is relatively reduced, for example video memory consumption.
Based on the above design concept, an application scenario of the text recognition method according to the embodiment of the present application is described below.
Referring to fig. 1A, which shows a first application scenario of the text recognition method according to the embodiment of the present application, the scenario includes: a terminal 110 and a widget 111 running in the terminal 110.
The widget 111 is, for example, a code segment provided in the terminal 110, and it can run independently, relying directly on the hardware resources of the terminal 110. For example, the terminal 110 obtains the image to be recognized 100 according to an input operation performed by the user in the widget 111. The terminal 110 recognizes the text in the image to be recognized 100 to obtain a text recognition result and presents the result in the widget 111. The specific process of recognizing text is discussed below.
Referring to fig. 1B, a schematic view of an application scenario of a text recognition method according to an embodiment of the present application is shown, where the scenario includes: a terminal 110, a server 120, and an application 112 running in the terminal 110.
Unlike the widget 111 in fig. 1A, the application 112 in fig. 1B depends on a server to run; the server 120 can be understood as the server corresponding to the application 112, and the application 112 can be a web page, an application preinstalled in the terminal 110, an applet, or the like.
In the scenario shown in fig. 1B, after the terminal 110 obtains the image to be recognized through the application 112, the image to be recognized may be fed back to the server 120. After receiving the image to be recognized, the server 120 recognizes a text in the image to be recognized, and feeds back a text recognition result to the terminal 110.
It should be noted that the text recognition method in the embodiment of the present application may be applied to various specific application scenarios. For example, it may be applied to intelligent driving: during navigation, an image may be input, and a terminal or server automatically recognizes the text in the image to obtain the destination of the current trip. The text recognition method may also be applied to virtual reality, augmented reality, smart home, smart office, smart wearables, intelligent transportation, smart cities, unmanned aerial vehicles, robots, and other related scenarios; the present application does not limit the specific usage scenario of the text recognition method.
The terminal 110 is an electronic device used by a user, and the electronic device may be a computer device having a certain computing capability and running software and a website, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a game device, a smart television, or a smart wearable device. The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The server 120 may be implemented using cloud computing.
In one possible application scenario, the server 120 may deploy servers in different regions to reduce communication delay between devices; or, to balance the load on the server 120, different servers may serve different terminals respectively.
For example, a plurality of servers can share data through a blockchain, which is equivalent to forming a data sharing system. For example, a terminal is located at a site a and is in communication connection with one server, and another terminal is located at a site b and is in communication connection with a server other than the one server among the plurality of servers.
Each server in the data sharing system has a node identifier corresponding to the server, and each server in the data sharing system can store node identifiers of other servers in the data sharing system, so that the generated blocks can be broadcast to the other servers in the data sharing system according to the node identifiers of the other servers. Each server may maintain a node identifier list as shown in the following table, and store the server name and the node identifier in the node identifier list correspondingly. The node identifier may be an Internet Protocol (IP) address and any other information that can be used to identify the node, and only the IP address is used as an example in table 1.
TABLE 1
Server name    Node identification
Node 1         119.115.151.173
Node 2         118.116.189.145
Node N         119.124.789.258
The text recognition method provided by the exemplary embodiments of the present application is described below with reference to the accompanying drawings and in conjunction with the application scenarios described above. It should be noted that the above application scenarios are only shown to facilitate understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect.
The text recognition method in the embodiment of the present application is described below by taking the example in which the terminal in fig. 1A executes the text recognition method. Please refer to fig. 2, which is a flowchart of a text recognition method in an embodiment of the present application, where the flowchart includes:
and S21, acquiring the image to be recognized.
And S22, extracting the text area to be recognized from the image to be recognized.
The image to be recognized contains text. After obtaining the image to be recognized, the terminal can segment the text region to be recognized from it, for example by detecting the text region using object detection. The image to be recognized may include one or more text regions, and each text region may be regarded as a text region to be recognized.
To improve the accuracy of subsequent text recognition, as an embodiment, the terminal may also determine each text line in the text region to be recognized and treat each text line as a separate text region to be recognized, recognizing the text recognition result corresponding to that line.
For example, referring to fig. 3, in an example of a text recognition process provided in this embodiment of the application, the terminal may obtain an image 310 to be recognized as shown in (1) in fig. 3 according to an input operation of a user, and the terminal segments a text region in the image 310 to be recognized, so as to obtain a text region 320 to be recognized as shown in (2) in fig. 3.
And S23, acquiring target visual characteristics corresponding to the text region to be recognized by adopting the trained target text recognition model, wherein the target text recognition model is acquired by performing joint training on the text recognition model to be trained and the attention coding and decoding model, and the input of the attention coding and decoding model to be trained is the output of a convolutional network in the text recognition model to be trained.
The terminal can input the text region to be recognized into the convolutional network in the trained target text recognition model and extract features of the text region through the convolutional network, thereby obtaining the target visual features of the text region to be recognized. The trained target text recognition model can be pre-trained by the terminal or obtained from the server. The target text recognition model is obtained by jointly training a text recognition model to be trained and an attention coding and decoding model. The text recognition model to be trained includes a convolutional network, whose output is input into the attention coding and decoding model; the attention coding and decoding model performs encoding and decoding operations on this output, thereby supervising the training of the text recognition model so that its convolutional network can learn deeper visual features.
And S24, decoding the target visual characteristics by adopting the target text recognition model to obtain a text recognition result of the text region to be recognized.
After the target visual features are extracted, the terminal may perform a decoding operation on the target visual features, so as to obtain a text recognition result corresponding to the text region to be recognized.
For example, with continued reference to the example shown in fig. 3, after performing feature extraction and decoding operations on the text region to be recognized 320 shown in (2) in fig. 3, the terminal obtains a text recognition result 330 of the image to be recognized as shown in (3) in fig. 3, specifically, "terrestrial vegetation" as shown in (3) in fig. 3.
In the embodiment of the application, the terminal sequentially performs feature extraction and decoding on the text region to be recognized through the target text recognition model, so the text in the region can be recognized without deep encoding of the region, which improves text recognition efficiency. Because the text recognition model is supervised during training by the attention coding and decoding model, the convolutional network in the text recognition model can learn deep features of the text region, which ensures the accuracy of the target text recognition model.
The structure of the trained target text recognition model is the same as that of the text recognition model to be trained; only the parameters corresponding to each structure differ. Thus, since the text recognition model to be trained comprises a convolutional network, the corresponding target text recognition model also comprises a convolutional network. In addition to the convolutional network, in one possible embodiment the target text recognition model may also comprise a decoding network.
Referring to fig. 4, a schematic structural diagram of a target text recognition model according to an embodiment of the present application is shown, where a target text recognition model 400 in the schematic structural diagram includes a convolutional network 410 and a decoding network 420, where the convolutional network 410 is configured to perform feature extraction on a text region to be recognized to obtain a target visual feature of the text region to be recognized, and decode the target visual feature through the decoding network 420 to obtain a target text recognition result.
When feature extraction is performed on the text region to be recognized, what is actually extracted are the receptive fields corresponding to different sub-regions of the text region; in other words, the obtained target visual features reflect the receptive fields corresponding to different sub-regions of the text region to be recognized. The convolutional network 410 may perform convolution operations, pooling operations, and the like on the text region to be recognized so as to extract its visual features.
As an embodiment, the convolutional network 410 may be implemented with various convolutional neural networks. For example, the convolutional network 410 may adopt a ResNet-34 network; the residual structure in ResNet-34 can extract deeper and more accurate visual features, which helps extract visual features of text regions against complex backgrounds and relatively improves the generalization performance of the text recognition model.
For example, the specific structure of the ResNet-34 network can be found in Table 2 below:
TABLE 2
[Table 2, giving the layer-by-layer structure of the ResNet-34 network, is presented as images (BDA0003010556000000091 and BDA0003010556000000101) in the original publication.]
The ResNet-34 network comprises 32 convolutional units distributed across a first convolution layer Conv1_x, a first convolution-pooling layer Conv2_x, a second convolution-pooling layer Conv3_x, a third convolution-pooling layer Conv4_x, and a second convolution layer Conv5_x. The parameters of each layer correspond, respectively, to the kernel size of that layer, the strides in the first and second directions, the padding size, and the number of channels. For example, for the first convolution sublayer in the first convolution layer Conv1_x, the convolution kernel is 3 × 3, the strides in the first and second directions are 1 × 1, the padding size is 1 × 1, the number of channels is 32, and so on.
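For illustration only, the following minimal Python sketch (assuming PyTorch and torchvision, which are not mentioned in the specification) shows how a standard ResNet-34 trunk could serve as the convolutional network 410, turning a text-region image into one visual feature vector per column; the exact layer parameters of Table 2 are only available as an image, so the stock torchvision layers stand in for them.

# A minimal sketch (not the embodiment's exact network) of a ResNet-34 trunk
# used as convolutional network 410: it maps a text-region image to a sequence
# of per-column visual features suitable for column-wise decoding.
import torch
import torch.nn as nn
from torchvision.models import resnet34

class VisualFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        trunk = resnet34(weights=None)
        # Keep everything up to the last residual stage; drop avgpool and fc.
        self.backbone = nn.Sequential(*list(trunk.children())[:-2])
        # Collapse the height axis so each remaining column is one time step.
        self.pool_height = nn.AdaptiveAvgPool2d((1, None))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W), e.g. text regions scaled to a fixed height
        feats = self.backbone(images)              # (batch, C, H', W')
        feats = self.pool_height(feats)            # (batch, C, 1, W')
        return feats.squeeze(2).permute(0, 2, 1)   # (batch, W', C): one vector per column

# Example usage with a batch of two 32x128 text-region crops.
if __name__ == "__main__":
    extractor = VisualFeatureExtractor()
    x = torch.randn(2, 3, 32, 128)
    print(extractor(x).shape)  # e.g. torch.Size([2, 4, 512])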
Since the ResNet-34 network has a size requirement on its input image, the text region to be recognized may be preprocessed before it is input into the ResNet-34 network. In this embodiment of the application, the terminal may scale the text region to be recognized in equal proportion until its first size in a first direction reaches a first preset size, where the first direction is, for example, the height direction of the text region.
In the embodiment of the application, scaling the text region to be recognized in equal proportion lets the scaled region meet the input size requirement of the convolutional network without changing the distribution of image information in the region; that is, the image information distribution of the text region to be recognized is preserved, which facilitates accurate recognition of the text in that region subsequently.
After the convolutional network obtains the target visual features, the target visual features may be input into a decoding network, and the decoding network may perform a decoding operation on the target visual features.
As an embodiment, the decoding network 420 in the text recognition model may be implemented in various ways. For example, the decoding network 420 may include a linear layer that linearly transforms the target visual features output by the convolutional network 410 into the corresponding decoding result; for example, the decoding network 420 may adopt the decoding network of the CTC model, whose structure was discussed above and is not repeated here.
The target visual feature may take the form of a matrix, where each column of elements corresponds to the receptive field of a character or of the background in the target text region (the background being the region other than the characters). The decoding network may therefore decode each column of elements in the target visual feature to obtain a decoding result per column. For example, if a column of elements corresponds to a character in the target text region, decoding that column yields the character; if a column corresponds to the background, decoding it yields an invalid output symbol; and if two columns of elements correspond to the same character, decoding them may yield the same character twice. Thus, after the terminal decodes each column of elements, the result for each column is either a character or an invalid output symbol, and the terminal can obtain the text recognition result from these per-column results. An invalid output symbol indicates that no character corresponds to that column of elements and may, for example, be denoted by "-".
For example, the terminal sequentially combines the characters or invalid output symbols corresponding to the columns of elements to obtain a target text label distribution, from which the terminal may obtain a text recognition result. The following examples describe ways of obtaining the text recognition result:
Example one: the terminal deletes the invalid output symbols in the target text label distribution to obtain the text recognition result.
For example, referring to fig. 5, the target visual feature is [a1, a2, a3, a4]. The terminal decodes a1, a2, a3, and a4 through the decoding network, obtaining the target text label distribution: s, a, a, - (where "-" is the invalid output symbol). Deleting the invalid output symbols from the target text label distribution yields the text recognition result, specifically "saa".
Example two: the terminal may delete an invalid outsider in the target text label distribution and delete one of two adjacent repeated characters to obtain a text recognition result.
Because two adjacent columns of elements correspond to the same character, the terminal can delete the invalid output character in the target text label distribution and delete one of two adjacent repeated characters, so that a text recognition result is obtained.
For example, continuing with the example of fig. 5, the terminal obtains the target text label distribution as: s, a, and a, the invalid exporter in the target text label distribution and one of the adjacent repeated characters are deleted, thereby obtaining a text recognition result, specifically "sa".
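Before turning to example three, the two rules above can be illustrated with a minimal Python sketch; it is for illustration only, and the label sequence and the use of "-" as the invalid output symbol follow the example of fig. 5.

# A minimal sketch of the post-processing rules of examples one and two.
def decode_delete_blanks(labels):
    """Example one: simply delete invalid output symbols."""
    return "".join(ch for ch in labels if ch != "-")

def decode_merge_then_delete(labels):
    """Example two: merge adjacent repeated characters, then delete invalid output symbols."""
    merged = []
    for ch in labels:
        if not merged or ch != merged[-1]:
            merged.append(ch)
    return "".join(ch for ch in merged if ch != "-")

print(decode_delete_blanks(["s", "a", "a", "-"]))      # -> "saa"
print(decode_merge_then_delete(["s", "a", "a", "-"]))  # -> "sa"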
Example three: and fusing the weight of the pre-trained language model and the target text label distribution to obtain a text recognition result.
When the terminal obtains the text recognition result according to the target text label distribution, if one of two adjacent repeated characters in the target text label distribution is deleted, the obtained text recognition result may be inaccurate, for example, when the text itself has two adjacent repeated characters, one of two adjacent repeated characters in the target text label distribution is directly deleted, the obtained text recognition result is inaccurate, and therefore, in order to improve the accuracy of the text recognition result, in the embodiment of the present application, the terminal decodes the target text label distribution by means of the weight of the pre-trained language model, so that the text recognition result more conforming to the semantic features is obtained.
The pre-trained language model is obtained by training based on a large amount of linguistic data in advance, the language model in the embodiment of the application can adopt a n-gram-based language model, and the n-gram-based language model can represent the probability of a sentence. Generally, a sentence is segmented and then modeled, after the sentence is segmented, the probability of occurrence of a sentence can be expressed as the product of the probabilities of occurrence of each word in the current context, so as to construct an expression of a language model, wherein the expression of the language model is exemplified as follows:
P(W)=P(w1,w2…wn)
where w1, w2 … wn are the probabilities of each word in the sentence appearing in the current context, respectively.
The terminal can input the target text label into the language model in a distributed mode to obtain a text recognition result.
Because the language model learns corresponding sentences in advance, the text recognition result which is more consistent with semantic features can be output according to the target text labels through the language model.
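For illustration only, the following minimal Python sketch scores a sentence with a bigram (n = 2) language model in the spirit of the formula above; the tiny corpus and the add-one smoothing are illustrative assumptions, not the model used in the embodiment.

# A minimal sketch of P(W) = P(w1) * P(w2|w1) * ... computed from bigram counts.
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog", "sat"]]
unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
vocab_size = len(unigrams)
total = sum(unigrams.values())

def sentence_probability(words):
    # P(w1) times the chain of add-one-smoothed bigram probabilities.
    prob = (unigrams[words[0]] + 1) / (total + vocab_size)
    for prev, cur in zip(words, words[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return prob

print(sentence_probability(["the", "cat", "sat"]))   # seen sentence: higher probability
print(sentence_probability(["the", "dog", "sat"]))   # unseen bigram: lower probability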
The terminal can also fuse the pre-trained language model with the target text label distribution via a WFST to obtain a text recognition result, specifically as follows: S1.1, convert the pre-trained language model, the dictionary, and the target text label distribution into corresponding WFST representations; S1.2, fuse the obtained WFST representations into a single search graph and search over that graph to obtain the text recognition result.
The dictionary contains the spellings of a plurality of words. The pre-trained language model is converted into a WFST representation, for example, by setting the input and output on each arc of the WFST to a word and setting the weight to the probability of the word appearing on the current path. Similarly, the dictionary and the target text label distribution can be converted into WFST representations. After converting all three into corresponding WFSTs, the terminal can combine them and apply two special WFST operations, determinization and minimization, to the combined result to compress the search space and speed up decoding, generating a search graph; a search over this graph then yields the text recognition result.
For example, referring to fig. 6, which is an example of a WFST representation of the language model, at state node 0 the input symbol is "going", the output symbol is "going", and the weight (probability) is 1, and so on.
In the embodiment of the application, fusing the language model via a WFST effectively mines the contextual semantic information of the characters, so that the text recognition result better conforms to semantic features, improving the accuracy of the obtained text recognition result.
The target text recognition model in the embodiment of the present application is obtained through training in advance, and an example of a training process of the target text recognition model is described below with reference to a method flowchart for training the text recognition model shown in fig. 7:
and S71, extracting corresponding sample text regions from each sample image in the sample image set respectively, and obtaining a sample text region set.
The terminal can obtain sample images from network resources or from a server; each sample image contains a text region. The terminal extracts the sample text region in each sample image using object detection, obtaining a plurality of sample text regions; for convenience of description, these are referred to as the sample text region set.
As an embodiment, the sample image set contains at least two images with different backgrounds. That is, images with various backgrounds exist among the sample images, so the training data has richer backgrounds; this improves the text recognition model's ability to recognize text against different backgrounds and thus its generalization ability.
S72, based on the sample text region set, performing multiple rounds of iterative training on the text recognition model to be trained until a model convergence condition is met, wherein the text recognition model further comprises a decoding network, and each round of iterative training comprises the following operations:
s721, inputting each sample text area selected from the sample text area set into a convolution network, and respectively extracting the respective sample visual characteristics of each sample text area;
s722, respectively inputting the obtained visual characteristics of each sample into a decoding network and an attention coding and decoding model, and respectively obtaining a first predicted text label distribution and a second predicted text label distribution which respectively correspond to each sample text region;
s723, determining a first training loss of the text recognition model to be trained based on the obtained first predicted text label distribution, and determining a second training loss of the attention coding and decoding model based on the obtained second predicted text label distribution;
and S724, determining the joint training loss based on the first training loss and the second training loss, and adjusting parameters of the text recognition model based on the joint training loss.
After obtaining the sample text region set, the terminal may perform multiple rounds of iterative training on the text recognition model based on the sample text region set. In each iteration training, batch training may be adopted, that is, each iteration training adopts at least two sample text regions to train the text recognition model to be trained.
Specifically, the text recognition model to be trained comprises a convolutional network and a decoding network, and in each iteration training, feature extraction is respectively carried out on each sample text region through the convolutional network in the text recognition model to be trained to obtain a sample visual feature corresponding to each sample text region; and respectively carrying out decoding operation on the visual features of each sample through a decoding network in the text recognition model to be trained, and respectively obtaining the first predicted text label distribution corresponding to the visual features of each sample. Meanwhile, the visual characteristics of each sample are input into the attention coding and decoding model, and coding operation and decoding operation are sequentially carried out on the visual characteristics of each sample through the attention coding and decoding model, so that second predicted text label distribution corresponding to text regions of each sample is obtained.
And the terminal determines a first training loss corresponding to the text recognition model according to the distribution of each first predicted text label, and determines a second training loss corresponding to the attention coding and decoding model according to the distribution of each second predicted text label. And the terminal calculates the joint training loss according to the first training loss and the second training loss, and adjusts the parameters of the text recognition model by using the joint training loss.
In the embodiment of the application, the attention coding and decoding model is utilized to carry out combined training on the text recognition model, so that the visual characteristics of the convolution network learning in the text recognition model with text recognition significance can be supervised, the output accuracy of the text recognition model is ensured, the structure of the text recognition model is simplified, and the efficiency of subsequent text recognition is improved.
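For illustration only, the following minimal Python sketch (assuming PyTorch) shows one iteration of such joint training: the shared convolutional features feed a linear decoding branch scored with a CTC loss and an attention branch scored with cross-entropy, and the two losses are combined with a weight. The module names, interfaces, blank index, and padding convention are assumptions, not the embodiment's exact design.

# A minimal sketch of one joint-training iteration over shared visual features.
import torch
import torch.nn as nn

def joint_training_step(recognition_model, attention_branch, optimizer,
                        images, ctc_targets, target_lengths, attn_targets,
                        lambda_attn=1.0):
    # ctc_targets: 1-D tensor of concatenated character indices for the batch.
    # attn_targets: (batch, T) character indices padded with -100 to the
    #               attention decoder's step count T.
    # 1. Shared convolutional features: (batch, columns, channels).
    visual_feats = recognition_model.extract_features(images)

    # 2. First branch: per-column label distribution, scored with CTC loss.
    logits = recognition_model.decode(visual_feats)              # (batch, W, classes)
    log_probs = logits.log_softmax(dim=-1).permute(1, 0, 2)      # (W, batch, classes)
    input_lengths = torch.full((images.size(0),), log_probs.size(0), dtype=torch.long)
    ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)(
        log_probs, ctc_targets, input_lengths, target_lengths)

    # 3. Second branch: attention encoder-decoder over the same features,
    #    scored with cross-entropy against the character targets.
    attn_logits = attention_branch(visual_feats)                 # (batch, T, classes)
    attn_loss = nn.CrossEntropyLoss(ignore_index=-100)(
        attn_logits.flatten(0, 1), attn_targets.flatten())

    # 4. Weighted joint loss and parameter update.
    loss = ctc_loss + lambda_attn * attn_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()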
When the text recognition model is trained with the sample text regions, as an embodiment, because the text recognition model may have a size requirement on its input, the terminal may preprocess each sample text region to obtain sample text regions that meet the model's size requirement.
The terminal scales each sample text region in equal proportion until its first size in the first direction reaches a first preset size, and pads each scaled sample text region along the second direction until its second size in the second direction reaches a second preset size.
Wherein the first direction and the second direction are perpendicular to each other, for example, the first direction is a height direction of the sample text region, and the second direction is a width direction of the sample text region.
In the embodiment of the application, the sizes of the sample text regions in each round of iterative training can be kept the same, the input size requirement of the text recognition model is met, the distribution of image information in the sample text regions is not changed, and the subsequent more accurate recognition of the text in the sample text regions is facilitated.
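For illustration only, the following minimal Python sketch (assuming OpenCV and NumPy) shows the preprocessing described above: equal-proportion scaling until the height reaches a preset size, then padding along the width until a preset width is reached. The preset sizes 32 and 256 are illustrative assumptions.

# A minimal sketch of scale-then-pad preprocessing for a sample text region.
import cv2
import numpy as np

def scale_and_pad(region: np.ndarray, target_h: int = 32,
                  target_w: int = 256) -> np.ndarray:
    h, w = region.shape[:2]
    new_w = max(1, int(round(w * target_h / h)))   # equal-proportion scaling
    scaled = cv2.resize(region, (new_w, target_h))
    if new_w >= target_w:                          # already wide enough: fit the width
        return cv2.resize(scaled, (target_w, target_h))
    pad = np.zeros((target_h, target_w - new_w) + region.shape[2:],
                   dtype=region.dtype)
    return np.concatenate([scaled, pad], axis=1)   # pad along the second (width) direction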
As an embodiment, a specific implementation example of obtaining the first predictive text label distribution in S722 is as follows:
S2.1, based on the decoding network, performing the following operations on each column of elements contained in each sample visual feature: performing a linear operation on one column of elements among the columns to obtain a probability distribution of that column over preset label categories, wherein the preset label categories include a plurality of characters and an invalid output symbol, and determining the prediction label to which that column of elements belongs based on the probability distribution;
and S2.2, obtaining the first predictive text label distribution corresponding to each sample visual feature based on the predictive label corresponding to each row of elements contained in each sample visual feature.
As discussed above, the form of the sample visual features is a matrix, and each column in the matrix corresponds to a receptive field corresponding to a character or a background corresponding to a sample text region, so that the terminal may perform a linear operation, such as a normalization operation, on each column of elements included in each sample visual feature through the decoding network, so as to obtain a probability distribution of the column of elements on a preset tag category, where the preset tag category includes a plurality of characters and invalid output symbols, and equivalently, obtain a probability that the column of elements belongs to each character and invalid output symbol, and may use the character or invalid output symbol corresponding to the highest probability as a prediction tag corresponding to the column of elements.
After the prediction labels corresponding to the columns of elements contained in one sample visual feature are obtained, the predicted text labels corresponding to those columns can be combined to obtain the initial predicted text label distribution corresponding to the sample visual feature. The terminal can then delete the invalid output symbols in the initial predicted text label distribution and delete one of every two adjacent repeated characters, so as to obtain the first predicted text label distribution corresponding to the sample visual feature. In the same way, the first predicted text label distribution corresponding to each sample visual feature can be obtained.
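Purely as an illustration, the per-column prediction and the subsequent merging described above can be sketched as follows; the softmax normalization, the argmax selection and the use of index 0 for the invalid output symbol are assumptions of this sketch rather than requirements of the application.

import numpy as np

BLANK = 0  # assumed index of the invalid output symbol

def greedy_labels(column_logits: np.ndarray) -> list[int]:
    # column_logits: (T, num_classes), one row of linear outputs per column.
    # Normalize each row into a probability distribution over the preset
    # label categories, then take the most probable label per column.
    probs = np.exp(column_logits - column_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1).tolist()

def collapse(initial_labels: list[int]) -> list[int]:
    # Delete one of every two adjacent repeated labels, then delete the
    # invalid output symbols, yielding the first predicted label sequence.
    merged, prev = [], None
    for lab in initial_labels:
        if lab != prev:
            merged.append(lab)
        prev = lab
    return [lab for lab in merged if lab != BLANK]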
For example, referring to fig. 8A, which is a schematic diagram of the principle of training a text recognition model according to an embodiment of the present application, the convolutional network includes the units shown as C1-C5 in fig. 8A. After each sub-region in the text region to be recognized passes through the convolutional network, f1, f2, f3 and f4 shown in fig. 8A are output respectively; in fig. 8A, the real text label distribution of the text region to be recognized is taken to be "STATE" as an example.
As an embodiment, a specific implementation manner of obtaining the second predictive text label distribution in S722 is as follows:
s3.1, respectively carrying out bidirectional coding on each sample visual feature based on the attention coding and decoding model to obtain sample text semantic features corresponding to each sample visual feature;
and S3.2, respectively decoding semantic features of each sample text by using an attention mechanism to obtain second predicted text label distribution corresponding to each sample text region.
The terminal performs bidirectional coding on each column of elements in each sample visual feature through the coding network in the attention coding and decoding model. When a sample visual feature is coded bidirectionally, the coding result corresponding to one column of elements depends on the information of the columns before and after it, so the relevance among the columns of elements can be deeply mined to improve the accuracy of the obtained text semantic features. Bidirectional coding may be implemented in various ways; for example, a Bidirectional Long Short-Term Memory (Bi-LSTM) network may be used as the coding network, or a bidirectional Transformer network may be used as the coding network.
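For illustration only, a Bi-LSTM coding network of the kind mentioned above might be sketched in PyTorch as follows; the feature dimension, hidden size and number of layers are assumptions made for the sketch.

import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        # bidirectional=True lets the coding result of each column depend on
        # the columns before and after it, as described above.
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, T, feat_dim), one row per column of the
        # sample visual feature.
        semantic_feats, _ = self.rnn(visual_feats)
        return semantic_feats  # (batch, T, 2 * hidden)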
After the terminal obtains the sample text semantic features corresponding to the columns of elements, the terminal can decode the sample text semantic features corresponding to each sample visual feature through the decoding network in the attention coding and decoding model based on an attention mechanism, so as to obtain a plurality of predicted text labels corresponding to the sample text region, and obtain the second predicted text label distribution corresponding to the sample text region based on these predicted text labels. Performing the decoding operation with an attention mechanism allows the decoding network to capture the specific parts relating each sample text semantic feature to each predicted text label, which strengthens the correlation between each output predicted text label and the sample text semantic features and thereby improves the accuracy of the obtained second predicted text label distribution. The decoding network in the attention codec model may be, for example, a Recurrent Neural Network (RNN) decoding network based on the attention mechanism.
Continuing with the example shown in fig. 8A, after f1, f2, f3 and f4 pass through the coding network in the attention codec model (shown as A1-A4 in fig. 8A), the coded outputs (shown as y1, y2, y3 and y4 in fig. 8A) are obtained; the coded outputs then pass through the decoding network in the attention codec model, so as to obtain the posterior probability corresponding to the second predicted text label distribution, specifically P(state) as shown in fig. 8A.
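A single attention-based decoding step could be sketched as follows, assuming an additive attention form and a GRU cell; neither choice, nor any of the dimensions used, is mandated by this application.

import torch
import torch.nn as nn

class AttnDecoderStep(nn.Module):
    def __init__(self, enc_dim: int = 512, hidden: int = 256, num_classes: int = 6000):
        super().__init__()
        self.attn_enc = nn.Linear(enc_dim, hidden)
        self.attn_dec = nn.Linear(hidden, hidden)
        self.attn_v = nn.Linear(hidden, 1)
        self.embed = nn.Embedding(num_classes, hidden)
        self.rnn = nn.GRUCell(enc_dim + hidden, hidden)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, prev_label, prev_state, enc_outputs):
        # enc_outputs: (batch, T, enc_dim) sample text semantic features.
        # Attention weights capture which columns matter for the label
        # emitted at this step.
        scores = self.attn_v(torch.tanh(
            self.attn_enc(enc_outputs) + self.attn_dec(prev_state).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)           # (batch, T, 1)
        context = (weights * enc_outputs).sum(dim=1)     # (batch, enc_dim)
        state = self.rnn(torch.cat([context, self.embed(prev_label)], dim=1),
                         prev_state)
        return self.out(state), state  # logits over labels, new decoder state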
It should be noted that the order of obtaining the first predictive text label distribution and the second predictive text label distribution may be arbitrary, and the present application is not limited thereto.
As an embodiment, a specific implementation manner example of obtaining the first training loss of the text recognition model in S723 is as follows:
S4.1, determining the forward probability and the backward probability corresponding to each predicted text label contained in each first predicted text label distribution by utilizing dynamic programming;
and S4.2, obtaining an initial training loss corresponding to the distribution of the first predicted text label based on the forward probability and the backward probability corresponding to each predicted text label.
In order to avoid annotating the position corresponding to each character, in the embodiment of the application there need not be a strict correspondence between each column in the sample visual feature and the predicted text labels. The decoding network in the text recognition model may therefore introduce a specific mapping rule and perform an inverse mapping calculation on each predicted text label in a first predicted text label distribution, thereby obtaining the generation probability of each predicted text label in the mapping space and achieving automatic alignment between the sample visual feature and the corresponding first predicted text label distribution.
The decoding network in the text recognition model performs a linear operation on each column of the sample visual feature f = (f1, f2, …, fT), where T represents the number of columns of the sample visual feature, to obtain the probability distribution of each ft over the preset label categories C, where C comprises all Chinese characters and an invalid output symbol, and the invalid output symbol can be understood as an additional blank symbol. Based on these probability distributions, an initial predicted text label distribution is obtained, and the initial predicted text label distribution is passed through a sequence mapping function B to obtain the first predicted text label distribution, where the sequence mapping function B comprises two operations: removing one of two adjacent repeated characters, and removing the invalid output symbols.
Since the initial predicted text label distribution contains invalid output symbols, repeated characters and the like, there may be multiple paths that yield the same first predicted text label distribution; the posterior probabilities of all paths mapped by the sequence mapping function B to the label sequence w are therefore summed:
p(w|c) = Σ_{π: B(π)=w} p(π|c)

wherein the posterior probability of a path π occurring is defined as

p(π|c) = ∏_{t=1..T} y_t(π_t)

where y_t(π_t) is the probability of the predicted label π_t at time stamp t, and π denotes any path mapped by the sequence mapping function B into the first predicted text label distribution w; such a path can be understood as a sequence consisting of characters and invalid output symbols.
Referring to fig. 8B, which shows a plurality of possible paths for generating a first predicted text label distribution provided in the embodiment of the present application, in fig. 8B the first predicted text label distribution is "state", and the sample visual feature contains 11 columns, corresponding to the outputs of the first column to the eleventh column. The initial predicted text label distribution obtained by the terminal is "-S-t-a-t-e-", and the time series length is 8, corresponding to t1-t8 in fig. 8B. The paths that the terminal determines as generating the first predicted text label distribution are the paths shown by the arrows in fig. 8B; for example, "--St-ate" in fig. 8B is one of the paths that generates "state".
However, since computing p(w|c) by brute force has high complexity, in the embodiment of the present application the posterior probability is calculated by dynamic programming, which reduces the complexity of calculating p(w|c).
The terminal determines the forward probability and the backward probability corresponding to each predicted text label in the distribution of the first predicted text labels by using dynamic programming, and calculates the first training loss based on the forward probability and the backward probability corresponding to each predicted text label.
When calculating the forward probability corresponding to a predicted text label, the sum of the probabilities of the prefix sub-paths that pass through the predicted text label at a preset time in each candidate path can be determined, so as to obtain the forward probability corresponding to the predicted text label. Similarly, when calculating the backward probability corresponding to a predicted text label, the sum of the probabilities of the suffix sub-paths that pass through the predicted text label at a preset time in each candidate path is determined, so as to obtain the backward probability corresponding to the predicted text label. The initial training loss corresponding to the first predicted text label distribution is then obtained based on the forward probability and the backward probability corresponding to each predicted text label. The candidate paths refer broadly to all paths that can possibly generate the first predicted text label distribution.
For example, continuing with the example shown in fig. 8B, taking the example that the predicted text label is the fourth element "t" in "-S-t-a-t-e-" and the preset time is t3, the probability sum of all prefix sub-paths corresponding to the predicted text label may be represented as:
α3(4) = p(-st) + p(sst) + p(s-t) + p(stt)

where α3(4) represents the forward probability (probability sum) corresponding to the predicted text label; "-st", "sst", "s-t" and "stt" are the prefix sub-paths corresponding to the predicted text label, and p(-st), p(sst), p(s-t) and p(stt) represent the probabilities corresponding to these prefix sub-paths respectively.
After the posterior probability is calculated, the initial training loss corresponding to the first predicted text label distribution can be obtained from the posterior probability; an example of the calculation formula is:
loss_ctc = -ln(p(w|c))
By analogy, after the initial training loss corresponding to each first predicted text label distribution is obtained, the initial training losses may be weighted to obtain the first training loss of the text recognition model.
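The dynamic programming described above can be illustrated with the following forward (alpha) recursion, written in NumPy purely as a sketch; in practice the recursion is usually carried out in log space for numerical stability, and the backward (beta) probabilities are computed symmetrically.

import numpy as np

def ctc_initial_loss(probs: np.ndarray, labels: list[int], blank: int = 0) -> float:
    # probs: (T, num_classes) column-wise probabilities; labels: the first
    # predicted text label distribution w expressed as label indices.
    ext = [blank]                      # extended sequence with blanks
    for lab in labels:
        ext += [lab, blank]
    T, S = probs.shape[0], len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skipping over the previous blank is allowed only when the
            # current label differs from the label two positions back.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]

    p_w_given_c = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return float(-np.log(p_w_given_c))   # loss_ctc = -ln(p(w|c))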
Continuing with the example shown in fig. 8A, after f1, f2, f3 and f4 pass through the decoding network in the text recognition model, the posterior probability P(state) shown in fig. 8A is obtained.
As an embodiment, a specific implementation manner example of obtaining the second training loss of the text recognition model in S723 is as follows:
and calculating a second training loss according to the second predicted text label distribution and the real text label distribution, where the second training loss can be represented by a cross entropy loss function. The real text label distribution can be understood as the real text result in the sample text region.
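As a sketch only, assuming the second predicted text label distribution is available as per-position logits and the real text label distribution as label indices, the cross entropy loss might be computed as follows.

import torch
import torch.nn.functional as F

def attn_branch_loss(second_dist: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # second_dist: (batch, L, num_classes) logits from the attention codec
    # model; targets: (batch, L) indices of the real text labels.
    return F.cross_entropy(second_dist.transpose(1, 2), targets)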
In S734, the terminal may perform a weighted summation on the first training loss and the second training loss to obtain a joint training loss, where an equation for calculating the joint training loss is, for example:
loss = α * loss_ctc + (1 - α) * loss_attn

where loss represents the joint training loss, α represents the first weight, loss_ctc represents the first training loss, and loss_attn represents the second training loss.
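The weighted summation can be sketched in one line; the default value of the first weight alpha below is an assumption, not a value specified by this application.

def joint_loss(loss_ctc, loss_attn, alpha: float = 0.5):
    # loss = alpha * loss_ctc + (1 - alpha) * loss_attn
    return alpha * loss_ctc + (1.0 - alpha) * loss_attn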
It should be noted that the order of obtaining the first training loss and the second training loss may be arbitrary, and the present application is not limited thereto.
In the embodiment of the application, the attention mechanism in the attention coding and decoding model can fully learn the semantic features between the current position and the positions before it, while the posterior probability calculated by the text recognition model makes full use of the feature information of the positions after the current position. Incorporating the text recognition model into joint training with the attention coding and decoding model therefore accelerates the convergence of the network and improves the recognition capability of the text recognition model.
After obtaining the joint training loss, the terminal may adjust parameters of the text recognition model according to the joint training loss until the text recognition model satisfies a model convergence condition, such as the joint training loss reaching a preset value or the number of training iterations reaching a preset number, so as to obtain the trained target text recognition model. For example, the terminal may adjust parameters of the text recognition model using an AdaDelta optimizer.
As an example, the AdaDelta optimizer may have a decay rate set to 0.95, the number of sample text regions used per pass may be set to 128, and the gradient clipping magnitude may be set to 5.
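A parameter-update sketch with these hyper-parameters, assuming PyTorch, could look as follows; the function names are placeholders, and the batch size of 128 is assumed to be configured in the data loader.

import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Decay rate (rho) of 0.95, as mentioned above.
    return torch.optim.Adadelta(model.parameters(), rho=0.95)

def optimize_step(model: torch.nn.Module, loss: torch.Tensor,
                  optimizer: torch.optim.Optimizer, clip_norm: float = 5.0):
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping magnitude of 5.
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()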
As an embodiment, in order to improve the training effect of the text recognition model, the attention coding and decoding model may be a model trained in advance, or parameters of the attention coding and decoding model may be adjusted at the same time when parameters of the text recognition model are adjusted.
When the parameters of the text recognition model are adjusted, the value of the first weight described above can also be adjusted, so that the proportions of the first training loss and the second training loss in the joint training loss are dynamically adjusted, thereby improving the training effect of the text recognition model.
In the following, an example of a training process of a text recognition model in the embodiment of the present application is described with reference to a flowchart of a method for training a text recognition model shown in fig. 9, where the flowchart of the method includes:
and S91, extracting a sample text area of the sample image.
The manner of obtaining the sample text region can refer to the content discussed above, and is not described herein.
And S92, extracting the sample visual characteristics of the sample text area.
And extracting the sample visual characteristics of the sample text region through the convolution network in the text recognition model.
S93, outputting a first predictive label distribution based on the decoding network in the text recognition model.
And S94, outputting a second prediction label distribution based on the decoding network in the attention coding and decoding model.
Here, the order of steps of S93 and S94 may be arbitrary.
And S95, calculating the joint training loss.
The process of calculating the joint training loss can refer to the content discussed above, and is not described here.
And S96, adjusting the parameters of the text recognition model until the text recognition model meets the model convergence condition.
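The flow of S91-S96 can be summarized, purely for illustration, by the following sketch; all names used here (converged, conv, decode, ctc_loss, ce_loss, batch.labels) are hypothetical placeholders rather than interfaces defined by this application.

def train(text_rec_model, attn_codec_model, sample_regions, optimizer, alpha=0.5):
    while not converged(text_rec_model):                       # model convergence condition
        for batch in sample_regions:                           # S91: sample text regions
            feats = text_rec_model.conv(batch)                 # S92: sample visual features
            first_dist = text_rec_model.decode(feats)          # S93: first predicted label distribution
            second_dist = attn_codec_model(feats)              # S94: second predicted label distribution
            loss_ctc = text_rec_model.ctc_loss(first_dist, batch.labels)
            loss_attn = attn_codec_model.ce_loss(second_dist, batch.labels)
            loss = alpha * loss_ctc + (1 - alpha) * loss_attn  # S95: joint training loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                                   # S96: parameter adjustment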
When the text recognition model is trained jointly with the attention coding and decoding model, the attention coding and decoding model can supervise the convolutional network in the text recognition model to learn features that are significant for text recognition and to relatively weaken features of lesser significance for text recognition, such as font, color, character size and background, thereby improving the accuracy of the text recognition model in recognizing text.
Based on the application scenario shown in fig. 1B, a text recognition method in the embodiment of the present application is described below by way of example with reference to the interaction process diagram between the terminal and the server shown in fig. 10:
s101, the terminal acquires an image to be identified.
And S102, the terminal sends the image to be identified to a server.
S103, the server extracts a text area to be recognized from the image to be recognized.
The text region to be recognized can be obtained by referring to the content discussed above, and the details are not repeated here.
And S104, acquiring target visual features corresponding to the text area to be recognized by adopting a target text recognition model.
The training process of the target text recognition model, the meaning of the target visual features, and the manner of obtaining the target visual features may refer to the contents discussed above, and are not described herein again.
And S105, decoding the target visual characteristics by adopting the target text recognition model to obtain a text recognition result of the text region to be recognized.
The specific process of the decoding operation can refer to the content discussed above, and is not described herein again.
And S106, the server sends the text recognition result to the terminal.
In the embodiment of the application, the terminal and the server cooperatively execute the text recognition method, so that the processing amount of the terminal can be relatively reduced, and the text recognition efficiency can be improved because complicated coding of the target visual characteristics is not required.
For example, referring to fig. 11, which is an exemplary view of an interface change of a text recognition method provided in an embodiment of the present application, the terminal presents an image input control 1101 in the input box of a navigation address. When the user operates the image input control 1101, a corresponding image to be recognized can be selected and input. After obtaining the image to be recognized, the terminal can send it to the server; the server performs text recognition on the image to be recognized, obtains a text recognition result, determines a destination according to the text recognition result, and sends the destination to the terminal, so that the terminal can display the destination 1102 and generate a navigation route 1103 from the current place to the destination.
Based on the same inventive concept, an embodiment of the present application provides a text recognition apparatus, which can implement the functions of the foregoing terminal or server, with reference to fig. 12, and the apparatus includes:
an image obtaining module 1201, configured to obtain an image to be identified;
a region extraction module 1202, configured to extract a text region to be recognized from an image to be recognized;
a text recognition module 1203, configured to obtain a target visual feature corresponding to the text region to be recognized by using the trained target text recognition model, and perform a decoding operation on the target visual feature to obtain a text recognition result of the text region to be recognized;
the trained target text recognition model is obtained by performing joint training on a text recognition model to be trained and an attention coding and decoding model, wherein the input of the attention coding and decoding model to be trained is the output of a convolution network in the text recognition model to be trained.
In one possible embodiment, the apparatus further includes a model training module 1204, where the model training module 1204 is configured to train to obtain the target text recognition model by:
extracting corresponding sample text regions from each sample image in the sample image set respectively to obtain a sample text region set;
based on the sample text region set, performing multiple rounds of iterative training on a text recognition model to be trained until a model convergence condition is met, wherein the text recognition model further comprises a decoding network, and each round of iterative training comprises the following operations:
inputting each sample text area selected from the sample text area set into a convolution network, and respectively extracting the respective sample visual characteristics of each sample text area;
respectively inputting the obtained visual characteristics of each sample into a decoding network and an attention coding and decoding model, and respectively obtaining a first predicted text label distribution and a second predicted text label distribution which respectively correspond to each sample text region;
determining a first training loss of a text recognition model to be trained based on the obtained first predicted text label distribution, and determining a second training loss of the attention coding and decoding model based on the obtained second predicted text label distribution;
and determining a joint training loss based on the first training loss and the second training loss, and adjusting parameters of the text recognition model based on the joint training loss.
In one possible embodiment, model training module 1204 is specifically configured to:
based on the decoding network, respectively executing the following operations on each column element contained in each sample visual feature:
performing linear operation on one column of elements in each column of elements to obtain probability distribution of the one column of elements on a preset label category, wherein the preset label category comprises a plurality of characters and invalid output symbols;
determining a prediction label to which a list of elements belongs based on the probability distribution;
and obtaining the first predicted text label distribution corresponding to each sample visual feature based on the predicted labels corresponding to each column of elements contained in each sample visual feature.
In one possible embodiment, model training module 1204 is specifically configured to:
based on the attention coding and decoding model, aiming at each sample visual feature, the following operations are respectively executed:
performing bidirectional coding on one sample visual feature in each sample visual feature to obtain a sample text semantic feature corresponding to a sample text region; and
and decoding the semantic features of the sample text by using an attention mechanism to obtain a second predicted text label distribution corresponding to a sample text region.
In one possible embodiment, model training module 1204 is further configured to:
before inputting each sample text region selected from the sample text region set into the convolutional network to respectively extract the sample visual features of each sample text region, selecting each sample text region from the sample text region set;
aiming at each selected sample text region, the following operations are respectively executed:
scaling a sample text region in each sample text region in an equal proportion to obtain a scaled sample text region, wherein the first size of the scaled sample text region in the first direction is a first preset size;
and filling the scaled sample text region along a second direction until a second size of the scaled sample text region in the second direction reaches a second preset size, wherein the first direction and the second direction are perpendicular to each other.
In one possible embodiment, model training module 1204 is specifically configured to:
for each first predictive text label distribution, respectively performing the following operations:
determining forward probability and backward probability corresponding to each predicted text label in a first predicted text label distribution by utilizing dynamic programming, wherein one first predicted text label distribution is one of the first predicted text label distributions;
obtaining an initial training loss corresponding to the distribution of a first predicted text label based on the forward probability and the backward probability corresponding to each predicted text label;
based on the obtained initial training losses, a first training loss of the text recognition model to be trained is determined.
In one possible embodiment, model training module 1204 is specifically configured to:
for each predicted text label in a first distribution of predicted text labels, performing the following operations:
determining the probability sum of each prefix sub-path of a predicted text label in the distribution of the first predicted text labels at a preset moment in each candidate path to obtain the forward probability corresponding to the predicted text label;
and determining the probability sum of each suffix sub-path passing through a predicted text label at a preset time in each candidate path to obtain the backward probability corresponding to the predicted text label.
In one possible embodiment, model training module 1204 is specifically configured to:
based on the first weight, carrying out weighted summation on the first training loss and the second training loss to obtain a joint training loss;
when the parameters of the text recognition model are adjusted based on the joint training loss, the method further comprises the following steps:
and adjusting the value of the first weight.
In a possible embodiment, the text recognition module 1203 is specifically configured to:
decoding the target visual characteristics to obtain target text label distribution corresponding to the text area to be identified;
and fusing the weight of the pre-trained language model and the target text label distribution to obtain a text recognition result.
In a possible embodiment, the text recognition module 1203 is specifically configured to:
and fusing the weight of the pre-trained language model and the target text label distribution based on a weighted finite state transducer to obtain the text recognition result.
As an example, model training module 1204 of FIG. 12 is an optional module.
It should be noted that the apparatus shown in fig. 12 may also implement any of the text recognition methods discussed above, and will not be described herein again.
Based on the same inventive concept, the present application provides a computer device, which can implement the functions of the foregoing terminal or server, please refer to fig. 13, and the computer device includes a processor 1301 and a memory 1302.
The processor 1301 may be a Central Processing Unit (CPU), a digital processing unit, or the like. The specific connection medium between the memory 1302 and the processor 1301 is not limited in the embodiments of the present application. In fig. 13, the memory 1302 and the processor 1301 are connected through a bus 1303, which is shown by a thick line; the connection manner between other components is merely illustrative and is not limiting. The bus 1303 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 13, but this does not mean that there is only one bus or one type of bus.
The memory 1302 may be a volatile memory, such as a random-access memory (RAM); the memory 1302 may also be a non-volatile memory, such as, but not limited to, a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1302 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 1302 may also be a combination of the above.
The processor 1301 is configured to execute the text recognition method discussed above when invoking the computer program stored in the memory 1302, and may also be used to implement the functions of the apparatus shown in fig. 12.
Based on the same inventive concept, embodiments of the present application provide a computer storage medium storing computer instructions that, when executed on a computer, cause the computer to perform any of the text recognition methods discussed above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Based on the same inventive concept, the embodiments of the present application provide a computer program product, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform any of the text recognition methods described above.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media capable of storing program codes, such as a removable Memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, and an optical disk.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (13)

1. A text recognition method, comprising:
acquiring an image to be identified;
extracting a text area to be recognized from the image to be recognized;
adopting a trained target text recognition model to obtain target visual features corresponding to the text region to be recognized, and performing decoding operation on the target visual features to obtain a text recognition result of the text region to be recognized;
the trained target text recognition model is obtained by performing joint training on a text recognition model to be trained and an attention coding and decoding model, wherein the input of the attention coding and decoding model to be trained is the output of a convolutional network in the text recognition model to be trained.
2. The method of claim 1, wherein the target text recognition model is trained by:
extracting corresponding sample text regions from each sample image in the sample image set respectively to obtain a sample text region set;
performing multiple rounds of iterative training on the text recognition model to be trained based on the sample text region set until a model convergence condition is met, wherein the text recognition model further comprises a decoding network, and each round of iterative training comprises the following operations:
inputting each sample text region selected from the sample text region set into the convolutional network, and respectively extracting the respective sample visual feature of each sample text region;
respectively inputting the obtained visual features of each sample into the decoding network and the attention coding and decoding model, and respectively obtaining a first predicted text label distribution and a second predicted text label distribution which respectively correspond to each text region of each sample;
determining a first training loss of the text recognition model to be trained based on the obtained first predicted text label distribution, and determining a second training loss of the attention coding and decoding model based on the obtained second predicted text label distribution;
determining a joint training loss based on the first training loss and the second training loss, and performing parameter adjustment on the text recognition model based on the joint training loss.
3. The method of claim 2, wherein said inputting the obtained sample visual features into the decoding network to obtain a first predicted text label distribution corresponding to each of the sample text regions comprises:
based on the decoding network, respectively performing the following operations on each column element contained in each sample visual feature:
performing linear operation on one column of elements in each column of elements to obtain probability distribution of the column of elements on a preset label category, wherein the preset label category comprises a plurality of characters and invalid output symbols;
determining a predictive label to which the list of elements belongs based on the probability distribution;
and obtaining a first predictive text label distribution corresponding to each sample visual feature based on the predictive labels corresponding to each column of elements contained in each sample visual feature.
4. The method of claim 2, wherein the inputting the obtained sample visual features into the attention coding and decoding model to obtain a second predicted text label distribution corresponding to each sample text region comprises:
based on the attention coding and decoding model, respectively executing the following operations for the sample visual features:
performing bidirectional coding on one sample visual feature in each sample visual feature to obtain a sample text semantic feature corresponding to a sample text region; and
and decoding the semantic features of the sample text by using an attention mechanism to obtain a second predicted text label distribution corresponding to the sample text region.
5. The method of claim 2, wherein before inputting each sample text region selected from the set of sample text regions into the convolutional network to extract a respective sample visual feature of each respective sample text region, the method further comprises:
selecting each sample text region from the set of sample text regions;
and respectively executing the following operations aiming at the selected sample text regions:
scaling one sample text region in each sample text region in an equal proportion to obtain a scaled sample text region, wherein the first size of the scaled sample text region in the first direction is a first preset size;
and filling the scaled sample text region along a second direction until a second size of the scaled sample text region in the second direction reaches a second preset size, wherein the first direction and the second direction are perpendicular to each other.
6. The method of claim 2, wherein determining a first training loss for the text recognition model to be trained based on the obtained respective first predictive text label distributions comprises:
for each first predictive text label distribution, respectively performing the following operations:
determining a forward probability and a backward probability corresponding to each predicted text label in a first predicted text label distribution by using dynamic programming, wherein the first predicted text label distribution is one of the first predicted text label distributions;
obtaining an initial training loss corresponding to the distribution of the first predicted text label based on the forward probability and the backward probability corresponding to each predicted text label;
and determining a first training loss of the text recognition model to be trained based on the obtained initial training losses.
7. The method of claim 6, wherein determining a forward probability and a backward probability for each predictive text label in a first distribution of predictive text labels using dynamic programming comprises:
for each predicted text label in the one first distribution of predicted text labels, performing the following operations:
determining the probability sum of each prefix sub-path passing through a predicted text label in the first predicted text label distribution at a preset time in each candidate path to obtain the forward probability corresponding to the predicted text label;
and determining the sum of the probabilities of the suffix sub-paths of the predicted text label passing through the predicted text label at the preset time in each candidate path to obtain the backward probability corresponding to the predicted text label.
8. The method of any of claims 2-7, wherein determining a joint training loss based on the first training loss and the second training loss comprises:
based on a first weight, performing weighted summation on the first training loss and the second training loss to obtain a joint training loss;
when the parameters of the text recognition model are adjusted based on the joint training loss, the method further comprises:
and adjusting the value of the first weight.
9. The method according to any one of claims 1 to 6, wherein the decoding operation on the target visual feature to obtain a text recognition result of the text region to be recognized comprises:
decoding the target visual features to obtain target text label distribution corresponding to the text area to be identified;
and fusing the weight of the pre-trained language model and the target text label distribution to obtain the text recognition result.
10. The method of claim 9, wherein fusing the weights of the pre-trained language model and the target text label distribution to obtain a text recognition result comprises:
and fusing the weight of the pre-trained language model and the target text label distribution based on a weighted finite state transducer to obtain a text recognition result.
11. A text recognition apparatus, comprising:
the image acquisition module is used for acquiring an image to be identified;
the region extraction module is used for extracting a text region to be identified from the image to be identified;
the text recognition module is used for acquiring a target visual feature corresponding to the text region to be recognized by adopting a trained target text recognition model, and decoding the target visual feature to acquire a text recognition result of the text region to be recognized;
the trained target text recognition model is obtained by performing joint training on a text recognition model to be trained and an attention coding and decoding model, wherein the input of the attention coding and decoding model to be trained is the output of a convolutional network in the text recognition model to be trained.
12. A computer device, comprising:
at least one processor, and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the at least one processor implementing the method of any one of claims 1 to 10 by executing the instructions stored by the memory.
13. A computer storage medium having stored thereon computer instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 10.
CN202110374315.2A 2021-04-07 2021-04-07 Text recognition method, device, equipment and medium Pending CN113705313A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110374315.2A CN113705313A (en) 2021-04-07 2021-04-07 Text recognition method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN113705313A true CN113705313A (en) 2021-11-26

Family

ID=78647926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110374315.2A Pending CN113705313A (en) 2021-04-07 2021-04-07 Text recognition method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113705313A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113963358A (en) * 2021-12-20 2022-01-21 北京易真学思教育科技有限公司 Text recognition model training method, text recognition device and electronic equipment
CN114283411A (en) * 2021-12-20 2022-04-05 北京百度网讯科技有限公司 Text recognition method, and training method and device of text recognition model
CN114283411B (en) * 2021-12-20 2022-11-15 北京百度网讯科技有限公司 Text recognition method, and training method and device of text recognition model
CN114359942A (en) * 2022-01-11 2022-04-15 平安科技(深圳)有限公司 Subtitle extraction method, device and equipment based on artificial intelligence and storage medium
CN114462412A (en) * 2022-02-14 2022-05-10 平安科技(深圳)有限公司 Entity identification method and device, electronic equipment and storage medium
CN114462412B (en) * 2022-02-14 2023-05-12 平安科技(深圳)有限公司 Entity identification method, entity identification device, electronic equipment and storage medium
CN114863437A (en) * 2022-04-21 2022-08-05 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN114724168A (en) * 2022-05-10 2022-07-08 北京百度网讯科技有限公司 Training method of deep learning model, text recognition method, text recognition device and text recognition equipment
CN115376137A (en) * 2022-08-02 2022-11-22 北京百度网讯科技有限公司 Optical character recognition processing and text recognition model training method and device
CN115376137B (en) * 2022-08-02 2023-09-26 北京百度网讯科技有限公司 Optical character recognition processing and text recognition model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination