CN112784586A - Text recognition method and related product - Google Patents

Text recognition method and related product

Info

Publication number
CN112784586A
Authority
CN
China
Prior art keywords
text recognition
recognition result
sequence
feature sequence
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911089386.7A
Other languages
Chinese (zh)
Inventor
程苗苗
蔡晓聪
侯军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201911089386.7A
Publication of CN112784586A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

Embodiments of the present application disclose a text recognition method and related products, where the method includes: encoding a target image to obtain a forward feature sequence and a reverse feature sequence of the target image; decoding the forward feature sequence to obtain a first feature sequence; decoding the reverse feature sequence to obtain a second feature sequence; and obtaining a target text recognition result based on the first feature sequence and the second feature sequence. The method can effectively improve text recognition accuracy.

Description

Text recognition method and related product
Technical Field
The present application relates to the field of text recognition, and in particular, to a text recognition method and a related product.
Background
Text recognition based on computer vision is widely used in many fields. In current text recognition technologies for real-world scenes, however, recognition accuracy is often insufficient, so text recognition techniques with higher accuracy need to be developed.
Disclosure of Invention
The embodiment of the application discloses a text recognition method and a related product.
In a first aspect, an embodiment of the present application provides a text recognition method, where the method includes: encoding a target image to obtain a forward feature sequence and a reverse feature sequence of the target image; decoding the forward feature sequence to obtain a first feature sequence; decoding the reverse feature sequence to obtain a second feature sequence; and obtaining a target text recognition result based on the first feature sequence and the second feature sequence.
The forward feature sequence and the reverse feature sequence may be two feature sequences containing the same feature vectors arranged in opposite orders. For example, the M feature vectors in the forward feature sequence are, in order, the first feature vector, the second feature vector, and so on up to the M-th feature vector, while the M feature vectors in the reverse feature sequence are, in order, the M-th feature vector, the (M-1)-th feature vector, and so on down to the first feature vector, where M is an integer greater than 1.
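The relationship between the two sequences can be sketched in a few lines of Python; the names below (`reverse_map`, `forward_seq`) are illustrative and do not come from the application:

```python
# Sketch: a "feature sequence" as a list of feature vectors, and the
# reverse sequence obtained by flipping the order of the vectors.

def reverse_map(sequence):
    """Return a sequence with the same feature vectors in opposite order."""
    return list(reversed(sequence))

M = 4
forward_seq = [[float(i)] * 3 for i in range(1, M + 1)]  # f_1 ... f_M
reverse_seq = reverse_map(forward_seq)                   # f_M ... f_1

assert reverse_seq[0] == forward_seq[-1]
assert reverse_map(reverse_seq) == forward_seq           # mapping twice restores the order
```

Note that reversing is an involution: applying the reverse mapping twice recovers the forward sequence, which is why the description below can reverse-map a decoded sequence back before classification.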
In the embodiments of the present application, the text recognition result is predicted by combining the first feature sequence and the second feature sequence (i.e., feature sequences in two directions). Compared with predicting a text recognition result from a feature sequence in only one direction, this yields higher text recognition accuracy.
In an optional implementation, encoding the target image to obtain the forward feature sequence and the reverse feature sequence of the target image includes: encoding the target image to obtain the forward feature sequence; and reverse-mapping the forward feature sequence to obtain the reverse feature sequence.
In this implementation, by reverse-mapping the forward feature sequence, a feature sequence containing the same feature data as the forward feature sequence but in reverse order can be obtained quickly.
In an optional implementation, before the target image is encoded to obtain the forward feature sequence and the reverse feature sequence, the method further includes: correcting the target image to obtain a corrected target image. Encoding the target image to obtain the forward feature sequence and the reverse feature sequence then includes: encoding the corrected target image to obtain the forward feature sequence and the reverse feature sequence.
In this implementation, the target image is first corrected and the corrected target image is then used for text recognition, which can further improve text recognition accuracy.
In an optional implementation, obtaining the target text recognition result based on the first feature sequence and the second feature sequence includes: performing classification prediction based on the first feature sequence to obtain a first text recognition result and a first confidence of the first text recognition result; performing classification prediction based on the second feature sequence to obtain a second text recognition result and a second confidence of the second text recognition result; and determining the target text recognition result from the first text recognition result and the second text recognition result based on the first confidence and the second confidence.
Confidence is also referred to as reliability, confidence level, or confidence coefficient. The first confidence may represent the probability that the first text recognition result is the correct text recognition result, and the second confidence may represent the probability that the second text recognition result is the correct text recognition result. It can be understood that if the first confidence is higher than the second confidence, the probability that the first text recognition result is correct is higher than the probability that the second text recognition result is correct. In this implementation, the better of the text recognition results from the two directions is selected as the final text recognition result, which can effectively improve text recognition accuracy and is simple to implement.
In an optional implementation, determining the target text recognition result from the first text recognition result and the second text recognition result based on the first confidence and the second confidence includes: determining the first text recognition result as the target text recognition result if the first confidence is higher than the second confidence.
In this implementation, the text recognition result with the higher confidence of the two is determined as the final target text recognition result, so the more accurate of the two results can be reliably selected.
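The confidence-based selection described above can be sketched as follows; the helper name and the tie-breaking choice (preferring the first result on equal confidence) are assumptions, not specified by the application:

```python
# Illustrative sketch: pick the recognition result with the higher confidence.

def select_result(first_result, first_conf, second_result, second_conf):
    """Return the text recognition result whose confidence is higher."""
    if first_conf >= second_conf:   # tie-break in favor of the first result (assumption)
        return first_result
    return second_result

print(select_result("hello", 0.92, "olleh", 0.35))  # -> hello
```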
In an optional implementation, performing classification prediction based on the second feature sequence to obtain the second text recognition result and the second confidence of the second text recognition result includes: reverse-mapping the second feature sequence to obtain a third feature sequence; and performing classification prediction on the third feature sequence to obtain the second text recognition result and the second confidence.
In this implementation, by reverse-mapping the second feature sequence, a third feature sequence containing the same feature data as the second feature sequence but in reverse order can be obtained quickly, and the third feature sequence is then used for prediction to obtain the other text recognition result.
In an optional implementation, the decoding of the forward feature sequence and the reverse feature sequence is performed by an attention mechanism network.
In this implementation, an attention mechanism network is used for decoding, which can make full use of the contextual attention information of the feature sequence and yields high decoding accuracy.
In an optional implementation, the attention mechanism network includes a gated recurrent unit (GRU).
In an optional implementation, the target text recognition result includes a first text recognition result and a second text recognition result, and obtaining the target text recognition result based on the first feature sequence and the second feature sequence includes: performing classification prediction on the first feature sequence through a text recognition network to obtain the first text recognition result; and performing classification prediction on the second feature sequence through the text recognition network to obtain the second text recognition result. The method further includes: updating parameters of the text recognition network based on the first text recognition result and the second text recognition result.
In this implementation, the parameters of the text recognition network are updated based on both text recognition results, which can improve the efficiency of training the text recognition network.
In an optional implementation, updating the parameters of the text recognition network based on the first text recognition result and the second text recognition result includes: obtaining a first network loss based on the first text recognition result and an expected recognition result of the target image; obtaining a second network loss based on the second text recognition result and the expected recognition result; and updating the parameters of the text recognition network based on the first network loss and the second network loss.
In an optional implementation, updating the parameters of the text recognition network based on the first network loss and the second network loss includes: calculating a weighted sum of the first network loss and the second network loss to obtain a third network loss; and updating the parameters of the text recognition network with the third network loss.
In this implementation, the parameters of the text recognition network are updated with the third network loss, so that the updated text recognition network better performs text recognition from both directions.
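The weighted-sum loss described above can be sketched as follows; the weights `w1` and `w2` are illustrative defaults, as the application does not fix their values:

```python
# Sketch of the third network loss as a weighted sum of the two
# directional losses (weights are assumptions, not from the application).

def combined_loss(first_loss, second_loss, w1=0.5, w2=0.5):
    """Third network loss: weighted sum of the forward and reverse losses."""
    return w1 * first_loss + w2 * second_loss

print(combined_loss(2.0, 4.0))  # -> 3.0
```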
In an optional implementation, decoding the forward feature sequence to obtain a first feature sequence includes:
calculating a context information vector based on the parameters of the current hidden layer of a gated recurrent unit (GRU) and the forward feature sequence, where the context information vector represents the associations among the features included in the forward feature sequence;
converting a reference text into an embedded vector through an embedding layer of the GRU, where the reference text is obtained by performing classification prediction on a vector output by the GRU; and
fusing the context information vector and the embedded vector to obtain a target vector, the target vector being included in the first feature sequence.
Optionally, the context information vector is calculated from the parameters of the current hidden layer of the GRU and the forward feature sequence using an attention formula of the form:

a_t = sum_i alpha_(t,i) * f_i, where alpha_(t,i) = softmax_i( score(h_t, f_i) )

where a_t denotes the calculated context information vector, h_t denotes the parameters of the current hidden layer, f_1, ..., f_M denote the vectors of the forward feature sequence, and score(·, ·) may be a similarity calculation function. Fusing the context information vector and the embedded vector to obtain the target vector may be done by adding corresponding elements of the two vectors, or by concatenating the two vectors.
The GRU is well suited to sequence problems: compared with a convolutional neural network it has higher decoding accuracy, and compared with a long short-term memory (LSTM) network it has a higher processing speed.
In this implementation, the GRU is used to decode the forward feature sequence, with both high decoding accuracy and high decoding speed.
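The attention step described above can be sketched in plain Python. This is a minimal illustration, not the patent's implementation: the dot-product `score`, element-wise-addition fusion, and all names are assumptions consistent with the formula's general form:

```python
import math

# Minimal sketch of the attention-style decoding step: a context vector
# computed from the current hidden state and the forward feature sequence,
# then fused with an embedded vector by element-wise addition.

def score(h, f):
    """Dot-product similarity between hidden state h and feature vector f
    (one possible choice of similarity function)."""
    return sum(hi * fi for hi, fi in zip(h, f))

def context_vector(h_t, forward_seq):
    """a_t = sum_i softmax_i(score(h_t, f_i)) * f_i."""
    e = [score(h_t, f) for f in forward_seq]
    m = max(e)
    weights = [math.exp(x - m) for x in e]   # numerically stable softmax
    total = sum(weights)
    alpha = [w / total for w in weights]
    dim = len(forward_seq[0])
    return [sum(alpha[i] * forward_seq[i][d] for i in range(len(forward_seq)))
            for d in range(dim)]

def fuse(context, embedded):
    """Fuse by adding corresponding elements (concatenation is the alternative)."""
    return [c + e for c, e in zip(context, embedded)]

a_t = context_vector([1.0, 0.0], [[2.0, 2.0], [2.0, 2.0]])
print(fuse(a_t, [0.5, 0.5]))  # -> [2.5, 2.5]
```

Because both feature vectors in the toy example are identical, the softmax weights are equal and the context vector reduces to that shared vector.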
In an optional implementation, before calculating the context information vector based on the parameters of the current hidden layer of the gated recurrent unit (GRU) and the forward feature sequence, the method further includes: inputting the forward feature sequence into the GRU to obtain parameters of a hidden layer; and using the obtained hidden-layer parameters as input to the GRU to obtain parameters of a new hidden layer.
In a second aspect, an embodiment of the present application provides a text recognition apparatus, including: an encoding unit configured to encode a target image to obtain a forward feature sequence and a reverse feature sequence of the target image; a decoding unit configured to decode the forward feature sequence to obtain a first feature sequence, and further configured to decode the reverse feature sequence to obtain a second feature sequence; and a processing unit configured to obtain a target text recognition result based on the first feature sequence and the second feature sequence.
In an optional implementation, the encoding unit is specifically configured to encode the target image to obtain the forward feature sequence, and the apparatus further includes: a reverse mapping unit configured to reverse-map the forward feature sequence to obtain the reverse feature sequence.
In an optional implementation, the apparatus further includes: a correction unit configured to correct the target image to obtain a corrected target image; the encoding unit is specifically configured to encode the corrected target image to obtain the forward feature sequence and the reverse feature sequence.
In an optional implementation, the processing unit is specifically configured to perform classification prediction based on the first feature sequence to obtain a first text recognition result and a first confidence of the first text recognition result; perform classification prediction based on the second feature sequence to obtain a second text recognition result and a second confidence of the second text recognition result; and determine the target text recognition result from the first text recognition result and the second text recognition result based on the first confidence and the second confidence.
In an optional implementation manner, the processing unit is specifically configured to determine the first text recognition result as the target text recognition result when the first confidence level is higher than the second confidence level.
In an optional implementation manner, the processing unit is specifically configured to perform reverse mapping on the second feature sequence to obtain a third feature sequence; and performing classification prediction processing on the third feature sequence to obtain the second text recognition result and the second confidence.
In an optional implementation, the decoding of the forward feature sequence and the reverse feature sequence is performed by an attention mechanism network.
In an optional implementation, the attention mechanism network includes a gated recurrent unit (GRU).
In an alternative implementation, the target text recognition result includes a first text recognition result and a second text recognition result; the processing unit is specifically configured to perform classification prediction processing on the first feature sequence through a text recognition network to obtain the first text recognition result; classifying and predicting a second feature sequence through the text recognition network to obtain a second text recognition result; the device further comprises: and the updating unit is used for updating the parameters of the text recognition network based on the first text recognition result and the second text recognition result.
In an optional implementation manner, the updating unit is specifically configured to obtain a first network loss based on the first text recognition result and the expected recognition result of the target image; obtaining a second network loss based on the second text recognition result and the expected recognition result; updating parameters of the text recognition network based on the first network loss and the second network loss.
In an optional implementation manner, the updating unit is specifically configured to calculate a weighted sum of the first network loss and the second network loss, so as to obtain a third network loss; updating parameters of the text recognition network with the third network loss.
In an optional implementation, the decoding unit is specifically configured to calculate a context information vector based on the parameters of the current hidden layer of a gated recurrent unit (GRU) and the forward feature sequence, where the context information vector represents the associations among the features included in the forward feature sequence; convert a reference text into an embedded vector through an embedding layer of the GRU, where the reference text is obtained by performing classification prediction on a vector output by the GRU; and fuse the context information vector and the embedded vector to obtain a target vector, the target vector being included in the first feature sequence.
In an optional implementation, the processing unit is further configured to input the forward feature sequence into the GRU to obtain parameters of a hidden layer, and to use the obtained hidden-layer parameters as input to the GRU to obtain parameters of a new hidden layer.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to perform the method of the first aspect and any one of the alternative implementations as described above when the program is executed.
In a fourth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a data interface, and the processor reads instructions stored on a memory through the data interface to perform the method according to the first aspect and any optional implementation manner.
In a fifth aspect, the present application provides a computer-readable storage medium, where a computer program is stored, where the computer program includes program instructions, and when the program instructions are executed by a processor, the processor is caused to execute the method of the first aspect and any optional implementation manner.
In a sixth aspect, the present application provides a computer program product, which includes program instructions, and when executed by a processor, causes the processor to execute the method of the first aspect and any optional implementation manner.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the background art, the drawings used in the embodiments or the background art are briefly described below.
Fig. 1 is a schematic structural diagram of a text recognition network according to an embodiment of the present disclosure;
fig. 2 is a flowchart of a training method for a text recognition network according to an embodiment of the present disclosure;
fig. 3 is a flowchart of a text recognition method according to an embodiment of the present application;
fig. 4 is a schematic diagram of a text recognition process provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a bidirectional feature decoding process according to an embodiment of the present application;
fig. 6 is a flowchart of a decoding processing method according to an embodiment of the present application;
FIG. 7 is a flow chart of another text recognition method provided in the embodiments of the present application;
fig. 8 is a flowchart of another text recognition method provided in an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The terms "first," "second," "third," and the like in the description, the claims, and the drawings of the present application are used to distinguish between similar elements and do not necessarily describe a particular sequence or chronological order. Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such a process, method, system, article, or apparatus.
Text recognition is a technique for automatically recognizing characters using a computer and is an important area of pattern recognition applications. Text recognition generally includes several parts, such as collecting text information, analyzing and processing the information, and classifying and discriminating the information. The text recognition method provided in the embodiments of the present application can be applied to text recognition scenarios such as image text recognition and video text recognition. The following briefly introduces how the method applies to image text recognition and video text recognition scenarios, respectively.
Image text recognition 1: the terminal device captures an image containing one or more characters, performs text recognition on the image, and displays the recognized characters. For example, a user takes a photo of a sign with a mobile phone, and the phone performs text recognition on the image and displays the text on the sign. As another example, the user uses a mobile phone to capture an image containing a passage of English, and the phone performs text recognition on the image and displays the Chinese text obtained by translating the English.
Image text recognition 2: the terminal equipment sends the acquired image to a server; the server performs text recognition on the image and sends a text recognition result obtained by recognition to the terminal equipment; and the terminal equipment receives and displays the text recognition result. For example, a monitoring device on a road acquires an image including a license plate number of a vehicle and sends the image to a server, and the server identifies the license plate number in the image. For another example, the user uses a mobile phone to take an image of a signboard and sends the image to the server; the server performs text recognition on the image to obtain a text recognition result, and sends the text recognition result to the terminal equipment; the terminal device displays the text recognition result.
Video text recognition 1: the terminal equipment collects a section of video and respectively performs text recognition on each frame of image in the video. For example, a user uses a mobile phone to shoot a video, wherein a plurality of frames of images in the video comprise at least one character; and the mobile phone respectively performs text recognition on each frame of image in the video to obtain and display a text recognition result.
Video text recognition 2: the terminal device captures a video and sends it to a server, and the server performs text recognition on each frame of the video to obtain a text recognition result. For example, a monitoring device on a road captures a video in which at least one frame contains a license plate number; the monitoring device sends the video to a server, and the server performs text recognition on each frame of the video to obtain at least one license plate number.
In the above scenarios, the text recognition device (i.e., the device that performs the text recognition processing) achieves high text recognition accuracy and can therefore better meet user needs.
The following first introduces an architecture diagram of a text recognition network provided in an embodiment of the present application. The text recognition device can adopt the text recognition network to perform text recognition on the image, and the recognition speed and the recognition accuracy are high.
Referring to fig. 1, fig. 1 is a schematic diagram of the architecture of a text recognition network according to an embodiment of the present disclosure. As shown in fig. 1, the text recognition network may include: a rectification network 104, an encoding network 101, a decoding network 102, and a classification network 103. The rectification network 104 is configured to rectify an input target image (i.e., an image to be rectified) to obtain a rectified target image. The rectification network 104 is optional, not mandatory. In real-world scenes, text may be captured at a skewed angle or from a side view, which makes it hard to recognize by plain scanning, and may also exhibit font variations and slight occlusion. The rectification network 104 rectifies the target image, and text in the rectified target image is easier to recognize. The encoding network 101 is configured to perform feature extraction (also referred to as encoding) on the rectified target image output by the rectification network 104 to obtain a forward feature sequence; it can also extract features directly from the input target image to obtain a forward feature sequence. The encoding network 101 may be a convolutional neural network or another neural network capable of extracting features from an image; the embodiments of the present application are not limited in this respect. In some embodiments, the decoding network 102 is configured to decode the forward feature sequence output by the encoding network 101 to obtain a first feature sequence, and to decode the reverse feature sequence to obtain a second feature sequence; the reverse feature sequence and the forward feature sequence contain the same feature data arranged in opposite orders.
In this embodiment, the decoding network 102 may decode the forward feature sequence first and then the reverse feature sequence, or decode the reverse feature sequence first and then the forward feature sequence. In some embodiments, the decoding network 102 includes a first sub-decoding network for decoding the forward feature sequence output by the encoding network 101 to obtain the first feature sequence, and a second sub-decoding network for decoding the reverse feature sequence to obtain the second feature sequence. In this embodiment, the decoding network 102 decodes the forward feature sequence and the reverse feature sequence at the same time, which improves decoding efficiency.
Optionally, the decoding network 102 may be an attention mechanism network, such as one based on a gated recurrent unit (GRU). The GRU is a gated recurrent neural network. For example, the decoding network 102 receives a forward feature sequence extracted by the encoding network 101, such as a feature sequence of size [20, 2048]; after processing by the decoding network 102, it outputs a decoded first feature sequence, for example a feature sequence of size [20, 512]. The classification network 103 is configured to perform prediction on the first feature sequence output by the decoding network 102 to obtain a first text recognition result (i.e., a text sequence) and a first confidence of the first text recognition result, and to perform prediction on the second feature sequence output by the decoding network 102 to obtain a second text recognition result (i.e., a text sequence) and a second confidence of the second text recognition result. The classification network 103 may be a fully connected layer or another classification network. The text recognition network outputs the first text recognition result if the first confidence is higher than the second confidence, and outputs the second text recognition result if the second confidence is higher than the first confidence. It can be understood that the text recognition network obtains two text recognition results and outputs the one with the higher confidence, giving high text recognition accuracy.
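The inference flow through the networks of fig. 1 can be summarized in a short Python sketch. Here `encode`, `decode`, and `classify` are toy stand-ins for networks 101, 102, and 103, not implementations of them, and all names are assumptions:

```python
# High-level sketch of the bidirectional inference flow described above.

def recognize(image, encode, decode, classify):
    forward = encode(image)                    # encoding network 101
    reverse = list(reversed(forward))          # reverse mapping of the feature sequence
    first = decode(forward)                    # decoding network 102, forward branch
    second = decode(reverse)                   # decoding network 102, reverse branch
    text1, conf1 = classify(first)             # classification network 103
    text2, conf2 = classify(list(reversed(second)))  # reverse-map back, then classify
    return text1 if conf1 >= conf2 else text2  # output the higher-confidence result

# Toy stand-ins just to show the data flow:
encode = lambda img: [[float(ord(c))] for c in img]
decode = lambda seq: seq
classify = lambda seq: ("".join(chr(int(v[0])) for v in seq), 0.9)

print(recognize("cat", encode, decode, classify))  # -> cat
```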
It should be understood that the encoding network in the embodiment of the present disclosure may include an encoder or further include other components, the decoding network may include a decoder or further include other components, and the classification network may include a classifier; the text recognition network may also include other networks or omit part of the networks in fig. 1, which is not limited by the embodiment of the present disclosure.
The process of training to obtain the text recognition network of FIG. 1 is described below.
Referring to fig. 2, fig. 2 is a flowchart of a training method of a text recognition network according to an embodiment of the present disclosure. In some embodiments, the training apparatus trains the text recognition network of fig. 1 in the following manner: correcting an original sample through the correction network 104 to obtain a training sample; encoding the training sample through the encoding network 101 to obtain a first training feature sequence; decoding the first training feature sequence through the decoding network 102 to obtain a third training feature sequence; predicting the third training feature sequence through the classification network 103 to obtain a first recognition result; performing reverse mapping on the first training feature sequence to obtain a second training feature sequence; decoding the second training feature sequence through the decoding network 102 to obtain a fourth training feature sequence; performing reverse mapping on the fourth training feature sequence to obtain a fifth training feature sequence; predicting the fifth training feature sequence through the classification network 103 to obtain a second recognition result; determining a first network loss according to the first recognition result and a standard recognition result; determining a second network loss according to the second recognition result and the standard recognition result; calculating a weighted sum of the first network loss and the second network loss to obtain a third network loss; and, with the third network loss, updating the parameters of the classification network 103 (corresponding to 105 in fig. 2), the encoding network 101 (corresponding to 106 in fig. 2), the decoding network 102 (corresponding to 107 in fig. 2), and the correction network 104 (corresponding to 108 in fig. 2) by back propagation.
The standard recognition result is a pre-labeled recognition result, namely a recognition result expected by performing text recognition on the original sample. Optionally, the training apparatus uses a gradient descent method to update the parameters of the networks in fig. 2. Optionally, during training, the feature sequences decoded in two directions (i.e., the third training feature sequence and the fourth training feature sequence) are respectively combined with the forward label and the backward label to calculate loss functions, and the obtained loss functions are added and then subjected to gradient back-propagation to update parameters of the text recognition network. During training, the training method provided by the embodiment of the application adopts a bidirectional decoding mode, so that the supervision on a correction network is enhanced, and the robustness of a text recognition network is improved.
In some embodiments, the decoding network 102 includes a first sub-decoding network and a second sub-decoding network, and the training device trains the text recognition network in fig. 1 as follows: correcting an original sample through the correction network 104 to obtain a training sample; encoding the training sample through the encoding network 101 to obtain a first training feature sequence; decoding the first training feature sequence through the first sub-decoding network in the decoding network 102 to obtain a third training feature sequence; predicting the third training feature sequence through the classification network 103 to obtain a first recognition result; performing reverse mapping on the first training feature sequence to obtain a second training feature sequence; decoding the second training feature sequence through the second sub-decoding network in the decoding network 102 to obtain a fourth training feature sequence; performing reverse mapping on the fourth training feature sequence to obtain a fifth training feature sequence; predicting the fifth training feature sequence through the classification network 103 to obtain a second recognition result; determining a first network loss according to the first recognition result and a standard recognition result; determining a second network loss according to the second recognition result and the standard recognition result; calculating a weighted sum of the first network loss and the second network loss to obtain a third network loss; updating the parameters of the classification network 103 (corresponding to 105 in fig. 2), the encoding network 101 (corresponding to 106 in fig. 2), and the correction network 104 (corresponding to 108 in fig. 2) by back propagation with the above-mentioned third network loss; updating the parameters of the first sub-decoding network by using the first network loss; and updating the parameters of the second sub-decoding network by using the second network loss.
The training device updates parameters of the text recognition network based on the first network loss and the second network loss; therefore, two text recognition results with higher recognition precision can be obtained by performing text recognition by using the trained text recognition network.
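The loss computation shared by both training procedures above can be sketched as follows. The cross-entropy form of the per-direction loss and the equal weights are assumptions made for illustration; the disclosure only specifies that a weighted sum of the first and second network losses is taken to obtain the third network loss.

```python
import math

def cross_entropy(pred_probs, target_indices):
    """Hypothetical per-direction loss: negative log-probability of the
    labeled (standard) character at each position of the recognition
    result. The disclosure does not name a specific loss function."""
    return -sum(math.log(p[t]) for p, t in zip(pred_probs, target_indices))

def combined_loss(first_loss, second_loss, w1=0.5, w2=0.5):
    # Weighted sum of the forward- and reverse-direction losses; the
    # equal weights here are an assumption, not taken from the text.
    return w1 * first_loss + w2 * second_loss
```

The third network loss produced by `combined_loss` would then drive a single back-propagation pass through the correction, encoding, decoding, and classification networks.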
Referring to fig. 3, fig. 3 is a flowchart of a text recognition method according to an embodiment of the present disclosure. As shown in fig. 3, the text recognition method may be implemented by the text recognition network in fig. 1, and the method may include:
301. And the text recognition device performs encoding processing on the target image to obtain a forward feature sequence and a reverse feature sequence of the target image.
The text recognition device may be a terminal device such as a mobile phone, a tablet computer, a wearable device, a notebook computer, or a desktop computer, or may be a server. The target image may be an image including at least one character, such as an image obtained by photographing a license plate number. Optionally, before performing step 301, the text recognition apparatus may perform the following operation: correcting the target image through the correction network 104 to obtain a corrected target image. Illustratively, the correction network is a Spatial Transformer Network (STN).
For example, the text recognition device performs encoding processing on the target image to obtain the forward feature sequence and the reverse feature sequence of the target image in the following implementation manner: encoding the target image to obtain the forward feature sequence; and performing reverse mapping on the forward feature sequence to obtain the reverse feature sequence. The encoding process performed on the target image to obtain the forward feature sequence may be: inputting the target image into the encoding network 101 for encoding processing to obtain the forward feature sequence. The reverse mapping of the forward feature sequence to obtain the reverse feature sequence may be: reversing the order of the feature vectors included in the forward feature sequence to obtain the reverse feature sequence. For example, the M feature vectors included in the forward feature sequence are, in order, the first feature vector, the second feature vector, up to the M-th feature vector, and the M feature vectors included in the reverse feature sequence are, in order, the M-th feature vector, the (M-1)-th feature vector, down to the first feature vector, where M is an integer greater than 1.
In this implementation, by mapping the forward feature sequence in the reverse direction, a feature sequence that is the same as the feature data included in the forward feature sequence and has the reverse arrangement order can be obtained quickly.
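This reverse mapping amounts to reversing the order of the feature vectors while leaving each vector's own contents unchanged; a minimal sketch, with an illustrative sequence of M = 4 feature vectors (the sequence length and vector size are not taken from the disclosure):

```python
# Hypothetical forward feature sequence: M = 4 feature vectors of length 3.
forward_seq = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

# Reverse mapping: invert the order of the feature vectors; each
# vector's elements stay in their original order.
reverse_seq = forward_seq[::-1]
```

Here the first vector of `reverse_seq` is the M-th vector of `forward_seq`, matching the description above.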
302. And decoding the forward characteristic sequence to obtain a first characteristic sequence.
For example, a forward feature sequence extracted by the encoding network 101, such as a feature sequence with a size of [20, 2048], is input into the decoding network 102; after the decoding process of the decoding network 102, a decoded first feature sequence, for example, a feature sequence with a size of [20, 512], is output. In this example, the forward feature sequence includes 20 feature vectors, each of length 2048, that is, each feature vector includes 2048 elements; the first feature sequence includes 20 feature vectors, each of length 512, that is, each feature vector includes 512 elements. Optionally, the decoding network is an attention mechanism network, such as a GRU.
303. And decoding the reverse characteristic sequence to obtain a second characteristic sequence.
The reverse feature sequence and the forward feature sequence include the same feature data and are arranged in opposite order. Optionally, before performing step 303, the text recognition apparatus may perform the following operation: performing reverse mapping on the forward feature sequence to obtain the reverse feature sequence.
Fig. 4 is a schematic diagram of a text recognition process according to an embodiment of the present application. As shown in fig. 4, the text recognition process may sequentially go through the following three steps: feature encoding, bi-directional feature decoding, and class prediction. One implementation of feature encoding is step 301, where the target image is feature encoded to obtain a feature sequence, such as a forward feature sequence. One implementation of bi-directional feature decoding includes steps 302 and 303, with step 302 decoding features for one direction and step 303 decoding features for the other direction. Fig. 5 is a schematic diagram of a bidirectional feature decoding process according to an embodiment of the present application. As shown in fig. 5, the implementation of bi-directional feature decoding includes the following steps: performing reverse mapping on the forward characteristic sequence obtained by executing the characteristic coding to obtain a reverse characteristic sequence; and respectively inputting the forward characteristic sequence and the reverse characteristic sequence into a decoding network for decoding processing to obtain a forward decoded characteristic sequence (corresponding to a first characteristic sequence) and a reverse decoded characteristic sequence (corresponding to a second characteristic sequence). One implementation of classification prediction may be: performing prediction processing on the first feature sequence through a classification network to obtain a first text recognition result and a first confidence coefficient of the first text recognition result; and performing prediction processing on the second feature sequence through a classification network to obtain a second text recognition result and a second confidence coefficient of the second text recognition result.
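The three-stage flow just described (feature encoding, bidirectional feature decoding, and class prediction) can be sketched in Python. The function signatures are assumptions made for illustration; `encode`, `decode`, and `classify` stand in for the trained encoding, decoding, and classification networks:

```python
def recognize(image, encode, decode, classify):
    """High-level sketch of the flow in Figs. 4 and 5: encode the image,
    decode the feature sequence in both directions, classify each
    decoded sequence, and keep the more confident result."""
    forward_seq = encode(image)           # feature encoding (step 301)
    reverse_seq = forward_seq[::-1]       # reverse mapping
    first_seq = decode(forward_seq)       # forward decoding (step 302)
    second_seq = decode(reverse_seq)      # reverse decoding (step 303)
    text1, conf1 = classify(first_seq)    # class prediction, direction 1
    text2, conf2 = classify(second_seq)   # class prediction, direction 2
    # Output whichever direction's result carries the higher confidence.
    return (text1, conf1) if conf1 >= conf2 else (text2, conf2)
```

Whichever direction yields the higher confidence determines the final output, matching the selection rule applied after classification prediction.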
304. And obtaining a target text recognition result based on the first characteristic sequence and the second characteristic sequence.
In an optional implementation manner, the text recognition network further includes a classification network, and the obtaining a target text recognition result based on the first feature sequence and the second feature sequence includes: performing classification prediction processing based on the first feature sequence to obtain a first text recognition result and a first confidence of the first text recognition result; performing classification prediction processing based on the second feature sequence to obtain a second text recognition result and a second confidence of the second text recognition result; and determining the target text recognition result from the first text recognition result and the second text recognition result based on the first confidence and the second confidence. For example, the classification prediction processing based on the first feature sequence to obtain the first text recognition result and the first confidence of the first text recognition result may be: performing classification prediction processing on the first feature sequence through the classification network to obtain the first text recognition result and the first confidence of the first text recognition result. For example, the classification prediction processing based on the second feature sequence to obtain the second text recognition result and the second confidence of the second text recognition result may be: performing classification prediction processing on the second feature sequence through the classification network to obtain the second text recognition result and the second confidence of the second text recognition result.
Optionally, one implementation manner of performing classification prediction processing on the second feature sequence through the classification network to obtain a second text recognition result and a second confidence of the second text recognition result is as follows: reversely mapping the second characteristic sequence to obtain a third characteristic sequence; and inputting the third feature sequence into the classification network for classification prediction processing to obtain the second text recognition result and the second confidence.
For example, determining the target text recognition result from the first text recognition result and the second text recognition result based on the first confidence and the second confidence may be: determining the first text recognition result as the target text recognition result when the first confidence is higher than the second confidence. The first confidence may indicate the probability that the first text recognition result is a correct text recognition result, and the second confidence may indicate the probability that the second text recognition result is a correct text recognition result. Assuming that the first text recognition result includes two recognized words with confidences of 0.8 and 0.9, respectively, the first confidence of the first text recognition result is (0.8 × 0.9). Optionally, the first text recognition result and the first confidence correspond to a likelihood matrix output by the classification network, where the likelihood matrix includes the probabilities of the words or characters obtained by the prediction processing of the classification network. The text recognition device can obtain the first text recognition result and the first confidence from the likelihood matrix. For example, a likelihood matrix output by the classification network includes probabilities for 10 words, where the probability (i.e., confidence) of each word represents the probability that the word is the correct result. Optionally, the second text recognition result and the second confidence correspond to another likelihood matrix output by the classification network.
In practical application, the classification network performs classification prediction processing on the decoded feature sequence input to it, thereby obtaining a likelihood matrix; the text recognition device obtains a text recognition result and the confidence of that result from the likelihood matrix. In some embodiments, the text recognition device may calculate a confidence for each text recognition result based on the output of the classification network. It can be understood that the first confidence being higher than the second confidence indicates that the probability that the first text recognition result is the correct text recognition result is higher than the probability that the second text recognition result is the correct one. In this implementation, two text recognition results can be obtained by bidirectional decoding, and the better of the two is selected as the final text recognition result; this effectively improves the accuracy of text recognition and is simple to implement.
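As a sketch of how a text recognition result and its confidence might be read out of a likelihood matrix, assuming each row of the matrix holds the predicted probabilities of every candidate character at one text position (the matrix contents and character set used below are made up for illustration):

```python
def decode_likelihood_matrix(matrix, charset):
    """At each position, take the most probable character; the overall
    confidence is the product of the chosen per-character probabilities,
    as in the 0.8 x 0.9 example above."""
    text, confidence = "", 1.0
    for row in matrix:
        best = max(range(len(row)), key=row.__getitem__)
        text += charset[best]
        confidence *= row[best]
    return text, confidence
```

For instance, with a two-character charset "ab" and rows [0.1, 0.9] and [0.8, 0.2], the result is "ba" with confidence 0.9 × 0.8.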
Compared with a traditional single-direction decoding network, the text recognition method provided by the embodiment of the present application improves text recognition accuracy by about 1 percentage point on multiple data sets such as ICDAR 2015.
According to the text recognition method and device, the text recognition result is obtained through prediction processing by combining the first feature sequence and the second feature sequence (namely the feature sequences in two directions), and compared with the text recognition result obtained through prediction processing by only using the feature sequence in one direction, the text recognition precision is higher.
The method flow in fig. 3 may be understood as a flow in which the text recognition device performs text recognition on the target image, or as a partial flow in the process in which the text recognition device trains the text recognition network. That is, the text recognition apparatus in fig. 3 may also execute the training method flow in fig. 2, i.e., the text recognition apparatus may also serve as a training apparatus. It should be understood that if the method flow in fig. 3 is a flow in which the text recognition device performs text recognition on the target image, the steps in fig. 3 are all steps that the text recognition device needs to perform. If the method flow in fig. 3 is part of the process in which the text recognition device trains the text recognition network, the text recognition device may further need to perform the operation of updating the parameters of the text recognition network.
In some embodiments, the target text recognition result includes a first text recognition result and a second text recognition result; the obtaining a target text recognition result based on the first feature sequence and the second feature sequence (i.e., step 304) may include: classifying and predicting the first characteristic sequence through a text recognition network to obtain a first text recognition result; classifying and predicting the second characteristic sequence through the text recognition network to obtain a second text recognition result; after performing step 304, the text recognition apparatus may further perform the following operations: and updating the parameters of the text recognition network based on the first text recognition result and the second text recognition result. For example, the updating of the parameter of the text recognition network based on the first text recognition result and the second text recognition result is implemented as follows: obtaining a first network loss based on the first text recognition result and the standard text recognition result; the standard text recognition result is a result expected by processing the target image through the text recognition network; obtaining a second network loss based on the second text recognition result and the standard text recognition result; calculating a weighted sum of the first network loss and the second network loss to obtain a third network loss; and updating the parameters of the text recognition network by using the third network loss.
In this implementation, the parameters of the text recognition network are updated with the third network loss, so that the updated text recognition network can better perform the text recognition processing from two directions.
The foregoing embodiments do not describe in detail how the decoding network 102 performs the decoding process. In some embodiments, the decoding network may be a Long Short-Term Memory network (LSTM), a GRU, or the like. The decoding process implemented by the decoding network 102 is described below taking a GRU as an example.
Fig. 6 is a flowchart of a decoding processing method according to an embodiment of the present application. As shown in fig. 6, the decoding processing method may include:
601. The text recognition device inputs the feature map into a single-layer gated recurrent unit (GRU) to obtain the parameters of the hidden layer.
The feature map may be the forward feature sequence or the reverse feature sequence.
602. And taking the obtained parameters of the hidden layer as the input of the single-layer gated recurrent unit to obtain new parameters of the hidden layer.
603. And calculating a context information vector from the parameters of the current hidden layer and the input feature map.
Illustratively, the context information vector is calculated based on the parameters of the current hidden layer of the GRU and the above feature map by using the following formula:
a_t(s) = score(h_t, h̄_s)
wherein a_t(s) represents the calculated context information vector, h_t represents the parameters of the hidden layer, h̄_s represents the feature map, and score(·, ·) may be a similarity calculation function.
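A sketch of the context-vector computation in step 603, assuming a dot-product similarity as the score function and a softmax normalization over positions; both choices are assumptions for illustration, since the text only states that the score may be a similarity calculation function:

```python
import math

def attention_context(hidden, feature_map):
    """Score each feature vector against the hidden-layer parameters,
    normalize the scores with a softmax, and return the weighted sum
    of the feature vectors as the context information vector."""
    # score(h_t, f_s): dot-product similarity (an assumed score form).
    scores = [sum(h * f for h, f in zip(hidden, feat)) for feat in feature_map]
    # Softmax normalization to obtain the attention weights a_t(s).
    exp_scores = [math.exp(s - max(scores)) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    # Context vector: weighted sum of the feature vectors.
    dim = len(feature_map[0])
    return [sum(w * feat[i] for w, feat in zip(weights, feature_map))
            for i in range(dim)]
```

With a hidden state aligned to the first feature vector, the context vector leans toward that vector, which is the intended attention behavior.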
604. The text that has been recognized at present is converted into an embedding vector by an embedding layer (embed).
For example, what the embedding layer does here is to express the already-recognized word "deep" as a vector [0.32, 0.02, 0.48, 0.21, 0.56, 0.15]. It should be understood that before any text has been recognized, the text recognition device does not perform steps 604 and 605, but directly outputs the current context information vector to the classification network.
605. And fusing the context information vector and the embedded vector, and outputting the fused vector to the classification network.
Optionally, the text recognition apparatus may fuse the context information vector and the embedded vector by concatenating the two vectors, or by summing them to obtain a new vector. The fused vector is subjected to classification prediction processing by the classification network to obtain a text recognition result, which is then passed through a normalized exponential function (softmax) to obtain a final likelihood matrix (i.e., a confidence matrix). The likelihood matrix may include the probability that the text recognition result is each word or character. In some embodiments, the text recognition apparatus performs steps 601 to 605 once to obtain a part of the text recognition result (e.g., one word), and may loop through steps 601 to 605 until a complete text recognition result is obtained.
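The embedding and fusion steps (604 and 605) can be sketched as follows. The embedding table is hypothetical, reusing only the example vector for the word "deep" from step 604; both fusion modes described above are shown:

```python
# Hypothetical embedding table; the vector for "deep" is the example
# used in step 604.
EMBED = {"deep": [0.32, 0.02, 0.48, 0.21, 0.56, 0.15]}

def fuse(context_vec, embed_vec, mode="concat"):
    """Fuse the context information vector with the embedding vector,
    either by linking (concatenating) the two vectors or by summing
    them element-wise, as described in step 605."""
    if mode == "concat":
        return list(context_vec) + list(embed_vec)
    return [c + e for c, e in zip(context_vec, embed_vec)]
```

Concatenation preserves both vectors in full (doubling the length), while summation keeps the original dimensionality; which variant a real network uses determines the input size of the following classification layer.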
Fig. 7 is a flowchart of another text recognition method according to an embodiment of the present application. As shown in fig. 7, the method may include:
701. the text recognition device collects a text image.
The text recognition device can be a mobile phone with a camera, a tablet computer and other electronic equipment. For example, a user may launch a camera application of a cell phone and take an image that includes at least one character or text, resulting in a text image. For example, a user uses a mobile phone (i.e., a text recognition device) to capture a courier note, a business card, a plaque, a piece of text, etc. to obtain a text image.
702. The text recognition device receives a text recognition instruction input by a user.
The text recognition instruction is used for instructing the text recognition device to perform text recognition on the text image.
703. The text recognition device inputs the text image into a text recognition network for text recognition to obtain a text recognition result.
Optionally, the text recognition device inputs the text image to the text recognition network in fig. 1 for text recognition to obtain the text recognition result. The implementation of this step can be seen in the method flow of fig. 3. By adopting the text recognition network in fig. 1, the text recognition device can obtain the text recognition result rapidly and accurately.
704. The text recognition device displays the text recognition result.
In the embodiment of the application, the user acquires the text image by using the text recognition device and performs text recognition on the text image, and the text recognition precision is high.
Fig. 8 is a flowchart of another text recognition method according to an embodiment of the present application. As shown in fig. 8, the method may include:
801. the terminal device collects a text image.
The terminal device can be an electronic device such as a mobile phone and a tablet computer with a camera. The text image is an image including at least one character. For example, a user may launch a camera application of a cell phone and take an image that includes at least one character or text, resulting in a text image. For example, a user uses a mobile phone (i.e., a terminal device) to photograph an express note, a business card, a plaque, a piece of text, etc. to obtain a text image.
802. And the terminal equipment sends the acquired text image to a server.
803. And the server inputs the text image into a text recognition network for text recognition to obtain a text recognition result.
Optionally, the server inputs the text image to the text recognition network in fig. 1 for text recognition, so as to obtain the text recognition result. The implementation of step 803 can refer to the method flow in fig. 3. The server is provided with the text recognition network in fig. 1, and the server can rapidly and accurately recognize the text recognition result by using the text recognition network in fig. 1.
804. And the server sends the text recognition result to the terminal equipment.
For example, the terminal device sends an image including a plurality of texts to the server, the server performs text recognition on the image to obtain a text recognition result, the server generates a file including the text recognition result and sends the file to the terminal device, and a user can edit the file to obtain a file required by the user by using the terminal device.
In some embodiments, after performing step 803, the server may further perform the following operations: storing the text recognition result, or updating a database using the text recognition result. For example, a terminal device (i.e., a monitoring device) on a road acquires an image including a license plate number; the terminal device sends the image to the server; the server performs text recognition on the image to obtain at least one license plate number; and the server stores the license plate number and records the time at which the image was received. For another example, the terminal device (e.g., held by a courier) photographs an express receipt to obtain an express receipt image; the terminal device sends the express receipt image to the server; the server performs text recognition on the express receipt image to obtain express information; and the database is updated with the express information. The database may include express information for multiple users.
It should be understood that the server often has computing and storage advantages that terminal devices (such as mobile phones) cannot match; therefore, by sending collected text images to the server for text recognition, the terminal device can obtain text recognition results more quickly and with higher accuracy.
Fig. 9 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present application, and as shown in fig. 9, the apparatus includes:
an encoding unit 901, configured to perform encoding processing on a target image to obtain a forward feature sequence and a reverse feature sequence of the target image;
a decoding unit 902, configured to perform decoding processing on the forward characteristic sequence to obtain a first characteristic sequence;
a decoding unit 902, further configured to perform decoding processing on the reverse characteristic sequence to obtain a second characteristic sequence;
and the processing unit 903 is configured to obtain a target text recognition result based on the first feature sequence and the second feature sequence.
In an optional implementation manner, the encoding unit 901 is specifically configured to perform encoding processing on the target image to obtain the forward feature sequence; the above-mentioned device still includes:
a reverse mapping unit 904, configured to perform reverse mapping on the forward characteristic sequence to obtain the reverse characteristic sequence.
In an optional implementation manner, the apparatus further includes:
a correcting unit 905 configured to perform correction processing on the target image to obtain a corrected target image;
the encoding unit 901 is specifically configured to perform encoding processing on the corrected target image to obtain the forward feature sequence and the reverse feature sequence.
In an optional implementation manner, the processing unit 903 is specifically configured to perform classification prediction processing based on the first feature sequence to obtain a first text recognition result and a first confidence of the first text recognition result;
performing classification prediction processing based on the second feature sequence to obtain a second text recognition result and a second confidence coefficient of the second text recognition result;
and determining the target text recognition result from the first text recognition result and the second text recognition result based on the first confidence level and the second confidence level.
In an optional implementation manner, the processing unit 903 is specifically configured to determine the first text recognition result as the target text recognition result when the first confidence is higher than the second confidence.
In an optional implementation manner, the processing unit 903 is specifically configured to perform reverse mapping on the second feature sequence to obtain a third feature sequence; and performing classification prediction processing on the third feature sequence to obtain the second text recognition result and the second confidence.
In an alternative implementation, the decoding process of the forward signature sequence and the reverse signature sequence is performed by an attention mechanism network.
In an alternative implementation, the attention mechanism network includes a gated recurrent unit (GRU).
In an optional implementation manner, the target text recognition result includes a first text recognition result and a second text recognition result;
a processing unit 903, configured to perform classification prediction processing on the first feature sequence through a text recognition network to obtain the first text recognition result; classifying and predicting the second characteristic sequence through the text recognition network to obtain a second text recognition result; the above-mentioned device still includes:
an updating unit 906, configured to update a parameter of the text recognition network based on the first text recognition result and the second text recognition result.
In an alternative implementation, the updating unit 906 is specifically configured to obtain a first network loss based on the first text recognition result and the expected recognition result of the target image; obtaining a second network loss based on the second text recognition result and the expected recognition result; updating parameters of the text recognition network based on the first network loss and the second network loss.
In an optional implementation manner, the updating unit 906 is specifically configured to calculate a weighted sum of the first network loss and the second network loss to obtain a third network loss, and to update the parameters of the text recognition network based on the third network loss.
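The weighted combination of the two branch losses described above can be sketched as follows; the default weight values are illustrative, since the patent does not fix them:

```python
def combined_loss(first_loss, second_loss, first_weight=0.5, second_weight=0.5):
    """Weighted sum of the forward-branch and reverse-branch losses,
    yielding the 'third network loss' used for the parameter update."""
    return first_weight * first_loss + second_weight * second_loss
```

With equal weights, `combined_loss(2.0, 4.0)` gives the simple average `3.0`.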
In an optional implementation manner, the decoding unit 902 is specifically configured to calculate a context information vector based on the current hidden-layer parameters of a gated recurrent unit (GRU) and the forward feature sequence, where the context information vector represents the association among the features included in the forward feature sequence; convert a reference text into an embedded vector through an embedding layer of the GRU, where the reference text is obtained by performing classification prediction processing on a vector output by the GRU; and fuse the context information vector and the embedded vector to obtain a target vector, where the target vector is included in the first feature sequence.
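A simplified, framework-free sketch of this decoding step is given below. The attention scores are computed as plain dot products between the hidden-layer parameters and each forward feature, followed by a softmax, and the fusion is modeled as concatenation. Both choices are assumptions for illustration; the actual scoring and fusion operations of the GRU-based network may differ:

```python
import math

def attention_context(hidden, features):
    """Compute a context information vector as an attention-weighted sum
    over the forward feature sequence; scores are dot products of the
    hidden-layer parameters with each feature, normalized by softmax."""
    scores = [sum(h * f for h, f in zip(hidden, feat)) for feat in features]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * feat[i] for w, feat in zip(weights, features))
            for i in range(dim)]

def fuse(context, embedded):
    """Fuse the context vector with the embedded reference text by
    concatenation to form the target vector fed back into the GRU."""
    return context + embedded
```

With a zero hidden state, all attention weights are equal, so the context vector is the mean of the features.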
In an optional implementation manner, the processing unit 903 is further configured to input the forward feature sequence to the GRU to obtain hidden-layer parameters, and to feed the obtained hidden-layer parameters back into the GRU to obtain updated hidden-layer parameters.
In an alternative implementation, the text recognition apparatus is a server, and the apparatus further includes:
a receiving unit 907 for receiving a target image from a terminal device;
a sending unit 908, configured to send the target text recognition result to the terminal device.
It should be understood that the above division of the units of the text recognition apparatus is only a logical division; in an actual implementation, the units may be wholly or partially integrated into one physical entity, or may be physically separate. For example, the units may be separately configured processing elements, may be integrated into the same chip, or may be stored in a storage element of a controller in the form of program code that a processing element of the processor calls to execute the functions of the units. In addition, the units may be integrated together or implemented independently. The processing element may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method or units may be completed by hardware integrated logic circuits in the processor element or by instructions in the form of software. The processing element may be a general-purpose processor, such as a central processing unit (CPU), or one or more integrated circuits configured to implement the above method, for example one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs).
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 10, the electronic device 100 includes a processor 1001, a memory 1002, and a communication interface 1003, which are connected to one another by a bus. The electronic device in fig. 10 may be the text recognition apparatus or the training apparatus in the foregoing embodiments.
The memory 1002 includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM), and is used to store related instructions and data. The communication interface 1003 is used for receiving and transmitting data.
The processor 1001 may be one or more Central Processing Units (CPUs), and in the case where the processor 1001 is one CPU, the CPU may be a single-core CPU or a multi-core CPU. The steps performed by the text recognition means in the above-described embodiment may be based on the structure of the electronic device shown in fig. 10. In particular, the processor 1001 may implement the functions of the units in fig. 9.
The processor 1001 in the electronic device 100 is configured to read the program code stored in the memory 1002 and execute the text recognition method or the training method in the foregoing embodiments.
Fig. 11 is a schematic structural diagram of a server according to an embodiment of the present disclosure. The server 1100 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1122 (e.g., one or more processors), a memory 1132, and one or more storage media 1130 (e.g., one or more mass storage devices) storing an application program 1142 or data 1144. The memory 1132 and the storage media 1130 may provide transient or persistent storage. The program stored on a storage medium 1130 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processor 1122 may communicate with the storage medium 1130 to execute the series of instruction operations in the storage medium 1130 on the server 1100. The server 1100 may be the text recognition apparatus and/or the training apparatus provided herein.
The server 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input-output interfaces 1158, and/or one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the text recognition means and the training means in the above-described embodiment may be based on the server structure shown in fig. 11. Specifically, the central processor 1122 may implement the functions of the units in fig. 9.
In an embodiment of the present application, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements: encoding a target image to obtain a forward feature sequence and a reverse feature sequence of the target image; decoding the forward feature sequence to obtain a first feature sequence; decoding the reverse feature sequence to obtain a second feature sequence; and obtaining a target text recognition result based on the first feature sequence and the second feature sequence.
In an embodiment of the present application, there is provided another computer-readable storage medium storing a computer program which, when executed by a processor, implements: encoding training samples through an encoding network to obtain a first training feature sequence; obtaining a first network loss based on a processing result obtained by decoding the first training feature sequence through a decoding network; obtaining a second network loss based on a processing result obtained by decoding a second training feature sequence through the decoding network, wherein the second training feature sequence and the first training feature sequence comprise the same feature data arranged in opposite orders; and updating parameters of the encoding network and the decoding network based on the first network loss and the second network loss.
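The relationship between the two training feature sequences, together with a stand-in loss, can be sketched as follows; `toy_loss` is purely illustrative, as the patent does not specify the loss function:

```python
def make_training_pair(features):
    """Build the two training feature sequences: the second contains the
    same feature data as the first, arranged in the opposite order."""
    first = list(features)
    second = list(reversed(features))
    return first, second

def toy_loss(prediction, expected):
    """Illustrative per-position 0/1 loss between a decoded prediction
    and the expected recognition result."""
    return sum(p != e for p, e in zip(prediction, expected))
```

The forward-branch and reverse-branch losses computed this way would then be combined to update the encoding and decoding networks.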
Embodiments of the present application provide a computer program product containing instructions, which when run on a computer, cause the computer to perform the text recognition method or the training method provided by the foregoing embodiments.
While the invention has been described with reference to specific embodiments, its scope is not limited thereto, and those skilled in the art can readily conceive of equivalent modifications or substitutions within the technical scope disclosed herein. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of text recognition, the method comprising:
encoding a target image to obtain a forward feature sequence and a reverse feature sequence of the target image;
decoding the forward feature sequence to obtain a first feature sequence;
decoding the reverse feature sequence to obtain a second feature sequence; and
obtaining a target text recognition result based on the first feature sequence and the second feature sequence.
2. The method according to claim 1, wherein the encoding of the target image to obtain the forward feature sequence and the reverse feature sequence of the target image comprises:
encoding the target image to obtain the forward feature sequence; and
performing reverse mapping on the forward feature sequence to obtain the reverse feature sequence.
3. The method according to claim 1 or 2, wherein before the target image is encoded to obtain the forward feature sequence and the reverse feature sequence of the target image, the method further comprises:
correcting the target image to obtain a corrected target image;
wherein the encoding of the target image to obtain the forward feature sequence and the reverse feature sequence of the target image comprises:
encoding the corrected target image to obtain the forward feature sequence and the reverse feature sequence.
4. The method according to any one of claims 1 to 3, wherein the obtaining of a target text recognition result based on the first feature sequence and the second feature sequence comprises:
performing classification prediction processing based on the first feature sequence to obtain a first text recognition result and a first confidence of the first text recognition result;
performing classification prediction processing based on the second feature sequence to obtain a second text recognition result and a second confidence of the second text recognition result; and
determining the target text recognition result from the first text recognition result and the second text recognition result based on the first confidence and the second confidence.
5. A text recognition apparatus, comprising:
an encoding unit, configured to encode a target image to obtain a forward feature sequence and a reverse feature sequence of the target image;
a decoding unit, configured to decode the forward feature sequence to obtain a first feature sequence;
the decoding unit being further configured to decode the reverse feature sequence to obtain a second feature sequence; and
a processing unit, configured to obtain a target text recognition result based on the first feature sequence and the second feature sequence.
6. The apparatus of claim 5,
the encoding unit is specifically configured to encode the target image to obtain the forward feature sequence; and the apparatus further comprises:
a reverse mapping unit, configured to perform reverse mapping on the forward feature sequence to obtain the reverse feature sequence.
7. The apparatus of claim 5 or 6, further comprising:
a correction unit, configured to correct the target image to obtain a corrected target image;
wherein the encoding unit is specifically configured to encode the corrected target image to obtain the forward feature sequence and the reverse feature sequence.
8. The apparatus according to any one of claims 5 to 7,
the processing unit is specifically configured to: perform classification prediction processing based on the first feature sequence to obtain a first text recognition result and a first confidence of the first text recognition result;
perform classification prediction processing based on the second feature sequence to obtain a second text recognition result and a second confidence of the second text recognition result; and
determine the target text recognition result from the first text recognition result and the second text recognition result based on the first confidence and the second confidence.
9. A computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 4.
10. An electronic device, comprising: a memory for storing a program; a processor for executing the program stored by the memory, the processor being configured to perform the method of any of claims 1 to 4 when the program is executed.
CN201911089386.7A 2019-11-08 2019-11-08 Text recognition method and related product Pending CN112784586A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911089386.7A CN112784586A (en) 2019-11-08 2019-11-08 Text recognition method and related product


Publications (1)

Publication Number Publication Date
CN112784586A true CN112784586A (en) 2021-05-11

Family

ID=75748511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911089386.7A Pending CN112784586A (en) 2019-11-08 2019-11-08 Text recognition method and related product

Country Status (1)

Country Link
CN (1) CN112784586A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2691214C1 (en) * 2017-12-13 2019-06-11 Общество с ограниченной ответственностью "Аби Продакшн" Text recognition using artificial intelligence
CN110287960A (en) * 2019-07-02 2019-09-27 中国科学院信息工程研究所 The detection recognition method of curve text in natural scene image


Non-Patent Citations (1)

Title
BAOGUANG SHI, MINGKUN YANG, XINGGANG WANG, PENGYUAN LYU, CONG YA: "ASTER: An Attentional Scene Text Recognizer with Flexible Rectification", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210511