CN112633290A - Text recognition method, electronic device and computer readable medium - Google Patents

Text recognition method, electronic device and computer readable medium

Info

Publication number
CN112633290A
Authority
CN
China
Prior art keywords
image
text
vector
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110238093.1A
Other languages
Chinese (zh)
Inventor
姜明
刘霄
熊泽法
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110238093.1A
Publication of CN112633290A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The embodiment of the application discloses a text recognition method, an electronic device and a computer readable medium. The text recognition method comprises the following steps: performing feature extraction on a text image to be recognized to obtain corresponding image features; performing self-attention calculation processing based on the image features to obtain a corresponding feature coding vector; performing character position enhancement processing based on the image features to obtain a corresponding character position coding vector; splicing the feature coding vector and the character position coding vector, and performing semantic feature extraction on the spliced coding vector to obtain a semantic feature vector; and decoding the semantic feature vector to obtain the corresponding text characters. With the scheme of the embodiment of the application, the finally decoded characters are more accurate in terms of both character recognition accuracy and character order.

Description

Text recognition method, electronic device and computer readable medium
Technical Field
The embodiment of the application relates to the field of computer technology, and in particular to a text recognition method, an electronic device and a computer readable medium.
Background
Text recognition is a technique for detecting an image containing text and acquiring text information corresponding to the image.
When an image containing text is recognized by existing text recognition technology, the recognition result is often inaccurate due to interference from external factors such as image sharpness, image exposure and different text fonts. Taking text fonts as an example, in many scenarios, such as student assignments or test papers, handwritten text appears in addition to printed text. However, handwritten text lacks the regularity of printed text, so the handwriting styles of different students differ greatly. As a result, existing text recognition models that work well on printed text perform poorly on handwritten text and cannot adapt to variations in handwriting style.
Therefore, how to accurately recognize the text in text images has become an urgent problem to be solved.
Disclosure of Invention
The application provides a text recognition method, an electronic device and a computer readable medium, so as to improve the accuracy of text recognition.
According to a first aspect of embodiments of the present application, there is provided a text recognition method, including:
performing feature extraction on a text image to be recognized to obtain corresponding image features;
performing self-attention calculation processing based on the image features to obtain corresponding feature coding vectors; performing character position enhancement processing based on the image characteristics to obtain corresponding character position coding vectors;
splicing the feature coding vector and the character position coding vector, and extracting semantic features of the spliced coding vector to obtain a semantic feature vector;
and decoding the semantic feature vector to obtain a corresponding text character.
According to a second aspect of embodiments of the present application, there is provided a text recognition apparatus, the apparatus including:
the image characteristic obtaining module is used for extracting the characteristics of the text image to be recognized to obtain the corresponding image characteristics;
a coding vector obtaining module, configured to perform self-attention calculation processing based on the image features to obtain corresponding feature coding vectors; performing character position enhancement processing based on the image characteristics to obtain corresponding character position coding vectors;
the semantic feature vector obtaining module is used for splicing the feature coding vector and the character position coding vector and extracting semantic features of the spliced coding vector to obtain a semantic feature vector;
and the text character obtaining module is used for decoding the semantic feature vector to obtain a corresponding text character.
According to a third aspect of embodiments herein, there is provided an electronic apparatus, the apparatus comprising: one or more processors; a computer readable medium configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the text recognition method according to the first aspect.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable medium, on which a computer program is stored, which when executed by a processor, implements the text recognition method according to the first aspect.
According to the scheme provided by the embodiment of the application, when a text image is recognized, on one hand, the extracted image features are further processed by self-attention calculation, so that the obtained feature coding vector can represent the features of the text image more accurately and more specifically; on the other hand, through the character position enhancement processing, character position information can be added on the basis of the extracted image features to accurately express the context relationship between the individual text characters in the text image. Furthermore, semantic feature extraction is performed on the coding vector formed by splicing the feature coding vector and the character position coding vector, so that the extracted semantic feature vector is more accurate and can represent the text features in the text image more effectively. Therefore, the finally decoded characters are more accurate in terms of both character recognition accuracy and character order.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 is a flowchart illustrating steps of a text recognition method according to a first embodiment of the present application;
FIG. 2 is a flowchart illustrating steps of a text recognition method according to a third embodiment of the present application;
fig. 3 is a schematic structural diagram of a neural network model provided in an embodiment of the present application;
fig. 4 is a schematic diagram of a text image obtained after a text image to be recognized is subjected to size normalization;
FIG. 5 is a schematic view of a mask image;
fig. 6 is a schematic view of the text image after the text image shown in fig. 4 is pasted to the center position of the mask image;
FIG. 7 is a schematic diagram of a data processing flow of a self-attention portion of a neural network model provided in an embodiment of the present application;
FIG. 8 is a comparison graph of the effect of text recognition;
fig. 9 is a schematic diagram of a text recognition process according to a third embodiment of the present application;
fig. 10 is a schematic structural diagram of a text recognition apparatus according to a fourth embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device in a fifth embodiment of the present application;
fig. 12 is a hardware structure of an electronic device according to a sixth embodiment of the present application.
Detailed Description
The present application will be described in further detail below with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant application and do not limit the application. It should also be noted that, for convenience of description, only the parts related to the relevant application are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Embodiment One
Referring to fig. 1, a flowchart illustrating steps of a text recognition method according to a first embodiment of the present application is shown.
The text recognition method of the embodiment comprises the following steps:
Step 101, performing feature extraction on a text image to be recognized to obtain corresponding image features.
The text recognition method in the embodiment of the application can be applied to the recognition of various kinds of text. For example, it can be used for the recognition of text images containing only printed text, text images containing only handwritten text, or text images containing both printed and handwritten text. In addition, the text recognition method in the embodiment of the application can also be used for recognizing long texts containing a large amount of text, and is particularly applicable to text recognition of text images containing handwritten text. Meanwhile, in the embodiment of the present application, the text image to be recognized may also be a blurred or distorted image, or a text image with varying font sizes.
The image features can effectively represent the image information of the text image, and the feature extraction of the text image can be implemented in any suitable manner, for example, by an image algorithm or by a neural network model such as a CNN (convolutional neural network) model. The image features obtained in this step may include one or more of the following: color features, texture features, shape features, spatial relationship features, grayscale features, and the like. A color feature is a global feature that describes the surface properties of the scene corresponding to an image or an image region; a texture feature is also a global feature and likewise describes the surface properties of the scene corresponding to an image or an image region; shape features come in two types, contour features and region features, where the contour features of an image mainly concern the outer boundary of an object and the region features relate to the whole shape region; spatial relationship features refer to the mutual spatial positions or relative directional relationships among multiple targets segmented from the image, and these relationships can be divided into connection/adjacency relationships, overlapping relationships, inclusion/containment relationships, and the like.
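For illustration, the following is a minimal sketch (in Python, using PyTorch) of a convolutional feature extractor of the kind described above; the layer sizes, the pooling steps and the collapse of the height dimension are assumptions made for the example, not details prescribed by the present application.

```python
# Minimal sketch of a convolutional feature extractor for a text image.
# Layer sizes and strides are illustrative assumptions only.
import torch
import torch.nn as nn

class ImageFeatureExtractor(nn.Module):
    def __init__(self, in_channels: int = 3, out_channels: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),          # halve H and W
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(128, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # collapse the height dimension so each column becomes one "frame"
            nn.AdaptiveAvgPool2d((1, None)),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, channels, H, W) -> features: (batch, W', out_channels)
        fmap = self.backbone(image)                         # (batch, out_channels, 1, W')
        return fmap.squeeze(2).permute(0, 2, 1)

features = ImageFeatureExtractor()(torch.randn(1, 3, 32, 896))
print(features.shape)  # torch.Size([1, 224, 256])
```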
Step 102, performing self-attention calculation processing based on the image features to obtain a corresponding feature coding vector; and performing character position enhancement processing based on the image features to obtain a corresponding character position coding vector.
Through the self-attention calculation processing, more expressive features can be obtained based on the existing image features, so that the obtained feature coding vectors can more accurately and more specifically represent the features of the text image.
In the embodiment of the application, the character position enhancement processing adds character position information on the basis of the image features, so that the resulting character position coding vector can not only represent the image features of the text image but also carry the position information of each text character.
Step 103, splicing the feature coding vector and the character position coding vector, and performing semantic feature extraction on the spliced coding vector to obtain a semantic feature vector.
Because the feature coding vector obtained in step 102 can represent the features of the text image more accurately and more specifically, while the character position coding vector can both represent the image features of the text image and carry the position information of each text character, splicing the two yields a spliced coding vector that not only contains the character position information but also represents the features of the text image more accurately and more specifically. Therefore, performing semantic feature extraction based on the spliced coding vector makes the extracted semantic feature vector more accurate and better able to represent the text features in the text image.
The semantic feature extraction may be implemented in a suitable manner, for example, by a neural network model (e.g., RNN model, etc.) capable of performing semantic feature extraction, or by a part (e.g., RNN part of model) of the neural network model capable of performing semantic feature extraction.
Step 104, decoding the semantic feature vector to obtain the corresponding text characters.
For example, the semantic feature vector may be decoded using CTC (Connectionist Temporal Classification) to obtain the corresponding text characters.
CTC can align the input features with the output labels, i.e., the final text characters can be obtained based on the semantic feature vector.
In the embodiment of the application, when text recognition is performed on a text image, on one hand, the extracted image features are further processed by self-attention calculation, so that the obtained feature coding vector can represent the features of the text image more accurately and more specifically; on the other hand, through the character position enhancement processing, character position information can be added on the basis of the extracted image features to accurately express the context relationship between the individual text characters in the text image. Furthermore, semantic feature extraction is performed on the coding vector formed by splicing the feature coding vector and the character position coding vector, so that the extracted semantic feature vector is more accurate and can represent the text features in the text image more effectively. Therefore, the finally decoded characters are more accurate in terms of both character recognition accuracy and character order.
The text recognition method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including but not limited to: servers, PCs, even high performance mobile terminals, etc.
Embodiment Two
Optionally, in an embodiment of the present application, performing self-attention calculation processing based on image features in step 102 to obtain corresponding feature encoding vectors may include:
carrying out full-connection feature extraction processing on the image features to obtain < Q, K, V > triple vectors; and performing self-attention calculation processing based on the triple vectors to obtain corresponding feature encoding vectors.
Since the self-attention calculation processing is mainly implemented based on the <Q, K, V> triplet vector, while the result obtained in step 101 is an image feature, a full-connection feature extraction processing (usually a fully connected operation on the image features) may first be performed to obtain the <Q, K, V> triplet vector, so that the self-attention calculation processing can subsequently be performed based on the triplet vector to obtain the corresponding feature coding vector.
Optionally, in an embodiment of the present application, the performing self-attention calculation processing based on the triplet vector to obtain a corresponding feature encoding vector may include: performing matrix multiplication on the Q vector and the K vector in the triple vector to obtain a first operation result; scaling the first operation result by using a scaling factor, and obtaining a self-attention feature vector according to the processing result; and performing matrix multiplication operation on the self-attention feature vector and the V vector to obtain a feature coding vector.
In the above steps, scaling the first operation result yields smaller values, which reduces the amount of subsequent computation and speeds up the model. The scaling factor may be selected according to the size of the image features and the actual computing capability of the device that executes the text recognition method provided in the embodiment of the present application; the value of the scaling factor is not limited here.
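A minimal sketch of the self-attention calculation described above, assuming a feature dimension of 256 and a scaling factor of 8 (both illustrative values); the class and parameter names are hypothetical.

```python
# Sketch of the self-attention computation: fully connected layers produce the
# <Q, K, V> triplet from the image features, Q·K^T is scaled, passed through
# softmax, and multiplied with V to give the feature coding vector.
import torch
import torch.nn as nn

class SelfAttentionEncoder(nn.Module):
    def __init__(self, dim: int = 256, scale: float = 8.0):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)   # WQ
        self.w_k = nn.Linear(dim, dim)   # WK
        self.w_v = nn.Linear(dim, dim)   # WV
        self.scale = scale

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, seq_len, dim), e.g. one vector per image column
        q, k, v = self.w_q(image_features), self.w_k(image_features), self.w_v(image_features)
        atten = torch.matmul(q, k.transpose(-2, -1))     # first operation result
        atten_s = atten / self.scale                     # scaling keeps values small
        attention_map = torch.softmax(atten_s, dim=-1)   # self-attention feature vector
        attn_map = torch.matmul(attention_map, v)        # feature coding vector
        return attn_map
```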
Based on the scheme of the first embodiment, optionally, in an embodiment of the present application, performing character position enhancement processing based on image features in step 102 to obtain a corresponding character position encoding vector may include:
acquiring position characteristics among text characters in a text image; and obtaining a character position encoding vector based on the position characteristic and the image characteristic.
In one possible approach, after the image features are obtained in step 101, a preset position coding formula is used to generate, based on the size of the image features, position features consistent with that size as the position features between the text characters in the text image.
For example, the preset position encoding formula for obtaining the position feature may be:
PE(pos, 2i) = sin(pos / 10000^(2i/d_classes))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_classes))
where PE denotes the position features between the individual text characters in the text image; pos denotes the position of each text character in the input feature map (the feature-map form of the image features obtained in step 101); d_classes denotes the dimension of the image features; 2i denotes the even-numbered dimensions of PE; and 2i+1 denotes the odd-numbered dimensions of PE.
After the position features among the text characters in the text image are obtained in the above manner, the position features and the corresponding elements in the image features may be fused, such as added, to obtain a character position encoding vector.
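A small sketch of the character position enhancement under the above formula, assuming NumPy arrays and an even feature dimension; the function name and the shapes are illustrative assumptions.

```python
# Sketch of the character position enhancement: sinusoidal position features
# with the same size as the image features are generated and fused (added
# element-wise) with the image features to give the character position coding vector.
import numpy as np

def position_encoding(seq_len: int, d_classes: int) -> np.ndarray:
    pe = np.zeros((seq_len, d_classes), dtype=np.float32)
    pos = np.arange(seq_len)[:, None]                # character positions
    i = np.arange(0, d_classes, 2)[None, :]          # even dimension indices
    angle = pos / np.power(10000.0, i / d_classes)
    pe[:, 0::2] = np.sin(angle)                      # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                      # PE(pos, 2i+1)
    return pe

# Fuse the position features with the image features by element-wise addition.
image_features = np.random.rand(224, 256).astype(np.float32)
char_position_vector = image_features + position_encoding(224, 256)
```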
In the second embodiment of the present application, when text recognition is performed on a text image, on one hand, the extracted image features are further processed by self-attention calculation, so that the obtained feature coding vector can represent the features of the text image more accurately and more specifically; on the other hand, through the character position enhancement processing, character position information can be added on the basis of the extracted image features to accurately express the context relationship between the individual text characters in the text image. Furthermore, semantic feature extraction is performed on the coding vector formed by splicing the feature coding vector and the character position coding vector, so that the extracted semantic feature vector is more accurate and can represent the text features in the text image more effectively. Therefore, the finally decoded characters are more accurate in terms of both character recognition accuracy and character order.
The text recognition method of the present embodiment may be performed by any suitable electronic device having data processing capabilities, including but not limited to: servers, PCs, even high performance mobile terminals, etc.
Embodiment Three
Referring to fig. 2, a flowchart of steps of a text recognition method according to a third embodiment of the present application is shown.
In this embodiment, the text recognition method is performed based on a preset neural network model.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a neural network model provided in an embodiment of the present application, where the neural network model may include: an image feature extraction section; a self-attention part and a position encoding part connected in parallel behind the image feature extraction part; a splice section connected to the self-attention section and the position-coding section; and the semantic feature extraction part is connected with the splicing part.
Wherein:
and the image feature extraction part is used for extracting features of the text image to be recognized and outputting corresponding image features. Alternatively, the image feature extraction section may be realized by CNN.
And the self-attention part is used for performing self-attention calculation processing based on the image features and outputting the corresponding feature coding vector. Alternatively, the self-attention part may be implemented by a self-attention layer; further alternatively, by a multi-head self-attention layer.
And the position coding part is used for performing character position enhancement processing based on the image characteristics and outputting a corresponding character position coding vector. Alternatively, the position encoding part may also be implemented by CNN.
And the splicing part is used for splicing the feature coding vector and the character position coding vector and outputting the spliced coding vector. Alternatively, the splicing part may be implemented using a concatenation layer capable of splicing vectors.
And the semantic feature extraction part is used for extracting semantic features of the spliced coding vectors and outputting the semantic feature vectors. Alternatively, the semantic feature extraction part may be implemented by RNN.
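The following is a compact, illustrative sketch of how the parts in fig. 3 could be assembled; all layer types and sizes (a single convolution, a bidirectional LSTM, the number of classes) are assumptions for the example and are not prescribed by the present application.

```python
# Illustrative assembly of the model in fig. 3: a CNN image feature extraction
# part, a self-attention part and a position encoding part in parallel, a
# splicing (concat) part, and an RNN semantic feature extraction part whose
# output is a per-frame class distribution for CTC decoding.
import torch
import torch.nn as nn

class TextRecognitionModel(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 5000, scale: float = 8.0):
        super().__init__()
        self.cnn = nn.Sequential(                       # image feature extraction part
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),
        )
        self.w_q, self.w_k, self.w_v = (nn.Linear(dim, dim) for _ in range(3))
        self.scale = scale
        self.rnn = nn.LSTM(dim * 2, dim, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(dim * 2, num_classes + 1)   # +1 for the CTC blank

    @staticmethod
    def position_encoding(seq_len: int, dim: int) -> torch.Tensor:
        pos = torch.arange(seq_len).unsqueeze(1).float()
        i = torch.arange(0, dim, 2).float()
        angle = pos / torch.pow(torch.tensor(10000.0), i / dim)
        pe = torch.zeros(seq_len, dim)
        pe[:, 0::2], pe[:, 1::2] = torch.sin(angle), torch.cos(angle)
        return pe

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feat = self.cnn(image).squeeze(2).permute(0, 2, 1)        # (B, W, dim)
        q, k, v = self.w_q(feat), self.w_k(feat), self.w_v(feat)  # self-attention part
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1) @ v
        pos = self.position_encoding(feat.size(1), feat.size(2)).to(feat)
        pos_vec = feat + pos                                      # position encoding part
        spliced = torch.cat([attn, pos_vec], dim=-1)              # splicing part
        semantic, _ = self.rnn(spliced)                           # semantic feature extraction part
        return self.classifier(semantic)                          # per-frame class scores for CTC
```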
Based on this, the text recognition method of the present embodiment includes the following steps:
step 201, inputting a text image to be recognized into a preset neural network model, and performing feature extraction on the text image to be recognized through an image feature extraction part in the neural network model to obtain image features corresponding to the text image to be recognized.
The text image to be recognized is a text image containing a handwritten text. Alternatively, a text image in which the text is entirely handwritten text may be used.
In the embodiment of the present application, the method for acquiring the text image to be recognized is not limited, for example: the method comprises the steps that a text image to be recognized can be obtained in a mode of photographing by a camera of the mobile equipment; the text image to be recognized and the like can also be acquired by means of scanning.
Optionally, in some embodiments, before step 201, the text image to be recognized may be subjected to normalization preprocessing, so that the image feature extraction portion can better extract the image features.
For example:
First, the text image to be recognized may be subjected to size normalization processing so that all text images to be recognized have the same height; see fig. 4, which is a schematic diagram of a text image obtained after size normalization, the text image having a normalized height value. Specifically: an image normalization height Nh may be set first; the image scaling ratio Ratio = Nh / Image_h is then calculated from the actual height Image_h of the text image and Nh; and the text image is scaled in equal proportion based on Ratio to obtain a scaled text image. When the neural network model is trained, training samples of the same size are usually adopted in order to ensure the training effect and improve training efficiency, whereas in practical application the size of the text image to be recognized is not fixed; therefore, the text image to be recognized can first be subjected to size normalization processing so that text images to be recognized have the same height, which leads to a better recognition result.
Secondly, a mask image with a larger preset size may be initialized (the pixel values of all pixels in the mask image are the same), and the scaled text image is then pasted to the center position of the mask image to obtain the final text image to be recognized that can be input into the neural network model. Referring to fig. 5 and fig. 6, fig. 5 is a schematic view of a mask image, and fig. 6 is a schematic view of the text image after the text image shown in fig. 4 is pasted to the center position of the mask image.
The size normalization processing of the text image to be recognized can ensure that the heights of the text images are consistent, but may not make their widths consistent; that is, after size normalization, the text images input to the neural network model may still differ in size. Therefore, a mask image with a preset size can be initialized, and the scaled text image pasted to its center position to form a new text image. Since the new text image has the same size as the mask image, using the pasted image as the input image of the neural network model ensures that the sizes of the text images input to the neural network model are consistent.
In addition, because the mask image is large, after the scaled text image is pasted to the center position of the mask image to form a new text image, the edge information on the left and right sides of the scaled text image no longer lies at the edge of the new text image. This effectively avoids the loss of that edge information during text recognition and improves the text recognition effect.
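A possible sketch of the preprocessing described above; the normalization height, mask width, fill value and the use of PIL are assumptions for illustration.

```python
# Sketch of the preprocessing: scale the text image to a normalized height,
# then paste it into the center of a larger mask image whose pixels all share
# one value, so that every input image has the same size.
from PIL import Image

def preprocess(text_image: Image.Image, nh: int = 32, mask_w: int = 896,
               fill: int = 255) -> Image.Image:
    # Size normalization: Ratio = Nh / Image_h, equal-proportion scaling.
    ratio = nh / text_image.height
    scaled = text_image.resize((max(1, round(text_image.width * ratio)), nh))

    # Mask image with a preset (larger) size and uniform pixel values.
    mask = Image.new("RGB", (mask_w, nh), (fill, fill, fill))

    # Paste the scaled text image at the center of the mask image.
    left = (mask_w - scaled.width) // 2
    mask.paste(scaled, (left, 0))
    return mask
```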
And step 202, performing self-attention calculation processing on the basis of image features through a self-attention part in the neural network model to obtain corresponding feature coding vectors.
Through the self-attention calculation processing, the weight of the feature data which can express the text image in the image features can be increased, so that the obtained feature coding vector can express the features of the text image more accurately and more specifically.
Referring to fig. 7, fig. 7 is a schematic diagram of a data processing flow of a self-attention part in a neural network model provided in the embodiment of the present application, where the processing procedure is as follows:
firstly, obtaining image features extracted by an image feature extraction part in a neural network model, and carrying out full connection feature extraction processing on the image features by a self-attention part in the neural network model to obtain < Q, K, V > triple vectors.
Specifically: the image features obtained by the image feature extraction part are respectively input into three single-layer neural network layers (fully connected layers) WQ, WK and WV of the self-attention part, so that WQ, WK and WV each perform full-connection feature extraction processing (a linear transformation) on the image features and output a Q vector, a K vector and a V vector respectively, i.e., the <Q, K, V> triplet vector is obtained;
and secondly, performing self-attention calculation processing on the self-attention part in the neural network model based on the triple vector to obtain a corresponding feature coding vector.
Specifically: matrix multiplication is performed on the Q vector and the K vector to obtain atten (i.e., the first operation result); in order to reduce the amount of data computation and improve the convergence speed of the model, atten may be scaled to obtain atten_s; a softmax operation is performed on atten_s to obtain the attention map (i.e., the self-attention feature vector); and matrix multiplication is performed on the attention map and the V vector to obtain attn_map (i.e., the feature coding vector).
And step 203, carrying out character position enhancement processing on the basis of image features through a position coding part in the neural network model to obtain a corresponding character position coding vector.
The character position enhancement processing adds character position information on the basis of the image features, so that the resulting character position coding vector can not only represent the image features of the text image but also carry the position information of each text character.
Optionally, in one embodiment, the position features between the text characters in the text image can be obtained by a position coding part in the neural network model; and obtaining a character position encoding vector based on the position characteristic and the image characteristic.
In one possible approach, after the image feature is obtained in step 201, a preset position coding formula is used to generate a position feature that is consistent with the size of the image feature as a position feature between text characters in the text image based on the size of the image feature.
For example, the preset position encoding formula may be:
PE(pos, 2i) = sin(pos / 10000^(2i/d_classes))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_classes))
where PE denotes the position features between the individual text characters in the text image; pos denotes the position of each text character in the input feature map (the feature-map form of the image features); d_classes denotes the dimension of the image features; 2i denotes the even-numbered dimensions of PE; and 2i+1 denotes the odd-numbered dimensions of PE.
After the position features among the text characters in the text image are obtained in the above manner, the position features and the corresponding elements in the image features can be added to obtain a character position encoding vector.
It should be noted that this embodiment takes executing step 202 first and then step 203 as an example; however, it should be understood by those skilled in the art that, in practical application, steps 202 and 203 may be executed in any order or in parallel.
Step 204, splicing the feature coding vector and the character position coding vector through the splicing part in the neural network model to obtain the spliced coding vector.
Because the feature coding vector obtained in step 202 can represent the features of the text image more accurately and more specifically, while the character position coding vector obtained in step 203 can both represent the image features of the text image and carry the position information of each text character, splicing the feature coding vector and the character position coding vector yields a spliced coding vector that not only contains the character position information but also represents the features of the text image more accurately and more specifically.
In this embodiment, the splicing of the two vectors is realized in a concat manner.
And step 205, performing semantic feature extraction on the spliced coding vectors through a semantic feature extraction part in the neural network model to obtain semantic feature vectors.
The spliced encoding vector obtained in step 204 not only contains character position information, but also can more accurately and more specifically represent the characteristics of the text image. Therefore, semantic feature extraction is performed based on the spliced encoding vector, so that the extracted semantic feature vector is more accurate, and text features in the text image can be represented more effectively.
Optionally, in some embodiments, the neural network model may be trained before step 201. Specifically, the neural network model may be trained using a first training sample set and a second training sample set, where the first training sample set may be used to pre-train the neural network model, and the second training sample set may be used to fine-tune the neural network model to complete training of the neural network model.
In the embodiment of the present application, the types of texts in the training samples included in the first training sample set and the second training sample set may be set by those skilled in the art according to actual requirements. In this embodiment, the first training sample set uses a printed text training set, and the second training sample set uses a handwritten text training set.
Optionally, in some embodiments, training the neural network model using the first training sample set and the second training sample set may include:
training a first neural network model by using a first training sample set to obtain a pre-training model, wherein training samples in the first training sample set are printed text image samples; obtaining model parameters of an image feature extraction part in a pre-training model after training; initializing model parameters of the image feature extraction part in the second neural network model using the model parameters of the image feature extraction part; and training the second neural network model after the model parameter initialization operation of the image feature extraction part is carried out by using a second training sample set, wherein the training samples in the second training sample set are handwritten text image samples.
Specifically, the first neural network model and the second neural network model have the same network structure, and both include: an image feature extraction section; a self-attention part and a position encoding part connected in parallel behind the image feature extraction part; a splice section connected to the self-attention section and the position-coding section; and the semantic feature extraction part is connected with the splicing part. In addition, the initial model parameters of the first neural network model and the second neural network model can be completely the same; or may be completely different; the components may be partially the same or different, and are not limited in the embodiments of the present application.
In the training process, the neural network model is pre-trained by printing the text image sample, then the model parameters of the image feature extraction part in the pre-trained model are used as the initial model parameters of the image feature extraction part in the second neural network model, and the training of the second neural network model is carried out based on the handwritten text image sample. Therefore, when the handwritten text image sample is adopted to train the second neural network model, on the basis of training other parts in the second neural network model, only model parameters of the image feature extraction part in the second neural network model need to be adjusted slightly, and the training of the whole model can be completed. Therefore, the training process can effectively reduce the calculated amount of model training, improve the convergence rate and improve the model training efficiency.
In addition, because the training samples in the first training sample set are printed text image samples, which are well normalized, in some optional embodiments of the present application labeled printed text image samples can be generated automatically by an open-source text annotation tool. Compared with manual labeling, the labeling process of the training samples can be completed automatically, so pre-training the neural network model with the automatically labeled first training sample set further saves labor cost and improves training efficiency.
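A minimal sketch of this two-stage training, assuming PyTorch modules; the stand-in model builder and the submodule name "cnn" are hypothetical and only illustrate copying the image feature extraction parameters from the pre-training model into the second model.

```python
# Sketch of the two-stage training: pre-train a first model on printed text
# samples, copy only the image feature extraction (CNN) parameters into a
# second model, then fine-tune the second model on handwritten text samples.
import torch.nn as nn

def build_model() -> nn.Module:
    # Stand-in for the full recognition model; only the "cnn" submodule matters here.
    return nn.ModuleDict({
        "cnn": nn.Sequential(nn.Conv2d(3, 256, 3, padding=1), nn.ReLU()),
        "head": nn.Linear(256, 5001),
    })

pre_model = build_model()                      # first neural network model
# ... pre-train pre_model on the first (printed text) training sample set ...

second_model = build_model()                   # second neural network model
# Initialize the image feature extraction part with the pre-trained parameters.
second_model["cnn"].load_state_dict(pre_model["cnn"].state_dict())
# ... train second_model on the second (handwritten text) training sample set,
#     only slightly adjusting the CNN parameters ...
```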
Step 206, decoding the semantic feature vector to obtain the corresponding text character.
For example, the semantic feature vector may be decoded using the CTC algorithm to obtain the corresponding text characters. The specific decoding process may be: starting from the first frame (column) of the semantic feature vector (a probability distribution matrix), for each frame, the category corresponding to the maximum probability value in that frame is taken, and these categories are combined into a character string; finally, for the character string, a greedy method, a beam search method or a prefix beam search method is adopted to obtain the optimal text recognition result, i.e., the corresponding text characters.
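A small sketch of greedy CTC decoding as just described (per-frame argmax followed by the B transformation that collapses repeats and removes blanks); the blank index, character set and matrix shapes are assumptions, and beam search / prefix beam search are not shown.

```python
# Greedy CTC decoding: take the most probable class per frame (column) of the
# probability matrix, then collapse repeated classes and remove blanks.
import numpy as np

def ctc_greedy_decode(prob_matrix: np.ndarray, charset: list[str], blank: int = 0) -> str:
    # prob_matrix: (T, N+1), one row per frame, one column per class incl. blank.
    best = prob_matrix.argmax(axis=1)        # class with max probability per frame
    decoded, prev = [], blank
    for cls in best:
        if cls != blank and cls != prev:     # B transformation: collapse repeats, drop blanks
            decoded.append(charset[cls - 1])
        prev = cls
    return "".join(decoded)

charset = list("abcde")                       # illustrative character set
probs = np.random.rand(227, len(charset) + 1) # T = 227 frames, N+1 classes
print(ctc_greedy_decode(probs, charset))
```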
The text recognition method provided by the embodiment of the application and an existing text recognition method were each tested on a collected data set of handwritten Chinese long texts, with the following results: with the text recognition method provided by the embodiment of the application, the accuracy of text recognition reaches 95%, while the accuracy of the existing text recognition method is 84%. In addition, the text recognition method provided by the embodiment of the application performs better on long text sequences, strongly semantic sequences, blurred fonts and distorted texts.
Referring to fig. 8, fig. 8 is a comparison of text recognition effects, specifically a comparison of the recognition results obtained with the text recognition method provided by the embodiment of the application (abbreviated as "the present application" in the figure) and with the existing text recognition method (abbreviated as "the existing recognition method" in the figure). For the first picture in fig. 8, the recognition result of the present application is completely correct, while the existing recognition method makes several errors: misrecognizing "limit" as "leg", omitting the word "fine", and misrecognizing "world" as "also". For the second picture in fig. 8, although neither method obtains a completely correct recognition result, the accuracy of the present application is higher than that of the existing recognition method. For the third picture in fig. 8, the recognition result of the present application is completely correct, while the existing recognition method misrecognizes "bath" as "huge".
In summary, compared with the existing text recognition method, the text recognition method provided by the embodiment of the application has obviously higher recognition accuracy.
In the embodiment of the application, when text recognition is performed on a text image, on one hand, the extracted image features are further processed by self-attention calculation, so that the obtained feature coding vector can represent the features of the text image more accurately and more specifically; on the other hand, through the character position enhancement processing, character position information can be added on the basis of the extracted image features to accurately express the context relationship between the individual text characters in the text image. Furthermore, semantic feature extraction is performed on the coding vector formed by splicing the feature coding vector and the character position coding vector, so that the extracted semantic feature vector is more accurate and can represent the text features in the text image more effectively. Therefore, the finally decoded characters are more accurate in terms of both character recognition accuracy and character order.
The text recognition method provided by the embodiment of the present application may be executed by any suitable device with data processing capability, including but not limited to: a terminal, a mobile terminal, a PC, a server and the like.
Referring to fig. 9, fig. 9 is a schematic diagram of a text recognition process according to a third embodiment of the present application; the following briefly describes, with reference to fig. 9, a text recognition process provided in the third embodiment of the present application, which mainly includes:
the first step is as follows: and collecting a text image and carrying out size normalization. Specifically, the texts in the text image may be all handwritten texts; the image is scaled based on the scaling ratio to obtain an image with a normalized height value.
Second, Mask preprocessing is performed on the image. Specifically: the normalized text image is pasted onto a mask image with a larger preset size, so as to ensure that the sizes of the text images input to the neural network model are consistent.
Third, feature encoding is performed by the CNN in combination with the self-attention mechanism. That is: the image feature extraction part (for example, a CNN) in the neural network model performs feature extraction on the input text image, and the self-attention part connected behind the image feature extraction part performs self-attention calculation processing to obtain the feature coding vector.
Specifically, the self-attention part behind the image feature extraction part performs self-attention calculation processing, and the specific process of obtaining the feature coding vector may include: 1. obtaining the image features extracted by the image feature extraction part in the neural network model; 2. the image features obtained by the image feature extraction part are respectively input into three single-layer neural network layers (fully connected layers) WQ, WK and WV, so that WQ, WK and WV each perform full-connection feature extraction processing (a linear transformation) on the image features and output a Q vector, a K vector and a V vector respectively; 3. matrix multiplication is performed on the Q vector and the K vector to obtain atten (i.e., the first operation result), and atten is scaled to obtain atten_s; 4. a softmax operation is performed on atten_s to obtain the attention map (i.e., the self-attention feature vector), and matrix multiplication is performed on the attention map and the V vector to obtain attn_map (i.e., the feature coding vector).
For example, the feature map obtained by the last convolution layer of the CNN part (the image feature extraction part) has a height H = 1 and channels = 256. The feature map is passed through Wq, Wk and Wv respectively to obtain the features Q, K and V, where Wq, Wk and Wv are three fully connected layers with in_features = 256 and out_features = 256. Matrix multiplication is performed on Q and K to obtain the feature atten, which is divided by a scaling factor Scale = 8 so that the inner product obtained after the matrix multiplication is not too large, which facilitates convergence in the model training stage and fast operation in the testing stage. A softmax operation is performed on the feature atten to obtain the attention feature map, and matrix multiplication is performed on the attention feature map and V to obtain the feature map attn_map (i.e., the feature coding vector) with enhanced inter-character semantic information.
Fourth, enhanced character position coding. Specifically: a position feature consistent with the size of the image features is constructed, based on that size, as the position feature between the text characters in the text image; and the character position coding vector is obtained based on the position feature and the image features.
In the text recognition task, the contour features and semantic features of the text image are important, and so is the position information of the characters. The contour features of the text are extracted by the CNN in the above steps, and the semantic information between characters is extracted by self-attention; to better represent the position features of the characters in the text sequence, position embedding is introduced, and the position features are added in the form of position codes. The way of adding the position codes may be: construct a matrix consistent with the attn_map dimension obtained by self-attention, then perform a concat operation with attn_map, and take the output as the input of the RNN decoder, where the position coding formula may adopt the position coding formula in the above embodiment to obtain PE.
It should be noted that this flow takes executing the third step first and then the fourth step as an example; however, it should be understood by those skilled in the art that, in practical application, the third and fourth steps may be executed in any order or in parallel.
And fifthly, splicing the coding vectors and extracting semantic features. Specifically, the method comprises the following steps: after the characteristic coding vector is obtained in the third step and the character position coding vector is obtained in the fourth step, the splicing part of the neural network model splices the characteristic coding vector and the character position coding vector to obtain a spliced coding vector; and the semantic feature extraction part of the neural network model extracts semantic features of the spliced coding vectors to obtain semantic feature vectors.
Sixth, fine-tuning of the improved neural network model is carried out.
Specifically, the method comprises the following steps: training a first neural network model by using a first training sample set to obtain a pre-training model, wherein training samples in the first training sample set are printed text image samples; obtaining model parameters of an image feature extraction part in a pre-training model after training; initializing model parameters of the image feature extraction part in the second neural network model using the model parameters of the image feature extraction part; and training the second neural network model after the model parameter initialization operation of the image feature extraction part is carried out by using a second training sample set, wherein the training samples in the second training sample set are handwritten text image samples.
For example, a neural network model fusing self-attention and position embedding (e.g., a modified CRNN model fusing self-attention and position embedding) is trained using open-source printed Chinese text data to obtain pre_model. The parameters of the CNN part preceding the self-attention part of the feature encoder in pre_model are imported into the improved CRNN model to be trained, which is then trained using the prepared handwritten Chinese text data until training completes. It should be noted that this is a step of the training phase; as those skilled in the art will understand, it need not be performed in the testing phase.
Seventh, CTC decoding is performed to obtain the optimal recognition result of the text sequence. The semantic feature vector may be decoded using the CTC algorithm to obtain the corresponding text characters. The specific decoding process may be: starting from the first frame (column) of the semantic feature vector (a probability distribution matrix), for each frame, the category corresponding to the maximum probability value in that frame is taken, and these categories are combined into a character string; finally, for the character string, a greedy method, a beam search method or a prefix beam search method is adopted to obtain the optimal text recognition result, i.e., the corresponding text characters.
For example, in the testing stage, a handwritten text image subjected to Mask preprocessing is fed into the improved CRNN model, and a probability matrix Pro is obtained through CTC decoding. The size of the probability matrix is as follows: the height equals the blank category plus the number N of character categories to be recognized, so Pro_h = N + 1, and the width equals the width of the feature map obtained by convolving the handwritten text image through the CRNN model. Assuming the handwritten text image has a width of 896, the width of the probability matrix obtained by CTC decoding is T = 227, i.e., Pro_w = 227.
An argmax() operation is performed on the probability matrix, specifically as follows: taking the text in the handwritten text image as "low-head thinking and deceiving village" as an example, starting from the first image frame, the category corresponding to the maximum probability value of each image frame is taken; assume the category corresponding to the maximum probability value of the first image frame is "low", that of the second image frame is "head", and so on, and that of the last image frame is "-". The character string result is then "low-head-, thinking-village-" (the character string length equals the probability matrix width T), and the optimal recognition result "low-head thinking and deceiving village" is obtained through the B transformation in the CTC algorithm.
Embodiment Four
Referring to fig. 10, fig. 10 is a schematic structural diagram of a text recognition apparatus in the fourth embodiment of the present application.
The text recognition device provided by the embodiment of the application comprises:
an image feature obtaining module 1001, configured to perform feature extraction on a text image to be recognized, and obtain corresponding image features;
a code vector obtaining module 1002, configured to perform self-attention calculation processing based on image features to obtain corresponding feature code vectors; performing character position enhancement processing based on the image characteristics to obtain corresponding character position coding vectors;
a semantic feature vector obtaining module 1003, configured to splice the feature coding vector and the character position coding vector, and perform semantic feature extraction on the spliced coding vector to obtain a semantic feature vector;
a text character obtaining module 1004, configured to decode the semantic feature vector to obtain a corresponding text character.
Optionally, in an embodiment of the present application, when the code vector obtaining module 1002 performs a self-attention calculation process based on image features to obtain a corresponding feature code vector, the method specifically includes: a triple vector obtaining unit and a feature coding vector obtaining unit;
the triplet vector obtaining unit is used for carrying out full-connection feature extraction processing on the image features to obtain < Q, K, V > triplet vectors;
and the feature coding vector obtaining unit is used for performing self-attention calculation processing on the basis of the triple vector to obtain a corresponding feature coding vector.
Optionally, in an embodiment of the present application, the feature encoding vector obtaining unit is specifically configured to:
performing matrix multiplication on the Q vector and the K vector in the triple vector to obtain a first operation result;
scaling the first operation result by using a scaling factor, and obtaining a self-attention feature vector according to the processing result;
and performing matrix multiplication operation on the self-attention feature vector and the V vector to obtain a feature coding vector.
Optionally, in an embodiment of the present application, when executing the step of performing character position enhancement processing based on the image features to obtain the corresponding character position coding vector, the coding vector obtaining module 1002 is specifically configured to:
acquiring position characteristics among text characters in a text image;
and obtaining a character position encoding vector based on the position characteristic and the image characteristic.
Optionally, in an embodiment of the present application, the method is performed based on a preset neural network model;
the neural network model comprises an image feature extraction part, a self-attention part and a position coding part which are connected behind the image feature extraction part in parallel, a splicing part connected with the self-attention part and the position coding part, and a semantic feature extraction part connected with the splicing part;
wherein:
the image feature extraction part is used for extracting features of the text image to be recognized and outputting corresponding image features;
the self-attention part is used for carrying out self-attention calculation processing based on image features and outputting corresponding feature coding vectors;
a position encoding part for performing character position enhancement processing based on the image characteristics and outputting a corresponding character position encoding vector;
the splicing part is used for splicing the characteristic coding vector and the character position coding vector and outputting the spliced coding vector;
and the semantic feature extraction part is used for extracting semantic features of the spliced coding vectors and outputting the semantic feature vectors.
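As an illustration of how the parts enumerated above could fit together, the following sketch assembles an image feature extraction part, the self-attention part and the position coding part connected in parallel behind it, a splicing (concatenation) part, and a semantic feature extraction part. It reuses the SelfAttentionEncoder and PositionEncoder sketches above; the convolutional backbone, the BiLSTM used for semantic feature extraction, and all dimensions are assumptions, since this application names the parts but does not fix their internals.

import torch
import torch.nn as nn

class TextRecognitionModel(nn.Module):
    def __init__(self, feat_dim=512, num_classes=1000):
        super().__init__()
        self.backbone = nn.Sequential(                 # image feature extraction part (assumed CNN)
            nn.Conv2d(1, feat_dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d((1, None)),           # collapse height into a frame sequence
        )
        self.self_attention = SelfAttentionEncoder(feat_dim)   # self-attention part (sketch above)
        self.position_encoder = PositionEncoder(feat_dim)      # position coding part (sketch above)
        self.semantic = nn.LSTM(2 * feat_dim, feat_dim,
                                bidirectional=True, batch_first=True)  # semantic feature extraction part (assumed BiLSTM)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, image):                          # (batch, 1, H, W)
        feats = self.backbone(image)                   # (batch, feat_dim, 1, T)
        feats = feats.squeeze(2).permute(0, 2, 1)      # (batch, T, feat_dim) image features
        encoded = self.self_attention(feats)           # feature coding vector
        pos_encoded = self.position_encoder(feats)     # character position coding vector
        spliced = torch.cat([encoded, pos_encoded], dim=-1)   # splicing part
        semantic, _ = self.semantic(spliced)           # semantic feature vectors
        return self.classifier(semantic)               # per-frame class scores for decoding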
Optionally, in an embodiment of the present application, the apparatus further includes: a model training module;
a model training module to:
obtaining model parameters of an image feature extraction part in a pre-training model after training;
initializing model parameters of the image feature extraction part in the second neural network model using the model parameters of the image feature extraction part;
and training the second neural network model after the model parameter initialization operation of the image feature extraction part is carried out by using a second training sample set, wherein the training samples in the second training sample set are handwritten text image samples.
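The two-stage training handled by the model training module can be sketched as follows: only the image feature extraction part (the backbone in the TextRecognitionModel sketch above) receives the pre-trained parameters before fine-tuning on handwritten samples. The function name and the outlined procedure are assumptions for illustration.

def fine_tune_from_pretrained(pretrained: TextRecognitionModel) -> TextRecognitionModel:
    """Build the second neural network model and initialize its image feature extraction part."""
    second_model = TextRecognitionModel()
    # Copy only the model parameters of the image feature extraction part.
    second_model.backbone.load_state_dict(pretrained.backbone.state_dict())
    return second_model

# Hypothetical usage:
# 1) train a first TextRecognitionModel on the first training sample set (printed text image samples);
# 2) call fine_tune_from_pretrained() on the resulting pre-training model;
# 3) train the returned second model on the second training sample set (handwritten text image samples).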
The text recognition apparatus of this embodiment is used to implement the corresponding text recognition method in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again. In addition, the functional implementation of each module in the text recognition apparatus of this embodiment can refer to the description of the corresponding part in the foregoing method embodiment, and is not repeated here.
Example five,
Referring to fig. 11, fig. 11 is a schematic structural diagram of an electronic device in the fifth embodiment of the present application; the electronic device may include:
one or more processors 1101;
a computer-readable medium 1102, which may be configured to store one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the text recognition method as in the above embodiments one to three.
Example six,
Referring to fig. 12, fig. 12 is a hardware structure of an electronic device according to a sixth embodiment of the present application; as shown in fig. 12, the hardware structure of the electronic device may include: a processor 1201, a communication interface 1202, a computer readable medium 1203, and a communication bus 1204;
wherein the processor 1201, the communication interface 1202, and the computer readable medium 1203 are in communication with each other via a communication bus 1204;
Optionally, the communication interface 1202 may be an interface of a communication module, such as an interface of a GSM module;
the processor 1201 may be specifically configured to: performing feature extraction on a text image to be recognized to obtain corresponding image features; performing self-attention calculation processing based on image features to obtain corresponding feature coding vectors; performing character position enhancement processing based on the image characteristics to obtain corresponding character position coding vectors; splicing the feature coding vector and the character position coding vector, and extracting semantic features of the spliced coding vector to obtain a semantic feature vector; and decoding the semantic feature vector to obtain a corresponding text character.
The Processor 1201 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by such a processor. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The computer-readable medium 1203 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
In particular, according to an embodiment of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code configured to perform the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application.

It should be noted that the computer readable medium of the present application may be a computer readable signal medium, a computer readable storage medium, or any combination of the two. The computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium, by contrast, may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code configured to carry out the operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions configured to implement the specified logical function(s). In the above embodiments, specific precedence relationships are provided, but these precedence relationships are only exemplary, and in particular implementations, the steps may be fewer, more, or the execution order may be modified. That is, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present application may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes an image feature obtaining module, a coding vector obtaining module, a semantic feature vector obtaining module, and a text character obtaining module. The names of the modules do not limit the modules themselves in some cases, for example, the image feature obtaining module may also be described as a module that performs feature extraction on a text image to be recognized to obtain corresponding image features.
As another aspect, the present application also provides a computer-readable medium on which a computer program is stored, which, when executed by a processor, implements the text recognition method as described in the above embodiments one to three.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be present separately and not assembled into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: performing feature extraction on a text image to be recognized to obtain corresponding image features; performing self-attention calculation processing based on image features to obtain corresponding feature coding vectors; performing character position enhancement processing based on the image characteristics to obtain corresponding character position coding vectors; splicing the feature coding vector and the character position coding vector, and extracting semantic features of the spliced coding vector to obtain a semantic feature vector; and decoding the semantic feature vector to obtain a corresponding text character.
The expressions "first", "second", "first" or "second" used in various embodiments of the present disclosure may modify various components regardless of order and/or importance, but these expressions do not limit the respective components. The above description is only configured for the purpose of distinguishing elements from other elements. For example, the first user equipment and the second user equipment represent different user equipment, although both are user equipment. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure.
When an element (e.g., a first element) is referred to as being "(operatively or communicatively) coupled" or "connected" to another element (e.g., a second element), it should be understood that the element may be directly connected to the other element or may be indirectly connected to the other element via yet another element (e.g., a third element). In contrast, when an element (e.g., a first element) is referred to as being "directly connected" or "directly coupled" to another element (e.g., a second element), no further element (e.g., a third element) is interposed therebetween.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A text recognition method, comprising:
performing feature extraction on a text image to be recognized to obtain corresponding image features;
performing self-attention calculation processing based on the image features to obtain corresponding feature coding vectors; performing character position enhancement processing based on the image characteristics to obtain corresponding character position coding vectors;
splicing the feature coding vector and the character position coding vector, and extracting semantic features of the spliced coding vector to obtain a semantic feature vector;
and decoding the semantic feature vector to obtain a corresponding text character.
2. The method according to claim 1, wherein the performing a self-attention calculation process based on the image features to obtain corresponding feature encoding vectors comprises:
carrying out full-connection feature extraction processing on the image features to obtain < Q, K, V > triple vectors;
and performing self-attention calculation processing based on the triple vector to obtain a corresponding feature coding vector.
3. The method of claim 2, wherein the performing a self-attention computing process based on the triplet vectors to obtain corresponding feature-encoded vectors comprises:
performing matrix multiplication on the Q vector and the K vector in the triple vector to obtain a first operation result;
scaling the first operation result by using a scaling factor, and obtaining a self-attention feature vector according to a processing result;
and carrying out matrix multiplication operation on the self-attention feature vector and the V vector to obtain the feature coding vector.
4. The method according to claim 1, wherein said performing character position enhancement processing based on said image features to obtain corresponding character position encoding vectors comprises:
acquiring position characteristics among all text characters in the text image;
and obtaining the character position coding vector based on the position characteristic and the image characteristic.
5. The method according to any one of claims 1-4, wherein the method is performed based on a preset neural network model;
the neural network model comprises an image feature extraction part, a self-attention part and a position coding part which are connected behind the image feature extraction part in parallel, a splicing part connected with the self-attention part and the position coding part, and a semantic feature extraction part connected with the splicing part;
wherein:
the image feature extraction part is used for extracting features of the text image to be recognized and outputting corresponding image features;
the self-attention part is used for carrying out self-attention calculation processing based on the image characteristics and outputting corresponding characteristic coding vectors;
the position coding part is used for performing character position enhancement processing based on the image characteristics and outputting corresponding character position coding vectors;
the splicing part is used for splicing the characteristic coding vector and the character position coding vector and outputting a spliced coding vector;
and the semantic feature extraction part is used for extracting semantic features of the spliced coding vectors and outputting the semantic feature vectors.
6. The method of claim 5, further comprising:
before feature extraction is carried out on a text image to be recognized through the neural network model to obtain corresponding image features, the neural network model is trained through a first training sample set and a second training sample set.
7. The method of claim 6, wherein training the neural network model using the first set of training samples and the second set of training samples comprises:
training a first neural network model by using a first training sample set to obtain a pre-training model, wherein training samples in the first training sample set are printed text image samples;
obtaining model parameters of an image feature extraction part in the pre-training model after training;
initializing model parameters of an image feature extraction part in a second neural network model using the model parameters of the image feature extraction part;
and training a second neural network model after the model parameter initialization operation of the image feature extraction part is performed by using a second training sample set, wherein the training samples in the second training sample set are handwritten text image samples.
8. The method according to claim 1, wherein the text image to be recognized is a text image containing handwritten text.
9. An electronic device, characterized in that the device comprises:
one or more processors;
a computer readable medium configured to store one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text recognition method of any of claims 1-8.
10. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the text recognition method according to any one of claims 1 to 8.
CN202110238093.1A 2021-03-04 2021-03-04 Text recognition method, electronic device and computer readable medium Pending CN112633290A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110238093.1A CN112633290A (en) 2021-03-04 2021-03-04 Text recognition method, electronic device and computer readable medium

Publications (1)

Publication Number Publication Date
CN112633290A true CN112633290A (en) 2021-04-09

Family

ID=75295571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110238093.1A Pending CN112633290A (en) 2021-03-04 2021-03-04 Text recognition method, electronic device and computer readable medium

Country Status (1)

Country Link
CN (1) CN112633290A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190279035A1 (en) * 2016-04-11 2019-09-12 A2Ia S.A.S. Systems and methods for recognizing characters in digitized documents
CN106845358A (en) * 2016-12-26 2017-06-13 苏州大学 A kind of method and system of handwritten character characteristics of image identification
CN111079374A (en) * 2019-12-06 2020-04-28 腾讯科技(深圳)有限公司 Font generation method, device and storage medium
CN111160343A (en) * 2019-12-31 2020-05-15 华南理工大学 Off-line mathematical formula symbol identification method based on Self-Attention
CN111767889A (en) * 2020-07-08 2020-10-13 北京世纪好未来教育科技有限公司 Formula recognition method, electronic device and computer readable medium
CN112434552A (en) * 2020-10-13 2021-03-02 广州视源电子科技股份有限公司 Neural network model adjusting method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen Pengda et al.: "Research and Implementation of a Commodity Recommendation System Based on Deep Learning", China Excellent Master's Theses Full-text Database (Master's), Information Science and Technology Series *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022236959A1 (en) * 2021-05-12 2022-11-17 腾讯云计算(北京)有限责任公司 Image data processing method and apparatus, and device and storage medium
CN113436621A (en) * 2021-06-01 2021-09-24 深圳市北科瑞声科技股份有限公司 GPU (graphics processing Unit) -based voice recognition method and device, electronic equipment and storage medium
CN113436621B (en) * 2021-06-01 2022-03-15 深圳市北科瑞声科技股份有限公司 GPU (graphics processing Unit) -based voice recognition method and device, electronic equipment and storage medium
CN113836866A (en) * 2021-06-04 2021-12-24 腾讯科技(深圳)有限公司 Text coding method and device, computer readable medium and electronic equipment
CN113836866B (en) * 2021-06-04 2024-05-24 腾讯科技(深圳)有限公司 Text encoding method, text encoding device, computer readable medium and electronic equipment
CN113255668A (en) * 2021-06-22 2021-08-13 北京世纪好未来教育科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN113344014A (en) * 2021-08-03 2021-09-03 北京世纪好未来教育科技有限公司 Text recognition method and device
CN113610081A (en) * 2021-08-12 2021-11-05 北京有竹居网络技术有限公司 Character recognition method and related equipment thereof
CN113657395A (en) * 2021-08-17 2021-11-16 北京百度网讯科技有限公司 Text recognition method, and training method and device of visual feature extraction model
CN113657395B (en) * 2021-08-17 2024-02-13 北京百度网讯科技有限公司 Text recognition method, training method and device for visual feature extraction model
CN113657399B (en) * 2021-08-18 2022-09-27 北京百度网讯科技有限公司 Training method of character recognition model, character recognition method and device
CN113657399A (en) * 2021-08-18 2021-11-16 北京百度网讯科技有限公司 Training method of character recognition model, character recognition method and device
CN113792741A (en) * 2021-09-17 2021-12-14 平安普惠企业管理有限公司 Character recognition method, device, equipment and storage medium
CN113792741B (en) * 2021-09-17 2023-08-11 平安普惠企业管理有限公司 Character recognition method, device, equipment and storage medium
CN114283411A (en) * 2021-12-20 2022-04-05 北京百度网讯科技有限公司 Text recognition method, and training method and device of text recognition model
CN114283411B (en) * 2021-12-20 2022-11-15 北京百度网讯科技有限公司 Text recognition method, and training method and device of text recognition model
CN114417856A (en) * 2021-12-29 2022-04-29 北京百度网讯科技有限公司 Text sparse coding method and device and electronic equipment
CN114495101A (en) * 2022-01-12 2022-05-13 北京百度网讯科技有限公司 Text detection method, and training method and device of text detection network
CN114154493A (en) * 2022-01-28 2022-03-08 北京芯盾时代科技有限公司 Short message category identification method and device
CN114462580A (en) * 2022-02-10 2022-05-10 腾讯科技(深圳)有限公司 Training method of text recognition model, text recognition method, device and equipment
WO2023197512A1 (en) * 2022-04-11 2023-10-19 苏州浪潮智能科技有限公司 Text error correction method and apparatus, and electronic device and medium
CN114973224A (en) * 2022-04-12 2022-08-30 北京百度网讯科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN114528989A (en) * 2022-04-24 2022-05-24 深圳比特微电子科技有限公司 Attention mechanism activation function acceleration method and device and attention mechanism circuit
WO2023221293A1 (en) * 2022-05-17 2023-11-23 深圳前海环融联易信息科技服务有限公司 Image processing-based document information extraction method and apparatus, device and medium
CN117640947A (en) * 2024-01-24 2024-03-01 羚客(杭州)网络技术有限公司 Video image encoding method, article searching method, electronic device, and medium
CN117640947B (en) * 2024-01-24 2024-05-10 羚客(杭州)网络技术有限公司 Video image encoding method, article searching method, electronic device, and medium
CN117912005A (en) * 2024-03-19 2024-04-19 中国科学技术大学 Text recognition method, system, device and medium using single mark decoding

Similar Documents

Publication Publication Date Title
CN112633290A (en) Text recognition method, electronic device and computer readable medium
CN111160343B (en) Off-line mathematical formula symbol identification method based on Self-Attention
CN113313022B (en) Training method of character recognition model and method for recognizing characters in image
JP2024526065A Method and apparatus for recognizing text
KR20210080291A (en) Method, electronic device, and storage medium for recognizing license plate
CN110188762B (en) Chinese-English mixed merchant store name identification method, system, equipment and medium
CN112070114B (en) Scene character recognition method and system based on Gaussian constraint attention mechanism network
CN112686219B (en) Handwritten text recognition method and computer storage medium
CN110084172A (en) Character recognition method, device and electronic equipment
JP7174812B2 (en) Querying semantic data from unstructured documents
CN116311279A (en) Sample image generation, model training and character recognition methods, equipment and media
CN114429636B (en) Image scanning identification method and device and electronic equipment
CN112183542A (en) Text image-based recognition method, device, equipment and medium
CN111767889A (en) Formula recognition method, electronic device and computer readable medium
CN115050002A (en) Image annotation model training method and device, electronic equipment and storage medium
CN115100659A (en) Text recognition method and device, electronic equipment and storage medium
CN110991303A (en) Method and device for positioning text in image and electronic equipment
CN110070042A (en) Character recognition method, device and electronic equipment
CN113191355A (en) Text image synthesis method, device, equipment and storage medium
CN113537187A (en) Text recognition method and device, electronic equipment and readable storage medium
CN110852102B (en) Chinese part-of-speech tagging method and device, storage medium and electronic equipment
CN117253239A (en) End-to-end document image translation method and device integrating layout information
CN115130437B (en) Intelligent document filling method and device and storage medium
CN116798044A (en) Text recognition method and device and electronic equipment
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination