CN115273103A - Text recognition method and device, electronic equipment and storage medium - Google Patents

Text recognition method and device, electronic equipment and storage medium

Info

Publication number
CN115273103A
Authority
CN
China
Prior art keywords
text
image
correct
font
wrong
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210898184.2A
Other languages
Chinese (zh)
Inventor
秦勇 (Qin Yong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xingtong Technology Co ltd
Original Assignee
Shenzhen Xingtong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xingtong Technology Co ltd filed Critical Shenzhen Xingtong Technology Co ltd
Priority to CN202210898184.2A priority Critical patent/CN115273103A/en
Publication of CN115273103A publication Critical patent/CN115273103A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/22 Character recognition characterised by the type of writing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/772 Determining representative reference patterns, e.g. averaging or distorting patterns; Generating dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/19007 Matching; Proximity measures
    • G06V30/19093 Proximity measures, i.e. similarity or distance measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1914 Determining representative reference patterns, e.g. averaging or distorting patterns; Generating dictionaries, e.g. user dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Character Discrimination (AREA)

Abstract

The present disclosure provides a text recognition method and apparatus, an electronic device, and a storage medium, and belongs to the field of image processing. The method includes: acquiring a text image to be recognized; processing the text image to be recognized based on a first text recognition unit of a text recognition model to determine a correctness probability for at least one written character in the image; when it is determined based on the correctness probability that target erroneous text exists in the image, determining the error category of the target erroneous text based on a second text recognition unit of the text recognition model; and determining a text recognition result of the image based on the target erroneous text and its error category. With the present method and apparatus, the error category of a wrongly written character can be identified.

Description

Text recognition method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image processing, and in particular, to a text recognition method and apparatus, an electronic device, and a storage medium.
Background
In education scenarios, for example when correcting homework or grading word dictation, it is important both to determine which character a student has written incorrectly and to point out where the mistake lies.
Existing text recognition methods can be divided into single-line and multi-line recognition according to the number of text lines in the input image, and into character-based and sequence-based methods according to the labeling scheme. Single-line, sequence-based methods are currently the mainstream and have converged on a common paradigm that chains a rectification stage, a feature extraction stage, and a recognition/decoding stage in sequence; most methods follow this paradigm and make targeted improvements for specific problems such as curved or blurred text.
However, for Chinese recognition there are few methods that specifically target wrongly written characters; most existing methods can only make a binary judgment of whether a character is wrong, and cannot identify where exactly the mistake lies.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a text recognition method and apparatus, an electronic device, and a storage medium, so as to solve the problem that the error category of a wrongly written character cannot be recognized.
According to an aspect of the present disclosure, there is provided a text recognition method, including:
acquiring a text image to be recognized;
processing the text image to be recognized based on a first text recognition unit of a text recognition model, so as to determine a correctness probability for at least one written character in the text image to be recognized;
when it is determined, based on the correctness probability, that target erroneous text exists in the text image to be recognized, determining an error category of the target erroneous text based on a second text recognition unit of the text recognition model; and
determining a text recognition result of the text image to be recognized based on the target erroneous text and its error category.
According to another aspect of the present disclosure, there is provided a text recognition apparatus, including:
an acquisition module, configured to acquire a text image to be recognized;
a processing module, configured to process the text image to be recognized based on a first text recognition unit of a text recognition model so as to determine a correctness probability for at least one written character in the text image, and, when it is determined based on the correctness probability that target erroneous text exists in the text image, to determine an error category of the target erroneous text based on a second text recognition unit of the text recognition model; and
a determining module, configured to determine a text recognition result of the text image to be recognized based on the target erroneous text and its error category.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing a program,
wherein the program includes instructions that, when executed by the processor, cause the processor to perform the text recognition method.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the text recognition method.
In the present disclosure, the correctness probability of each written character in a text image may be determined based on the first text recognition unit of the text recognition model, and, when target erroneous text exists among the written characters, the error category of the target erroneous text may be determined based on the second text recognition unit of the model. That is to say, the method can not only judge whether a wrongly written character exists, but also identify its error category, that is, the specific location of the mistake, thereby improving the accuracy of wrong-character recognition.
Drawings
Further details, features and advantages of the disclosure are disclosed in the following description of exemplary embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 illustrates a flow chart of a text recognition method provided in accordance with an exemplary embodiment of the present disclosure;
FIG. 2 illustrates an error category identification schematic provided in accordance with an exemplary embodiment of the present disclosure;
FIG. 3 illustrates a diagram of correct text recognition provided in accordance with an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a recognition diagram of a text image provided in accordance with an exemplary embodiment of the present disclosure;
FIG. 5 illustrates a similarity comparison model schematic provided in accordance with an exemplary embodiment of the present disclosure;
FIG. 6 shows a flowchart of a method for training a text recognition model provided in accordance with an exemplary embodiment of the present disclosure;
FIG. 7 shows a schematic block diagram of a text recognition apparatus provided in accordance with an exemplary embodiment of the present disclosure;
FIG. 8 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms are given in the description below. It should be noted that terms such as "first" and "second" in the present disclosure are used only to distinguish different devices, modules or units, and do not limit the order of, or interdependence between, the functions they perform.
It is noted that references to "a" or "an" in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will appreciate that references to "one or more" are intended to be exemplary and not limiting unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
To set out the technical solutions provided by the present disclosure clearly, the relevant technical background is first described.
At present, common text recognition schemes mainly target printed and handwritten text in various languages, so research on text recognition focuses on improving image quality, adding semantic information, fully exploiting position information, multi-language recognition, and the like, and essentially does not address the recognition of wrongly written characters. The main reasons are as follows: first, wrongly written characters hardly occur in printed text; second, they are rare in handwritten data; and third, a recognition model needs a dictionary to convert probability positions into characters, and a single character can be miswritten in many different ways, which is difficult to handle. In general, current recognition methods can at best lump all wrongly written characters into a single catch-all class, but cannot indicate where a particular character went wrong. Yet in education scenarios, such as homework correction or word dictation, it is important to determine which character a student miswrote and to point out where the mistake lies.
For Chinese recognition there are few methods targeting wrongly written characters specifically, because the dictionary is huge and every character can be miswritten in many ways. Take the character '因' (cause) as an example: omitting the stroke '一' yields the valid character '囚' (prisoner), which is not counted as a wrongly written character, while adding an extra '一' turns the inner '大' into '夫' (husband), producing an invalid glyph, i.e., a wrongly written character, and other miswritings of the inner component likewise produce wrongly written characters. If every possible miswriting were added to the dictionary, not only would the number of classification classes explode, but it would also be hard to collect data samples for each miswriting, leading to unbalanced sample distribution and poor recognition performance.
If a staged model is adopted, a two-stage or multi-stage design can greatly reduce the difficulty of recognizing wrongly written characters and has some feasibility. However, a multi-stage model is structurally complex and hard to implement, and even when built, its benefit is limited: the dictionary must be constructed in advance, so once the model is fixed it is difficult to add newly observed wrong characters, and the model is hard to update.
To solve this technical problem, the present disclosure provides a text recognition method that determines the error category of a wrongly written character based on the technical idea of image similarity comparison, which avoids the sample-imbalance problem and the problem of a fixed dictionary that is hard to update. The method may be performed by a terminal, a server, and/or another device with processing capability. The method provided by the embodiments of the present disclosure may be performed by any one of the above devices alone, or by several devices together, which is not limited by the present disclosure.
The method is described below with reference to the flowchart of the text recognition method shown in fig. 1, and comprises the following steps 101 to 104.
Step 101, a text image to be recognized is obtained.
In one possible implementation, when text in an image needs to be recognized, a signal triggering text recognition may be generated and the text image to be recognized acquired. For example, a user may capture an image with a terminal and tap an option to recognize text, thereby triggering the text recognition signal. As another example, the user may long-press an image displayed on the terminal and, after the terminal displays an option to recognize text, tap that option. This embodiment does not limit the specific scenario that triggers the text recognition signal.
In some application scenarios, the text image to be recognized may be an image containing handwriting, i.e., a written-text image; since slips of the pen may occur, the image to be recognized may contain correct and/or erroneous characters.
Step 102, processing the text image to be recognized based on a first text recognition unit of the text recognition model to determine a correctness probability for at least one written character in the image.
Optionally, the method further includes processing the text image based on the first text recognition unit to recognize a second correct text corresponding to each written character.
In one possible implementation, the text recognition model may include a first text recognition unit and a second text recognition unit arranged in parallel. Before use, the text recognition model may be trained; the specific training process is described in another embodiment and is not detailed here.
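As an illustrative sketch only (the patent does not disclose network internals), the parallel two-unit arrangement can be modeled as a shared feature extractor feeding a dictionary-classification head (the first unit) and a glyph-embedding head (the second unit). All weights, dimensions, and function names below are hypothetical placeholders, not the patent's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

DICT_SIZE = 100     # size of the preset character dictionary (assumed)
FEAT_DIM = 32       # shared backbone feature size (assumed)
EMBED_DIM = 16      # glyph feature vector size (assumed)

# Random placeholder weights; a real model would use a trained CNN/transformer
W_backbone = rng.normal(size=(64, FEAT_DIM))         # 8x8 glyph image -> features
W_classify = rng.normal(size=(FEAT_DIM, DICT_SIZE))  # first unit: dictionary head
W_embed = rng.normal(size=(FEAT_DIM, EMBED_DIM))     # second unit: embedding head

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def recognize(glyph_image):
    """Run both units in parallel on one written-character image."""
    feat = np.tanh(glyph_image.reshape(-1) @ W_backbone)
    char_probs = softmax(feat @ W_classify)  # first unit: probs over dictionary
    glyph_vec = feat @ W_embed               # second unit: glyph feature vector
    return char_probs, glyph_vec

probs, vec = recognize(rng.normal(size=(8, 8)))
```

The point of the sketch is only the data flow: one shared feature vector yields both a probability distribution over the character dictionary and a glyph feature vector, in a single forward pass.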
When recognizing the text image, it may be fed as input to the text recognition model, which computes the correctness probability of each written character in the image; this probability is used to judge whether each character is written correctly, i.e., whether wrongly written characters exist. The correctness probability of each written character is compared with a preset probability threshold (e.g., 0.1); when the probability is below the threshold, that character is treated as target erroneous text and the following step 103 is performed.
In some possible embodiments, the first text recognition unit may further recognize each written character in the image and determine the correct character corresponding to each. That is, in these embodiments the first text recognition unit determines both a correctness probability and the corresponding correct character for each written character. If a character is miswritten, the recognized correct text is the correct character predicted for that miswriting. For convenience of description, this embodiment refers to the correct text recognized by the first text recognition unit as the second correct text.
Optionally, a character dictionary may be used to determine the correctness probability of a written character. Accordingly, the processing in step 102 may be as follows: for any one of the at least one written character, the first text recognition unit of the text recognition model determines the probability that the character belongs to each entry in a preset character dictionary, and determines the correctness probability of the character from these probabilities.
Accordingly, the second correct text may be determined as follows: for any one of the at least one written character, the first text recognition unit determines the probability that the character belongs to each entry in the preset character dictionary, and takes the highest-probability entry as the second correct text of that character.
In one possible embodiment, for each written character the first text recognition unit may use a preset character dictionary and compute the probability that the character belongs to each entry in it, so that the character corresponding to each written character (i.e., the second correct text described above) can be determined. As an example, after processing the text image, the first text recognition unit may produce a character probability matrix for each written character, representing the probability that it belongs to each dictionary entry; the corresponding dictionary entry may then be found by greedy decoding or beam search decoding. This embodiment does not limit the specific lookup method; for example, the highest-probability entry may simply be taken as the character. A wrongly written character is not a real Chinese character, so it has no corresponding entry in the dictionary and all computed probabilities stay relatively low; the maximum probability can therefore serve as the correctness probability used to judge whether the character is written correctly.
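The greedy-decoding and thresholding step described above can be sketched as follows; the toy dictionary, probability values, and the threshold of 0.1 (the example value from the description) are illustrative only:

```python
import numpy as np

CHAR_DICT = ["因", "天", "大", "夫"]  # toy preset character dictionary
THRESHOLD = 0.1                       # preset probability threshold

# Character probability matrix from the first unit: one row per written
# character, one column per dictionary entry. A wrongly written glyph matches
# no entry well, so its whole row stays low (values here are made up; a real
# model's softmax rows would sum to 1 over a much larger dictionary).
char_prob_matrix = np.array([
    [0.90, 0.05, 0.03, 0.02],  # clearly "因"
    [0.04, 0.88, 0.05, 0.03],  # clearly "天"
    [0.04, 0.03, 0.05, 0.06],  # no good match: likely a wrong character
])

def greedy_decode(matrix):
    results = []
    for row in matrix:
        idx = int(np.argmax(row))         # greedy: best dictionary entry
        correct_prob = float(row[idx])    # max prob = correctness probability
        results.append({
            "predicted": CHAR_DICT[idx],  # the "second correct text"
            "correct_prob": correct_prob,
            "is_error": correct_prob < THRESHOLD,  # target erroneous text?
        })
    return results

decoded = greedy_decode(char_prob_matrix)
# The third written character is flagged as target erroneous text
```

Beam search decoding would keep the top-k dictionary entries per character rather than only the argmax; the thresholding logic stays the same.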
When a wrongly written character (i.e., target erroneous text) exists among the written characters, the following step 103 may be performed to further determine the corresponding error category through the second text recognition unit.
Step 103, when it is determined based on the correctness probability that target erroneous text exists in the text image to be recognized, determining the error category of the target erroneous text based on a second text recognition unit of the text recognition model.
Referring to the error-category recognition diagram shown in fig. 2, the specific process may be as follows: processing the text image to be recognized based on the second text recognition unit, and determining a glyph feature vector for the written-character image corresponding to each written character; acquiring first glyph feature vectors of at least one preset wrong-glyph image corresponding to the target erroneous text; determining a first similarity between the glyph feature vector of the target erroneous text's written-character image and each first glyph feature vector; and determining the error category of the target erroneous text based on the first similarity.
The target erroneous text is a written character, among the at least one written character of step 102, whose correctness probability is below the preset probability threshold.
In one possible implementation, each character may have at least one way of being miswritten, each corresponding to a different wrong glyph. An image of each wrong glyph may be collected in advance as a preset wrong-glyph image of that character, and a glyph feature vector extracted from it and stored as a first glyph feature vector, to serve as a recognition reference when error categories are later identified. Alternatively, the preset wrong-glyph images of the character may be stored in a preset wrong-glyph dictionary for that character and used as the recognition reference.
In the second text recognition unit, a glyph feature vector may be extracted for each written character. When the probability computed by the first text recognition unit is below the preset probability threshold (e.g., 0.1), indicating that the character is wrongly written, its glyph feature vector is retrieved and compared with the pre-stored first glyph feature vectors of the preset wrong-glyph images: the similarity to each first glyph feature vector is computed, and the preset wrong-glyph image with the highest similarity determines the error classification. Alternatively, the preset wrong-glyph images may be fetched from the preset wrong-glyph dictionary of that character, a first glyph feature vector extracted from each, and the same similarity comparison performed to find the most similar wrong-glyph image and hence the error classification.
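A minimal sketch of this similarity comparison, assuming cosine similarity (the patent does not fix a particular similarity measure) and made-up feature values:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two glyph feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pre-stored first glyph feature vectors for one character's known
# miswritings, keyed by error class (values are illustrative)
wrong_glyph_bank = {
    "error_class_1": np.array([0.9, 0.1, 0.0, 0.2]),
    "error_class_2": np.array([0.1, 0.8, 0.3, 0.0]),
}

def classify_error(glyph_vec, bank):
    """Return the error class whose stored vector is most similar."""
    sims = {cls: cosine_sim(glyph_vec, ref) for cls, ref in bank.items()}
    best = max(sims, key=sims.get)  # highest first similarity wins
    return best, sims

# Glyph feature vector of the wrongly written character (second unit's output)
written_vec = np.array([0.85, 0.15, 0.05, 0.25])
category, sims = classify_error(written_vec, wrong_glyph_bank)
```

Because new miswritings only require adding a reference vector to the bank, this avoids retraining a classifier with a fixed output dictionary, which is the update problem the background section describes.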
For example, if the comparison finds that the most similar wrong-glyph image is the one showing '夫' inside the enclosure radical '囗', error class one of the character '因' is obtained, indicating that specific miswriting; if the most similar wrong-glyph image is a different stored miswriting of '因', error class two is obtained, indicating the corresponding miswriting.
Optionally, referring to the correct-text recognition diagram shown in fig. 3, in the second text recognition unit, besides finding the error classification from the glyph feature vector of a written character, the character corresponding to the correct text may also be determined. The processing may be as follows: determining second similarities between the glyph feature vector of each written character's image and the second glyph feature vectors of a plurality of preset correct-glyph images; and determining the first correct text corresponding to each written character based on the second similarities.
In one possible implementation, for each character, an image of its correct glyph may be collected in advance as a preset correct-glyph image, and a glyph feature vector extracted from it and stored as a second glyph feature vector, to serve as a recognition reference when correct text is later identified. Alternatively, the preset correct-glyph images may be stored in a preset correct-glyph dictionary and used as the recognition reference.
In the second text recognition unit, a glyph feature vector may be extracted for each written character and compared with the pre-stored second glyph feature vector of each preset correct-glyph image: the similarity to each is computed, and the correct-glyph image with the highest similarity yields the corresponding correct text (i.e., the first correct text). Alternatively, the preset correct-glyph images may be fetched from the preset correct-glyph dictionary, a second glyph feature vector extracted from each, and the same similarity comparison performed to find the most similar correct-glyph image and hence the first correct text.
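The correct-text lookup can be sketched the same way; stacking all stored second glyph feature vectors into one matrix turns the comparison into a single matrix product, which is also the form that maps naturally onto a GPU. Characters and vector values below are illustrative:

```python
import numpy as np

# Pre-stored second glyph feature vectors of the preset correct-glyph images,
# one row per character in the dictionary (toy values, L2-normalized below)
correct_chars = ["因", "天", "大"]
correct_vecs = np.array([
    [1.0, 0.0, 0.1],
    [0.0, 1.0, 0.1],
    [0.1, 0.1, 1.0],
])
correct_vecs /= np.linalg.norm(correct_vecs, axis=1, keepdims=True)

def first_correct_text(glyph_vec):
    """Return the correct character whose stored glyph vector is most similar."""
    v = glyph_vec / np.linalg.norm(glyph_vec)
    sims = correct_vecs @ v          # all cosine similarities in one product
    i = int(np.argmax(sims))
    return correct_chars[i], float(sims[i])

char, sim = first_correct_text(np.array([0.9, 0.1, 0.15]))
```

Batching many written characters into a matrix on the left of the product gives the whole image's second similarities in one GPU-friendly operation.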
The image similarity computation can be accelerated with a GPU (Graphics Processing Unit), which effectively improves processing efficiency.
Optionally, obtaining the first glyph feature vectors of the at least one preset wrong-glyph image corresponding to the target erroneous text may include: after the first target correct text corresponding to the target erroneous text is determined, determining at least one preset wrong-glyph image corresponding to that first target correct text; and taking the glyph feature vectors of these preset wrong-glyph images as the first glyph feature vectors of the at least one preset wrong-glyph image corresponding to the target erroneous text.
A first correct text may correspond to at least one wrong glyph, and the preset wrong-glyph image of each may be stored under the corresponding first correct text, recording the various ways that character can be miswritten. On this basis, after the first correct text of each written character has been determined, if the first text recognition unit determines that erroneous text exists, the at least one preset wrong-glyph image stored under the first correct text of the target erroneous text can be looked up and treated as the preset wrong-glyph images of the target erroneous text, so as to obtain the corresponding first glyph feature vectors.
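A minimal sketch of storing preset wrong-glyph images under their correct character, so that only the miswritings of the predicted character need to be compared rather than every stored wrong glyph; the file names are hypothetical placeholders:

```python
# Preset wrong-glyph dictionary: each correct character maps to the images of
# its known miswritings (placeholder file names, not real patent data)
wrong_glyph_dict = {
    "因": ["因_err1.png", "因_err2.png"],  # known miswritings of 因
    "天": ["天_err1.png"],                 # known miswritings of 天
}

def candidate_wrong_glyphs(first_target_correct_text):
    """Look up the preset wrong-glyph images for one predicted character.

    A character with no recorded miswritings yields an empty candidate list;
    feature vectors would then be extracted from (or pre-stored for) each
    returned image for the similarity comparison.
    """
    return wrong_glyph_dict.get(first_target_correct_text, [])

candidates = candidate_wrong_glyphs("因")
```

Adding a newly observed miswriting is just an append to the relevant list, which is how this design sidesteps the fixed-dictionary update problem.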
Step 104, determining a text recognition result of the text image to be recognized based on the target erroneous text and its error category.
In one possible implementation, the target erroneous text may be cropped from the text image, or the matched preset wrong-glyph image may be retrieved, and combined with the determined error category to form the erroneous-text recognition result, which serves as the text recognition result of the text image to be recognized.
Optionally, for the case where both the first correct text and the second correct text are determined, the processing of step 104 may further be: determining a correct-text recognition result of the text image based on the target correct text corresponding to each written character, where the target correct text is whichever of the first correct text and second correct text of that character has the higher confidence; determining an erroneous-text recognition result of the text image based on the wrong-glyph image corresponding to the target erroneous text and its error category; and taking the correct-text recognition result and the erroneous-text recognition result together as the recognition result of the text image to be recognized.
In a possible implementation manner, the processing result of the text recognition model may include both a correct text recognition result and a wrong text recognition result, and the two may be integrated into the final recognition result. For example, when both exist at the same time, the characters of the correct text can be displayed together with an indication of the specific incorrect writing, such as pointing out which character was miswritten and how it should be corrected.
Referring to the schematic diagram of text image recognition shown in fig. 4, for each written text, if the first correct text and the second correct text are the same text, either one may be selected as the correct text corresponding to the written text; if they are different texts, the text with the higher confidence may be selected. For example, in the above process the first correct text is determined from a recognition probability and the second correct text from a compared similarity; for the written text, when the probability is greater than the similarity, the first correct text is selected as the higher-confidence correct text. On this basis, the accuracy of correct-text recognition can be improved.
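The confidence comparison just described reduces to a small selection rule. The sketch below is illustrative; the function name is assumed, and it treats the recognition probability and the glyph similarity as directly comparable scores, as the passage above does.

```python
# Pick the target correct text: same text -> keep it; otherwise take whichever
# of the two candidates carries the higher confidence score.
def pick_target_correct(first_text, prob, second_text, similarity):
    if first_text == second_text:
        return first_text
    return first_text if prob > similarity else second_text

choice = pick_target_correct("day", 0.92, "sky", 0.85)
```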
Where a wrong character exists, the corresponding target wrong text is cropped from the text image, or the matching preset wrong font image found in the comparison is obtained, and combined with the determined error category to form the corresponding wrong text recognition result.
The correct text recognition result and the wrong text recognition result are then integrated to obtain the text recognition result of the text image to be recognized.
In this embodiment, the correct probability of each written text in the text image may be determined based on the first text recognition unit in the text recognition model, and when a target wrong text exists among the written texts, the error category of the target wrong text may be determined based on the second text recognition unit in the text recognition model. That is to say, the model can not only judge whether a wrong character exists, but also identify its error category, i.e., the specific position of the error, which improves the accuracy of wrong-character recognition.
Moreover, the second text recognition unit determines the error category by comparing font feature vectors. When a new wrong character appears, its font feature vector is simply extracted and added as a comparison reference; the processing of the text recognition model is unaffected, the model needs no modification or retraining, and updating is therefore convenient.
The above describes the overall flow of the text recognition method, in which font feature vectors are extracted from the preset correct font images and the preset wrong font images. This embodiment now provides an optional method for that feature extraction.
The method comprises the following steps: acquiring preset correct font images of a plurality of correct texts and preset wrong font images of at least one wrong text corresponding to each correct text; and extracting the features of the preset wrong font image to obtain a first font feature vector, and extracting the features of the preset correct font image to obtain a second font feature vector.
In one possible implementation, a large number of text images may be collected (covering straight, oblique, and curved text in terms of layout, and conventional, blurred, and photocopied text in terms of image quality, among others) and manually labeled with their text character information: the entire character sequence is labeled, and for a portion of the data the coordinate box of each single character is labeled as well. For example, if a wrong character exists in a text image, it may be labeled with the identifier "EC" together with its coordinate box.
A character dictionary can be built from the labeled characters; it contains independent characters, all of which are correct text. Further, each character can be rendered onto an image of a specified size to form a preset correct font image; optionally, the background of the image is pure white and the character pure black. A preset correct font dictionary is then built from the preset correct font image of each character.
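Rendering each dictionary character onto a fixed-size white canvas in black might look like the sketch below. The patent names no library; Pillow and its built-in bitmap font (ASCII-only) are assumptions here purely for illustration.

```python
# Render a character in pure black on a pure-white canvas of a specified size,
# then return it as a NumPy array for downstream feature extraction.
from PIL import Image, ImageDraw, ImageFont
import numpy as np

def render_glyph(char, size=64):
    img = Image.new("L", (size, size), 255)          # pure-white background
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()                  # ASCII-only bitmap font
    draw.text((size // 4, size // 4), char, fill=0, font=font)  # pure-black glyph
    return np.asarray(img)

glyph = render_glyph("A")
```

A real system would substitute a CJK-capable TrueType font via `ImageFont.truetype` to cover the full character dictionary.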
The wrong texts are cropped out according to the labeled wrong-character coordinate boxes to obtain the preset wrong font images corresponding to all wrong texts; the different wrong texts corresponding to the same correct text are then organized together to build a preset wrong font dictionary.
Furthermore, feature extraction can be respectively carried out on the font images in the preset correct font dictionary and the preset wrong font dictionary to obtain corresponding font feature vectors.
Optionally, feature extraction may be performed by using a feature extraction branch in the similarity comparison model, and the corresponding processing may be as follows:
constructing a similarity comparison model;
training the similarity comparison model based on training samples, wherein the training samples comprise positive samples and negative samples, the positive samples comprise text images with the same text, and the negative samples comprise text images with different texts;
after training is finished, feature extraction is carried out on the preset wrong font image based on the trained feature extraction branch to obtain a first font feature vector, and feature extraction is carried out on the preset correct font image to obtain a second font feature vector.
Referring to the schematic diagram of the similarity comparison model shown in fig. 5, the similarity comparison model may include a plurality of parallel feature extraction branches and a feature judgment module connected in series after them, where the parallel feature extraction branches share weights.
In a possible implementation manner, a plurality of text images may be collected; a pair of images containing the same text serves as a positive sample and a pair containing different texts as a negative sample, yielding a number of positive and negative samples as training samples for the similarity comparison model. Optionally, the ratio of positive to negative samples may be set to 1:3. Optionally, the text images used as training samples may be single-character images.
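Assembling training pairs at the 1:3 positive-to-negative ratio described above might be sketched as follows. The data structure `images_by_char` (mapping each character to its collected single-character images) and the function name are assumptions for illustration.

```python
# Build (image_a, image_b, label) pairs: label 1 for same-text pairs,
# label 0 for different-text pairs, at a 1:3 positive:negative ratio.
import random

def make_pairs(images_by_char, n_pos, seed=0):
    rng = random.Random(seed)
    chars = list(images_by_char)
    pairs = []
    for _ in range(n_pos):                  # positive samples: same text
        c = rng.choice(chars)
        pairs.append((rng.choice(images_by_char[c]),
                      rng.choice(images_by_char[c]), 1))
    for _ in range(3 * n_pos):              # negative samples: different texts
        c1, c2 = rng.sample(chars, 2)
        pairs.append((rng.choice(images_by_char[c1]),
                      rng.choice(images_by_char[c2]), 0))
    rng.shuffle(pairs)
    return pairs

pairs = make_pairs({"a": ["a1", "a2"], "b": ["b1", "b2"]}, n_pos=2)
```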
During training, a pair of images is input into the similarity comparison model to judge whether they are similar. For example, the feature extraction branch may adopt a ResNet-18 model (a residual network); the feature judgment module may first concatenate the two sets of feature maps produced by the parallel feature extraction branches, then process them through 2 convolutional layers and 3 fully-connected layers, the last of which has 2 output nodes for judging whether the two input images are similar. The loss function is a two-class cross-entropy loss.
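A minimal PyTorch sketch of that structure is given below. A small CNN stands in for ResNet-18, and all layer sizes are illustrative assumptions; only the shape of the architecture (weight-shared branches, concatenated feature maps, 2 convolutional plus 3 fully-connected layers ending in 2 nodes) follows the description above.

```python
# Siamese-style similarity model: one branch applied to both inputs (so the
# branches share weights), channel-wise concatenation, then a judgment head.
import torch
import torch.nn as nn

class SimilarityModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch = nn.Sequential(           # stand-in for ResNet-18
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.judge = nn.Sequential(            # 2 conv + 3 fully-connected layers
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 8, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(8 * 16 * 16, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 2),                  # similar / dissimilar logits
        )

    def forward(self, x1, x2):
        f1, f2 = self.branch(x1), self.branch(x2)      # shared weights
        return self.judge(torch.cat([f1, f2], dim=1))  # serial concatenation

model = SimilarityModel()
logits = model(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64))
```

Training would pair these logits with `nn.CrossEntropyLoss` over the 0/1 similarity labels, matching the two-class cross-entropy loss named above.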
After training is completed, any one of the feature extraction branches can be retained; it has high sensitivity to text images, improving the accuracy of feature extraction. This branch is then used to extract features from the preset wrong font images to obtain the first font feature vectors, and from the preset correct font images to obtain the second font feature vectors.
Optionally, when a new wrong character subsequently appears, the following processing may be performed: when the wrong font image corresponding to the target wrong text does not match any preset wrong font image in the preset wrong font dictionary, the wrong font image corresponding to the target wrong text is stored into the preset wrong font dictionary. Features are extracted from the new wrong font image in the same manner as above to obtain a newly added first font feature vector, which is stored at the corresponding position, associated with the wrong text and its corresponding correct text.
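That update step might be sketched as below. The dictionary layout, function name, cosine check, and threshold are assumptions; the patent only requires storing the new wrong font image and its feature vector when it matches no existing preset wrong font image.

```python
# Add a newly observed wrong glyph under its correct text unless an existing
# preset wrong-glyph vector is already sufficiently similar; the recognition
# model itself is never touched.
import numpy as np

def update_wrong_dict(wrong_dict, correct_text, new_image, new_vec, thresh=0.9):
    entries = wrong_dict.setdefault(correct_text, [])
    for _, vec in entries:
        cos = np.dot(new_vec, vec) / (
            np.linalg.norm(new_vec) * np.linalg.norm(vec) + 1e-8)
        if cos >= thresh:
            return False          # already covered by a stored wrong glyph
    entries.append((new_image, new_vec))
    return True

store = {}
added = update_wrong_dict(store, "X", "glyph.png", np.array([1.0, 0.0]))
```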
In this embodiment, when a new wrong word occurs, the font feature vector of the new wrong word may be extracted as the comparison reference, and the processing of the text recognition model is not affected.
Since the feature extraction branch of the similarity comparison model judges whether a pair of input images is similar without relying on the specific semantics contained in the images, it remains applicable to feature extraction when new wrong characters appear, which improves the convenience of updating.
The text recognition model used in the above embodiments may be a machine learning model, which may be trained before being used in the above processing. This embodiment describes a training method for the text recognition model.
Referring to fig. 6, a flowchart of a method for training a text recognition model is shown, which includes the following steps 601-603.
Step 601, constructing an initial text recognition model, wherein the initial text recognition model comprises a feature extraction module and a text recognition module;
step 602, training an initial text recognition model based on a sample text image of a correct text;
step 603, after the training is finished, the trained initial text recognition model is used as a first text recognition unit, the trained feature extraction module is used as a second text recognition unit, and the text recognition model is constructed.
In one possible implementation, a large number of sample text images of correct text may be collected as training samples (covering straight, oblique, and curved text in terms of layout, and conventional, blurred, and photocopied text in terms of image quality, among others, all being single-line text images) and then manually labeled with their text character information, i.e., the entire character sequence. Optionally, the training samples can be divided into a first data set, containing text images with a clean background and neat, standard writing and no wrong characters, and a second data set, containing the remaining text images without wrong characters, for example sloppily written text images free of wrong characters.
An initial text recognition model is constructed. Illustratively, it may include a ResNet-18 network, two layers of bidirectional LSTM (Long Short-Term Memory), one attention layer, and one GRU (Gated Recurrent Unit) layer. The ResNet-18 network serves as the feature extraction module, and the remaining layers form the text recognition module. The body of ResNet-18 consists of 4 stages, each stage contains several blocks, each block consists of several convolution operations, and the output of each block is the input of the next.
During training, each training sample is input into the initial text recognition model; processing by the ResNet-18 network yields a set of feature maps, which form the input of the two-layer bidirectional LSTM. The bidirectional LSTM models the context of its input and outputs feature maps of the same dimension. The attention layer and the GRU layer form a decoder: the decoder takes the GRU's hidden state vector from the previous step as the query vector Q, takes the feature maps output by the LSTM as the key vectors K and value vectors V, computes attention scores, computes a context vector from the scores, computes the hidden state vector of the current step from the decoder's previous output and the context vector, and from the current hidden state vector and the context vector obtains a character probability matrix for the currently predicted written text, representing the probability that the written text is each character in the character dictionary. The loss function is a multi-class cross-entropy loss.
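One decoder step as described above might look like the PyTorch sketch below. All dimensions, the dot-product attention score, and the function name are illustrative assumptions; the sketch shows only the data flow (Q from the previous GRU hidden state, K and V from the BiLSTM features, context plus hidden state feeding the character probabilities).

```python
# One attention-GRU decoder step: attend over the BiLSTM feature sequence,
# update the hidden state, and emit a probability vector over the dictionary.
import torch
import torch.nn as nn
import torch.nn.functional as F

T, D, vocab = 10, 32, 100                 # time steps, feature dim, dictionary size
gru = nn.GRUCell(D + vocab, D)            # input: previous output + context
out_proj = nn.Linear(D + D, vocab)        # hidden state + context -> characters

def decode_step(h_prev, y_prev, features):
    """features: (T, D) BiLSTM output; h_prev: (D,); y_prev: (vocab,)."""
    scores = F.softmax(features @ h_prev, dim=0)   # attention scores (Q=h_prev)
    context = scores @ features                    # (D,) context vector
    h = gru(torch.cat([y_prev, context]).unsqueeze(0),
            h_prev.unsqueeze(0)).squeeze(0)        # current hidden state
    probs = F.softmax(out_proj(torch.cat([h, context])), dim=0)
    return h, probs

h, probs = decode_step(torch.zeros(D), torch.zeros(vocab), torch.randn(T, D))
```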
After the training is completed, the whole model may be reserved as the first text recognition unit, the Resnet18 network may be reserved as the second text recognition unit, and a text recognition model may be constructed based on the first text recognition unit and the second text recognition unit, so as to implement the text recognition process.
In this embodiment, the training samples are sample text images of correct text, which are easier to obtain than text images of wrong text and available in large quantities, so the problem of sample imbalance can be avoided.
The embodiment of the disclosure provides a text recognition device, which is used for realizing the text recognition method. As shown in fig. 7, a schematic block diagram of a text recognition apparatus 700 includes: an obtaining module 701, a processing module 702, and a determining module 703.
An obtaining module 701, configured to obtain a text image to be recognized;
a processing module 702, configured to process the to-be-recognized text image based on a first text recognition unit of a text recognition model to determine a correct probability of at least one written text in the to-be-recognized text image; when the target error text exists in the text image to be recognized based on the correct probability, determining the error category of the target error text based on a second text recognition unit of the text recognition model;
a determining module 703, configured to determine a text recognition result of the text image to be recognized based on the target error text and the error category thereof.
Optionally, the processing module 702 is configured to:
processing the text image to be recognized based on the second text recognition unit, and determining a font feature vector of the written text image corresponding to each written text;
acquiring a first font characteristic vector of at least one preset wrong font image corresponding to the target wrong text;
determining a first similarity between a font characteristic vector of a written text image corresponding to the target wrong text and the first font characteristic vector;
and determining the error category of the target error text based on the first similarity, wherein the target error text is a written text of which the correct probability is smaller than a preset probability threshold value in the at least one written text.
Optionally, the processing module 702 is further configured to:
determining second similarity between the font characteristic vector of the written text image corresponding to each written text and second font characteristic vectors of a plurality of preset correct font images;
and determining the first correct text corresponding to each written text based on the second similarity.
Optionally, the first correct text corresponds to at least one preset wrong font image;
the processing module 702 is configured to:
after a first target correct text corresponding to the target wrong text is determined, determining at least one preset wrong font image corresponding to the first target correct text;
and determining the font characteristic vector of at least one preset wrong font image corresponding to the first target correct text as the first font characteristic vector of at least one preset wrong font image corresponding to the target wrong text.
Optionally, the processing module 702 is configured to:
and for any one of the at least one written text, determining the probability that the written text belongs to each character in a preset character dictionary based on a first text recognition unit of a text recognition model, and determining the correct probability of the written text based on the probability.
Optionally, the processing module 702 is further configured to:
for any one of the at least one written text, determining the probability that the written text belongs to each character in a preset character dictionary based on a first text recognition unit of a text recognition model;
and determining a second correct text corresponding to the written text based on the probability that the written text belongs to each character in the preset character dictionary.
Optionally, the determining module 703 is configured to:
determining a correct text recognition result of the text image to be recognized based on a target correct text corresponding to each written text, wherein the target correct text is a correct text with high confidence level in a first correct text and a second correct text corresponding to the written text;
determining an error text recognition result of the text image to be recognized based on the error font image corresponding to the target error text and the error category thereof;
and taking the correct text recognition result and the wrong text recognition result as the recognition result of the text image to be recognized.
Optionally, each written text corresponds to a preset wrong font dictionary, the preset wrong font dictionary comprising a plurality of preset wrong font images,
wherein the apparatus further comprises an update module configured to:
and when the wrong font image corresponding to the target wrong text does not belong to any preset wrong font image in the preset wrong font library, storing the wrong font image corresponding to the target wrong text into the preset wrong font library dictionary.
In this embodiment, the correct probability of each written text in the text image may be determined based on the first text recognition unit in the text recognition model, and when a target wrong text exists among the written texts, the error category of the target wrong text may be determined based on the second text recognition unit in the text recognition model. That is to say, the apparatus can not only judge whether a wrong character exists, but also identify its error category, i.e., the specific position of the error, which improves the accuracy of wrong-character recognition.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, the computer program, when executed by the at least one processor, is operative to cause the electronic device to perform a method according to embodiments of the disclosure.
The disclosed exemplary embodiments also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
The exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
Referring to fig. 8, a block diagram of a structure of an electronic device 800, which may be a server or a client of the present disclosure, which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The calculation unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information to the electronic device 800; it may receive input numeric or text information and generate key signal inputs related to user settings and/or function controls of the electronic device. The output unit 807 can be any type of device capable of presenting information and can include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 808 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as Bluetooth devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 801 may be any of a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 801 performs the respective methods and processes described above. For example, in some embodiments, the text recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. In some embodiments, the computing unit 801 may be configured to perform the text recognition method in any other suitable manner (e.g., by means of firmware).
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (11)

1. A method of text recognition, the method comprising:
acquiring a text image to be recognized;
processing the text image to be recognized based on a first text recognition unit of a text recognition model to determine the correct probability of at least one written text in the text image to be recognized;
when the target error text exists in the text image to be recognized based on the correct probability, determining the error category of the target error text based on a second text recognition unit of the text recognition model;
and determining a text recognition result of the text image to be recognized based on the target error text and the error category thereof.
2. The method according to claim 1, wherein when it is determined that the target error text exists in the text image to be recognized based on the correct probability, determining an error category of the target error text based on a second text recognition unit of the text recognition model comprises:
processing the text image to be recognized based on the second text recognition unit, and determining a font characteristic vector of a written text image corresponding to each written text;
acquiring a first font characteristic vector of at least one preset wrong font image corresponding to the target wrong text;
determining a first similarity between a font characteristic vector of a written text image corresponding to the target wrong text and the first font characteristic vector;
and determining the error category of the target error text based on the first similarity, wherein the target error text is a written text of which the correct probability is smaller than a preset probability threshold in the at least one written text.
3. The method of claim 2, further comprising:
determining second similarity between the font characteristic vector of the written text image corresponding to each written text and second font characteristic vectors of a plurality of preset correct font images;
and determining the first correct text corresponding to each written text based on the second similarity.
4. The method of claim 3, wherein the first correct text corresponds to at least one preset incorrect glyph image;
the obtaining of the first font feature vector of the at least one preset wrong font image corresponding to the target wrong text includes:
after determining a first correct text corresponding to the target wrong text, determining at least one preset wrong font image corresponding to the first correct text;
and determining the font characteristic vector of at least one preset wrong font image corresponding to the first correct text as the first font characteristic vector of at least one preset wrong font image corresponding to the target wrong text.
5. The method according to any one of claims 1-4, wherein the first text recognition unit based on a text recognition model processes the text image to be recognized to determine a probability of correctness of at least one written text in the text image to be recognized, comprising:
and for any one of the at least one written text, determining the probability that the written text belongs to each character in a preset character dictionary based on a first text recognition unit of a text recognition model, and determining the correct probability of the written text based on the probability.
6. The method according to any one of claims 1-4, further comprising:
for any one of the at least one written text, determining the probability that the written text belongs to each character in a preset character dictionary based on a first text recognition unit of a text recognition model;
and determining a second correct text corresponding to the written text based on the probability that the written text belongs to each character in the preset character dictionary.
7. The method according to claim 6, wherein the determining the text recognition result of the text image to be recognized based on the target error text and the error category thereof comprises:
determining a correct text recognition result of the text image to be recognized based on a target correct text corresponding to each written text, wherein the target correct text is whichever of the first correct text and the second correct text corresponding to the written text has the higher confidence;
determining an error text recognition result of the text image to be recognized based on the error font image corresponding to the target error text and the error category thereof;
and taking the correct text recognition result and the wrong text recognition result as the recognition result of the text image to be recognized.
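The selection in claim 7 reduces to a confidence comparison between the two candidate corrections. A minimal sketch (the tie-breaking rule toward the first correct text is an assumption; the claim is silent on ties):

```python
def select_target_correct_text(first_text, first_conf, second_text, second_conf):
    # Per claim 7: the target correct text is whichever of the first correct
    # text (glyph-similarity based) and the second correct text
    # (character-dictionary based) has the higher confidence.
    return first_text if first_conf >= second_conf else second_text
```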
8. The method according to any one of claims 1-4, wherein each written text corresponds to a preset wrong font dictionary comprising a plurality of preset wrong font images,
wherein the method further comprises:
and when the wrong font image corresponding to the target wrong text does not match any preset wrong font image in the preset wrong font dictionary, storing the wrong font image corresponding to the target wrong text in the preset wrong font dictionary.
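The dictionary update in claim 8 can be sketched as follows. The glyph image is represented here as an opaque hashable key (e.g. a file path or image hash); the patent does not fix a representation, and the matching test is simplified to exact membership.

```python
def update_wrong_font_dictionary(wrong_font_dict, written_text, glyph_image):
    # Per claim 8: if the target wrong text's glyph image matches none of
    # the preset wrong font images, store it in the preset wrong font
    # dictionary so future recognitions can reuse it.
    images = wrong_font_dict.setdefault(written_text, [])
    if glyph_image not in images:
        images.append(glyph_image)
    return wrong_font_dict
```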
9. A text recognition apparatus, the apparatus comprising:
the acquisition module is used for acquiring a text image to be recognized;
the processing module is used for processing the text image to be recognized based on a first text recognition unit of a text recognition model so as to determine the correct probability of at least one written text in the text image to be recognized; and when it is determined, based on the correct probability, that a target wrong text exists in the text image to be recognized, determining the error category of the target wrong text based on a second text recognition unit of the text recognition model;
and the determining module is used for determining a text recognition result of the text image to be recognized based on the target error text and the error category thereof.
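The three-module apparatus of claim 9 can be sketched as a class. All names, the recognition-unit interfaces, and the 0.5 threshold for flagging a target wrong text are illustrative assumptions, not part of the claim:

```python
class TextRecognitionApparatus:
    """Illustrative sketch of claim 9's acquisition/processing/determining modules."""

    def __init__(self, first_unit, second_unit):
        self.first_unit = first_unit    # first text recognition unit of the model
        self.second_unit = second_unit  # second text recognition unit of the model

    def acquire(self, image):
        # Acquisition module: obtain the text image to be recognized.
        return image

    def process(self, image):
        # Processing module: correct probabilities for each written text,
        # then error categories for any target wrong text found.
        probs = self.first_unit(image)
        wrong = [t for t, p in probs.items() if p < 0.5]  # illustrative threshold
        categories = {t: self.second_unit(t) for t in wrong}
        return probs, categories

    def determine(self, probs, categories):
        # Determining module: assemble the text recognition result.
        return {"correct_probabilities": probs, "error_categories": categories}
```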
10. An electronic device, comprising:
a processor; and
a memory for storing a program,
wherein the program comprises instructions which, when executed by the processor, cause the processor to carry out the method according to any one of claims 1-8.
11. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
CN202210898184.2A 2022-07-28 2022-07-28 Text recognition method and device, electronic equipment and storage medium Pending CN115273103A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210898184.2A CN115273103A (en) 2022-07-28 2022-07-28 Text recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210898184.2A CN115273103A (en) 2022-07-28 2022-07-28 Text recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115273103A (en) 2022-11-01

Family

ID=83771130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210898184.2A Pending CN115273103A (en) 2022-07-28 2022-07-28 Text recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115273103A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116882383A (en) * 2023-07-26 2023-10-13 中信联合云科技有限责任公司 Digital intelligent proofreading system based on text analysis


Similar Documents

Publication Publication Date Title
US11062090B2 (en) Method and apparatus for mining general text content, server, and storage medium
CN113254654B (en) Model training method, text recognition method, device, equipment and medium
EP3872652B1 (en) Method and apparatus for processing video, electronic device, medium and product
CN113205160B (en) Model training method, text recognition method, model training device, text recognition device, electronic equipment and medium
CN113673432A (en) Handwriting recognition method, touch display device, computer device and storage medium
EP4170542A2 (en) Method for sample augmentation
CN111444905B (en) Image recognition method and related device based on artificial intelligence
CN110414622B (en) Classifier training method and device based on semi-supervised learning
CN113963364A (en) Target laboratory test report generation method and device, electronic equipment and storage medium
CN114639096A (en) Text recognition method and device, electronic equipment and storage medium
CN112149680A (en) Wrong word detection and identification method and device, electronic equipment and storage medium
CN115100659A (en) Text recognition method and device, electronic equipment and storage medium
CN115273103A (en) Text recognition method and device, electronic equipment and storage medium
CN111444906B (en) Image recognition method and related device based on artificial intelligence
CN115294581A (en) Method and device for identifying error characters, electronic equipment and storage medium
CN111144345A (en) Character recognition method, device, equipment and storage medium
US11610396B2 (en) Logo picture processing method, apparatus, device and medium
CN114663886A (en) Text recognition method, model training method and device
CN115909376A (en) Text recognition method, text recognition model training device and storage medium
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN114758331A (en) Text recognition method and device, electronic equipment and storage medium
CN114708580A (en) Text recognition method, model training method, device, apparatus, storage medium, and program
CN114549695A (en) Image generation method and device, electronic equipment and readable storage medium
CN113591857A (en) Character image processing method and device and ancient Chinese book image identification method
CN113177479A (en) Image classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination