CN115641598A - Text recognition method and device, electronic equipment and storage medium
- Publication number: CN115641598A
- Application number: CN202211158135.1A
Abstract
The disclosure relates to a text recognition method and apparatus, an electronic device, and a storage medium, and relates to the field of computer technologies. The method and apparatus are used to reduce mis-correction of text recognition results in the related art and to improve text recognition accuracy. The method comprises the following steps: performing text feature recognition processing on an image to be recognized to obtain an initial recognition result; determining a target character in the initial recognition result whose first confidence is less than or equal to a first preset threshold; performing semantic feature extraction processing on the initial recognition result and predicting the character at a target position in the initial recognition result to obtain a second candidate character set; and determining a replacement character for the target character based on the intersection of the second candidate character set and the first target candidate character set, and determining a target recognition result of the image to be recognized based on the replacement character and the initial recognition result.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a text recognition method and apparatus, an electronic device, and a storage medium.
Background
Image text recognition is a technique that uses optical technology and a text recognition model to scan and recognize text and characters in an image, and finally converts the text in the image into a text format for further editing and processing by text processing software. However, because some images to be recognized contain blurred or occluded writing, part of the text is recognized incorrectly, which reduces text recognition accuracy.
Disclosure of Invention
The disclosure provides a text recognition method and apparatus, an electronic device, and a storage medium, which are used to reduce mis-correction of text recognition results in the related art and to improve text recognition accuracy. The technical solution of the disclosure is as follows:
According to a first aspect of the embodiments of the present disclosure, there is provided a text recognition method, including: performing text feature recognition processing on an image to be recognized to obtain an initial recognition result, where the initial recognition result includes at least one initial character, each initial character has a corresponding first candidate character set, each initial character is a character in its corresponding first candidate character set, and any first candidate character set is obtained by recognizing a preset position in the image to be recognized; determining a target character in the initial recognition result whose first confidence is less than or equal to a first preset threshold, where the first confidence is obtained in the process of determining the initial recognition result; performing semantic feature extraction processing on the initial recognition result and predicting the character at a target position in the initial recognition result to obtain a second candidate character set, where the target position is the position of the target character in the initial recognition result; determining a replacement character for the target character based on the intersection of the second candidate character set and a first target candidate character set, where the first target candidate character set is the first candidate character set corresponding to the target character; and determining a target recognition result of the image to be recognized based on the replacement character and the initial recognition result.
Optionally, the recognizing the text in the image to be recognized to obtain an initial recognition result includes: inputting the image to be recognized into a first preset model for text feature recognition processing to obtain the initial recognition result, where the first preset model is trained based on text feature recognition of a plurality of sample images.
Optionally, the determining a target character in the initial recognition result whose first confidence is less than or equal to a first preset threshold includes: acquiring the first confidence of each initial character in the initial recognition result; and determining an initial character whose first confidence is less than or equal to the first preset threshold as the target character.
Optionally, the performing semantic feature extraction processing on the initial recognition result and predicting the character at the target position in the initial recognition result to obtain a second candidate character set includes: inputting the initial recognition result and the target position into a second preset model for semantic feature extraction processing to obtain the second candidate character set, where the second preset model is trained based on text feature recognition of a plurality of sample texts.
Optionally, the determining a replacement character for the target character based on the intersection of the second candidate character set and the first candidate character set corresponding to the target character includes: determining a target confidence of each candidate character in the intersection, where the target confidence of each candidate character is the sum of the first confidence and the second confidence corresponding to that candidate character, and the second confidence is obtained by the second preset model in the process of determining the second candidate character set; and determining a candidate character in the intersection whose target confidence is greater than or equal to a preset confidence as the replacement character for the target character.
Optionally, the determining a target recognition result of the image to be recognized based on the replacement character and the initial recognition result includes: in the case that the replacement character is different from the target character in the initial recognition result, replacing the target character in the initial recognition result with the replacement character and determining the replaced initial recognition result as the target recognition result of the image to be recognized; and in the case that the replacement character is the same as the target character in the initial recognition result, determining the initial recognition result as the target recognition result of the image to be recognized.
Optionally, the performing text feature recognition processing on the image to be recognized to obtain an initial recognition result includes: performing text feature recognition processing on a preset position in the image to be recognized based on the first preset model to obtain a first candidate character set corresponding to the preset position; and determining a character in the first candidate character set corresponding to the preset position whose first confidence is greater than or equal to a second preset threshold as an initial character.
According to a second aspect of the embodiments of the present disclosure, there is provided a text recognition apparatus including a processing unit and a determining unit. The processing unit is configured to perform text feature recognition processing on an image to be recognized to obtain an initial recognition result, where the initial recognition result includes at least one initial character, each initial character has a corresponding first candidate character set, each initial character is a character in its corresponding first candidate character set, and any first candidate character set is obtained by recognizing a preset position in the image to be recognized. The determining unit is configured to determine a target character in the initial recognition result whose first confidence is less than or equal to a first preset threshold, where the first confidence is obtained in the process of determining the initial recognition result. The processing unit is further configured to perform semantic feature extraction processing on the initial recognition result and predict the character at a target position in the initial recognition result to obtain a second candidate character set, where the target position is the position of the target character in the initial recognition result. The determining unit is further configured to determine a replacement character for the target character based on the intersection of the second candidate character set and a first target candidate character set, and to determine a target recognition result of the image to be recognized based on the replacement character and the initial recognition result, where the first target candidate character set is the first candidate character set corresponding to the target character.
Optionally, the processing unit is specifically configured to input the image to be recognized into a first preset model for text feature recognition processing to obtain the initial recognition result, where the first preset model is trained based on text feature recognition of a plurality of sample images.
Optionally, the determining unit is specifically configured to acquire the first confidence of each initial character in the initial recognition result, and to determine an initial character whose first confidence is less than or equal to the first preset threshold as the target character.
Optionally, the processing unit is specifically configured to input the initial recognition result and the target position into a second preset model for semantic feature extraction processing to obtain the second candidate character set, where the second preset model is trained based on text feature recognition of a plurality of sample texts.
Optionally, the determining unit is specifically configured to determine a target confidence of each candidate character in the intersection, where the target confidence of each candidate character is the sum of the first confidence and the second confidence corresponding to that candidate character, and the second confidence is obtained by the second preset model in the process of determining the second candidate character set; and to determine a candidate character in the intersection whose target confidence is greater than or equal to a preset confidence as the replacement character for the target character.
Optionally, the determining unit is specifically configured to: in the case that the replacement character is different from the target character in the initial recognition result, replace the target character in the initial recognition result with the replacement character and determine the replaced initial recognition result as the target recognition result of the image to be recognized; and in the case that the replacement character is the same as the target character in the initial recognition result, determine the initial recognition result as the target recognition result of the image to be recognized.
Optionally, the processing unit is specifically configured to perform text feature recognition processing on a preset position in the image to be recognized based on the first preset model to obtain a first candidate character set corresponding to the preset position, and to determine a character in the first candidate character set corresponding to the preset position whose first confidence is greater than or equal to a second preset threshold as an initial character.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the text recognition method of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the text recognition method of the first aspect as described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the text recognition method as described in the first aspect above.
The technical solution provided by the disclosure has at least the following beneficial effects: the text recognition device recognizes the text in the image to be recognized based on the first preset model to obtain an initial recognition result including at least one initial character. Since each initial character is a character in a first candidate character set, and the first candidate character set is obtained by the first preset model recognizing a given position in the image to be recognized, the first candidate character set is in effect the set of candidates predicted by the first preset model for the character at that position. Further, the text recognition device determines the target character in the initial recognition result whose first confidence is less than or equal to a preset threshold. Since the first confidence is obtained by the first preset model in the process of determining the initial recognition result, the target character is very likely a character that the first preset model recognized inaccurately. The text recognition device performs semantic analysis on the initial recognition result based on the second preset model and only needs to predict the character at the target position in the initial recognition result to obtain a second candidate character set, which is in effect the candidate set obtained after the second preset model corrects the target position. The text recognition device then determines a replacement character for the target character based on the intersection of the second candidate character set and the first candidate character set corresponding to the target character, and determines a target recognition result of the image to be recognized based on the replacement character and the initial recognition result. Compared with the related art, in which relying only on the language model (that is, the second preset model) for correction causes mis-correction, the disclosure combines the suggestion of the recognition model (that is, the first preset model) and corrects only the target character, so the mis-correction rate of the second preset model can be reduced and text recognition accuracy can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 illustrates an image to be recognized according to an exemplary embodiment;
FIG. 2 is a schematic structural diagram of a TROCR model according to an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating the structure of the ABINet model according to an exemplary embodiment;
FIG. 4 is a block diagram illustrating a text recognition system in accordance with an exemplary embodiment;
FIG. 5 is a first flowchart of a text recognition method according to an exemplary embodiment;
FIG. 6 is a schematic structural diagram of a first preset model according to an exemplary embodiment;
FIG. 7 is a schematic structural diagram of a second preset model according to an exemplary embodiment;
FIG. 8 is a diagram illustrating recognition effects according to an exemplary embodiment;
FIG. 9 is a second flowchart of a text recognition method according to an exemplary embodiment;
FIG. 10 is a third flowchart of a text recognition method according to an exemplary embodiment;
FIG. 11 is a schematic structural diagram of a text recognition apparatus according to an exemplary embodiment;
FIG. 12 is a schematic structural diagram of an electronic device according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosure, as detailed in the appended claims.
In addition, in the description of the embodiments of the present disclosure, unless otherwise specified, "/" indicates "or"; for example, A/B may indicate A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate the following three cases: only A exists, both A and B exist, or only B exists. In addition, in the description of the embodiments of the present disclosure, "a plurality of" means two or more.
It should be noted that, the user information (including but not limited to user device information, user personal information, user behavior information, etc.) and data (including but not limited to program code, etc.) referred to in the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
Before explaining the embodiments of the present disclosure in detail, some related arts to which the embodiments of the present disclosure relate will be described.
Image text recognition is essentially optical character recognition (OCR), a technique that uses optical technology and a text recognition model to scan and recognize text and characters in an image and finally converts the text in the image into a text format for further editing and processing by text processing software. The text recognition model relies mainly on visual information to recognize the text in an image. However, when some characters in the image are artistically deformed or blurred, it is difficult for the text recognition model to recognize the correct text.
The language model learns a large amount of semantic knowledge through large-scale pre-training and can therefore correct the recognition result of the text recognition model. For example, the result of recognizing the image shown in FIG. 1 with the text recognition model is "MBA baijia ball", because the artistic characters M and N in FIG. 1 are visually difficult to distinguish, which makes recognition difficult for the text recognition model. After large-scale pre-training, the language model has learned human language expression patterns and word co-occurrence relationships, so it can correct "MBA baijia ball" to "NBA baijia ball".
However, in practical applications the frequency difference between different phrases is large, and relying solely on the language model often causes mis-correction. The language model tends to correct rare phrases in the recognition result into common phrases, thereby "correcting" results that the text recognition model had already recognized correctly and outputting a mis-corrected result. For example, if the image contains the clear text "study", the text recognition model recognizes it as "study", but the language model may correct the result to "learning" because it considers "learning" a more common word, which introduces a mis-correction.
In some related technologies, the text recognition model and the language model are fused into a unified model, such as the TROCR model and the ABINet model. The TROCR model models the language implicitly: the text recognition model and the language model share parameters and are coupled together to recognize text in the image. As shown in FIG. 2, the TROCR model consists of an encoder and a decoder; after an image is input into the TROCR model, the model cuts the image into a plurality of small patches, encodes each patch, and then decodes the encoded result to output the final recognized text. The ABINet model blocks the gradient flow between the visual model and the language model, thereby modeling the language model explicitly. FIG. 3 shows the structure of the ABINet model, which includes a text recognition model part and a language model part, where the recognition result of the text recognition model is used as the input of the language model, realizing end-to-end OCR recognition.
However, whether the TROCR model or the ABINet model is adopted, the correction essentially depends on the language model; that is, both models perform language-model prediction on every recognized character in the image, so some characters that were recognized clearly and correctly may end up being corrected.
The text recognition method provided by the embodiments of the present disclosure is used to solve the above technical problems in the related art. The text recognition method provided by the embodiments of the present disclosure can be applied to a text recognition system; FIG. 4 shows a schematic structural diagram of the text recognition system. As shown in FIG. 4, the text recognition system 10 includes a text recognition device 11 and an electronic device 12. The text recognition device 11 is connected to the electronic device 12, either in a wired manner or in a wireless manner, which is not limited in the embodiments of the present disclosure.
The text recognition device 11 is configured to recognize the text in the image to be recognized based on the first preset model to obtain an initial recognition result, and to determine the target character in the initial recognition result whose first confidence is less than or equal to a preset threshold. The text recognition device 11 is further configured to perform semantic analysis on the initial recognition result based on the second preset model and predict the character at a target position in the initial recognition result to obtain a second candidate character set. The text recognition device 11 is further configured to determine a replacement character for the target character based on the intersection of the second candidate character set and the first candidate character set corresponding to the target character, and to determine a target recognition result of the image to be recognized based on the replacement character and the initial recognition result.
The text recognition device 11 may implement the text recognition method of the embodiments of the present disclosure in various electronic devices 12. For example, the electronic device 12 may be a scanner, a digital camera, or the like.
In different application scenarios, the text recognition device 11 and the electronic device 12 may be independent devices or may be integrated into the same device, which is not specifically limited in the embodiments of the present disclosure.
When the text recognition device 11 and the electronic device 12 are integrated into the same device, the data transmission mode between the text recognition device 11 and the electronic device 12 is data transmission between internal modules of the device. In this case, the data transfer flow between the two is the same as the data transfer flow between the text recognition device 11 and the electronic apparatus 12 when they are independent of each other.
In the following embodiments of the present disclosure, the case where the text recognition device 11 and the electronic device 12 are independent of each other is taken as an example for description.
FIG. 5 is a flowchart illustrating a text recognition method according to some exemplary embodiments. In some embodiments, the text recognition method can be applied to the text recognition device and the electronic device shown in FIG. 4, and can also be applied to other similar devices.
As shown in fig. 5, a text recognition method provided by the embodiment of the present disclosure includes the following steps S201 to S205.
S201, the text recognition device performs text feature recognition processing on the image to be recognized to obtain an initial recognition result.
The initial recognition result includes at least one initial character; each initial character has a corresponding first candidate character set; each initial character is a character in its corresponding first candidate character set; and any first candidate character set is obtained by recognizing a preset position in the image to be recognized.
as a possible implementation manner, the text recognition device performs text feature recognition processing on the image to be recognized by using a preset text recognition algorithm to obtain an initial recognition result.
As another possible implementation manner, the text recognition device inputs the image to be recognized into the first preset model to perform text feature recognition processing, so as to obtain an initial recognition result.
It should be noted that the first preset model is a model deployed in the text recognition device in advance by operation and maintenance personnel and is used for preliminary recognition of the text in the image to be recognized. In practical applications, the first preset model may be any text recognition model, such as a character recognition model with a CRNN structure or a character recognition model with a CNN structure.
Specifically, for a preset position in the image to be recognized, the text recognition device performs text feature recognition processing through the first preset model to obtain a first candidate character set corresponding to the preset position. Further, the text recognition device determines the character in the first candidate character set corresponding to the preset position whose first confidence is greater than or equal to a second preset threshold as an initial character. It should be noted that the second preset threshold is set in the text recognition device in advance by the operation and maintenance personnel.
FIG. 6 is a schematic structural diagram for the case where the first preset model is a character recognition model with a CRNN structure. For an image to be recognized, the first preset model first performs feature extraction with the convolutional neural network (CNN) part, then inputs the extracted features into the recurrent neural network (RNN) part for sequence encoding to obtain an encoded feature matrix, and finally the decoder part decodes the feature matrix to obtain the initial recognition result.
Specifically, the feature matrix is denoted as [Length × Char], where the length of the matrix is proportional to the length of a text line in the image to be recognized, the height Char of the matrix is the total number of characters in the recognition dictionary of the text recognition model (that is, each column corresponds to one character to be recognized in the image to be recognized), and each element of a column represents the first confidence of the corresponding character in the dictionary. The decoder decodes the feature matrix as follows: for each column of the feature matrix [Length × Char], it selects the element with the largest first confidence, takes the index of that element, merges adjacent identical characters according to the connectionist temporal classification (CTC) decoding principle, and outputs the initial recognition result.
Illustratively, take a [3 × 3] feature matrix as an example:
[[0.9,0.8,0.05],
[0.05,0.15,0.15],
[0.05,0.05,0.8]]。
For the above feature matrix, the decoder takes the element with the largest first confidence in each column as the object to be decoded, namely 0.9, 0.8 and 0.8, which appear in the 1st row, the 1st row and the 3rd row respectively, so the initial recognition result before decoding is recorded as [1, 1, 3]. Suppose that in the recognition dictionary of the text recognition model, row 1 corresponds to character A, row 2 corresponds to character B, and row 3 corresponds to character C. The decoder can thus parse [1, 1, 3] into [A, A, C] through the correspondence in the dictionary. Furthermore, the decoder merges the adjacent identical characters [A, A] according to the CTC decoding principle and finally outputs the initial recognition result [A, C], where the first confidence corresponding to character A is 0.85 and the first confidence corresponding to character C is 0.8.
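The greedy column-wise decoding described above can be sketched in a few lines of Python. The dictionary, the averaging of confidences for merged duplicates, and the absence of a CTC blank symbol are simplifying assumptions made for this illustration only.

```python
import numpy as np

def greedy_ctc_decode(feature_matrix, dictionary):
    """Decode a [char x length] confidence matrix by taking the per-column
    argmax and merging adjacent identical characters (simplified CTC, no blank)."""
    matrix = np.asarray(feature_matrix)
    chars, confs = [], []
    for col in range(matrix.shape[1]):
        row = int(matrix[:, col].argmax())        # index of best character
        chars.append(dictionary[row])
        confs.append(float(matrix[row, col]))

    decoded, decoded_confs = [], []
    for ch, conf in zip(chars, confs):
        if decoded and decoded[-1] == ch:
            # merge adjacent duplicates; average their confidences (assumption)
            decoded_confs[-1] = (decoded_confs[-1] + conf) / 2
        else:
            decoded.append(ch)
            decoded_confs.append(conf)
    return decoded, decoded_confs

# The [3 x 3] example above: rows correspond to A, B, C in the dictionary.
matrix = [[0.9, 0.8, 0.05],
          [0.05, 0.15, 0.15],
          [0.05, 0.05, 0.8]]
print(greedy_ctc_decode(matrix, ["A", "B", "C"]))
# -> (['A', 'C'], [0.85, 0.8]), up to floating-point rounding
```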
S202, the text recognition device determines the target character in the initial recognition result whose first confidence is less than or equal to a first preset threshold.
The first confidence is obtained in the process of determining the initial recognition result.
As a possible implementation manner, the text recognition device acquires the first confidence of each initial character in the initial recognition result and determines an initial character whose first confidence is less than or equal to the first preset threshold as the target character.
It should be noted that the first preset threshold is set in the text recognition device in advance by the operation and maintenance personnel. The first preset threshold and the second preset threshold may be the same or different, which is not limited in the embodiments of the present disclosure.
Illustratively, taking the recognition result [A, C] output in the embodiment of S201 above as an example, the text recognition device obtains from the first preset model the first confidence 0.85 of the initial character A and the first confidence 0.8 of the initial character C. If the first preset threshold is 0.81, the text recognition device determines the initial character C as the target character (also called a character not trusted by the first preset model).
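Under the same simplifying assumptions as the sketch above, S202 reduces to a threshold filter over the per-character first confidences; the function and parameter names below are illustrative, not part of the disclosure.

```python
def find_target_positions(initial_chars, first_confs, first_threshold=0.81):
    """Return the positions whose first confidence is less than or equal to
    the first preset threshold, i.e. the target (untrusted) characters."""
    return [pos for pos, conf in enumerate(first_confs)
            if conf <= first_threshold]

# Continuing the [A, C] example: A has confidence 0.85, C has 0.8.
print(find_target_positions(["A", "C"], [0.85, 0.8]))  # [1] -> character C
```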
S203, the text recognition device performs semantic feature extraction processing on the initial recognition result and predicts the character at the target position in the initial recognition result to obtain a second candidate character set.
The target position is the position of the target character in the initial recognition result.
As a possible implementation manner, the text recognition device uses a preset semantic correction algorithm to perform semantic feature extraction processing on the initial recognition result and predict the character at the target position in the initial recognition result, obtaining the second candidate character set.
As another possible implementation manner, the text recognition device inputs the initial recognition result and the target position into a second preset model, performs semantic feature extraction processing on the initial recognition result, and predicts the character at the target position in the initial recognition result to obtain the second candidate character set.
It should be noted that the second preset model is a model deployed in the text recognition device in advance by the operation and maintenance personnel and is used for predicting the character at the target position. In practical applications, the second preset model may be any language model, such as a language model with a GPT-1 structure or a GPT-2 structure.
FIG. 7 shows a schematic structural diagram for the case where the second preset model is a language model with a GPT-2 structure. The GPT-2 model consists of multiple layers of masked self-attention modules and feed-forward neural network modules. The text recognition device inputs the initial recognition result and the target position into the second preset model; the second preset model predicts the target position and obtains the prediction result of the target position and the second candidate character set through the multi-layer feed-forward neural network.
Specifically, the second preset model predicts the target position to obtain the second candidate character set corresponding to the target position, and takes the character with the largest second confidence in the second candidate character set as the predicted character. The second confidence is obtained by the second preset model in the process of determining the prediction result of the target position. For example, if the second candidate character set is (N, C, M) and the corresponding second confidences are 0.9, 0.8 and 0.7 respectively, the prediction result of the second preset model for the target position is N.
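The step can be sketched as follows. The `predict_position` interface of the second preset model is an assumption introduced only for illustration; the disclosure does not prescribe a specific language-model API.

```python
def predict_second_candidates(second_model, initial_text, target_pos, top_k=10):
    """Ask the language model for the top-k candidate characters (with their
    second confidences) at the target position of the initial recognition result."""
    # Assumed API: returns a {character: probability} mapping for the position.
    probs = second_model.predict_position(initial_text, target_pos)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_k])

# For the example in the text, this might return {"N": 0.9, "C": 0.8, "M": 0.7},
# so the predicted character for the target position would be "N".
```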
S204, the text recognition device determines a replacement character of the target character based on the intersection of the second candidate character set and the first target candidate character set.
The first target candidate character set is a first candidate character set corresponding to the target character.
As a possible implementation manner, the text recognition device determines a target confidence of each candidate character in the intersection, where the target confidence of a candidate character is the sum of the first confidence and the second confidence of that candidate character. Further, the text recognition device determines the candidate character with the highest target confidence in the intersection as the replacement character for the target character.
It should be noted that, after the text recognition device determines the target character, the text recognition device decodes the target position again, and retains the recognition result of the first preset model on the target position, so as to obtain a first candidate character set corresponding to the target character.
As another possible implementation manner, the text recognition device randomly selects one candidate character from the intersection of the second candidate character set and the first candidate character set corresponding to the target character, and uses the selected character as the replacement character for the target character.
As yet another possible implementation manner, the text recognition device scores the candidate characters in the first candidate character set and scores the candidate characters in the second candidate character set. Further, the text recognition device selects the candidate character with the highest composite score in the intersection as the replacement character.
Specifically, for the first candidate character set, the text recognition device ranks the candidate characters from high to low according to their first confidences and scores them according to the ranking result; for example, ranks Top 1, 2, 3, ..., 10 of the first candidate character set correspond to scores 15, 12, 8, 7, 6, 5, 4, 3, 2, 1. Similarly, for the second candidate character set, the text recognition device ranks the candidate characters from high to low according to their second confidences and scores them according to the ranking result, with ranks Top 1, 2, 3, ..., 10 of the second candidate character set again corresponding to scores 15, 12, 8, 7, 6, 5, 4, 3, 2, 1. Suppose the intersection consists of candidate character a and candidate character b, where candidate character a scores 15 in the first candidate character set and 5 in the second candidate character set, and candidate character b scores 8 in the first candidate character set and 6 in the second candidate character set; then the composite score of candidate character a is 20 and the composite score of candidate character b is 14. Therefore, the text recognition device determines candidate character a as the replacement character for the target character.
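A sketch of the rank-based scoring variant described above. The score table follows the example; the data structures (confidence dictionaries) are assumptions for illustration.

```python
# Ranks 1..10 map to the scores 15, 12, 8, 7, 6, 5, 4, 3, 2, 1 as in the example.
RANK_SCORES = [15, 12, 8, 7, 6, 5, 4, 3, 2, 1]

def rank_scores(candidate_confs):
    """Map each candidate character to a score based on its confidence rank."""
    ranked = sorted(candidate_confs, key=candidate_confs.get, reverse=True)
    return {ch: RANK_SCORES[i] for i, ch in enumerate(ranked[:len(RANK_SCORES)])}

def pick_replacement_by_rank(first_candidates, second_candidates):
    """Choose the intersection character with the highest combined rank score."""
    s1, s2 = rank_scores(first_candidates), rank_scores(second_candidates)
    common = s1.keys() & s2.keys()
    if not common:
        return None
    return max(common, key=lambda ch: s1[ch] + s2[ch])
```

With the example above, candidate character a scores 15 + 5 = 20 and candidate character b scores 8 + 6 = 14, so the function returns candidate character a; characters outside the top 10 of either set are simply ignored in this sketch.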
S205, the text recognition device determines a target recognition result of the image to be recognized based on the replacement character and the initial recognition result.
As one possible implementation manner, the text recognition device replaces the target character in the initial recognition result with the replacement character and determines the replaced initial recognition result as the target recognition result of the image to be recognized.
Illustratively, the initial recognition result is "MBA baijia ball", where "M" is the target character. The replacement character is "N", so the text recognition device replaces "M" with "N" and outputs "NBA baijia ball".
FIG. 8 shows the initial recognition results and the target recognition results obtained after recognizing some images to be recognized according to the embodiment of the present disclosure. It can be seen that the text recognition method provided by the embodiment of the present disclosure reduces mis-correction of text recognition results in the related art and improves text recognition accuracy.
The technical solution provided by the embodiments of the present disclosure has at least the following beneficial effects: the text recognition device recognizes the text in the image to be recognized based on the first preset model to obtain an initial recognition result including at least one initial character. Since each initial character is a character in a first candidate character set, and the first candidate character set is obtained by the first preset model recognizing a given position in the image to be recognized, the first candidate character set is in effect the set of candidates predicted by the first preset model for the character at that position. Further, the text recognition device determines the target character in the initial recognition result whose first confidence is less than or equal to a preset threshold. Since the first confidence is obtained by the first preset model in the process of determining the initial recognition result, the target character is very likely a character that the first preset model recognized inaccurately. The text recognition device performs semantic analysis on the initial recognition result based on the second preset model and only needs to predict the character at the target position in the initial recognition result to obtain a second candidate character set, which is in effect the candidate set obtained after the second preset model corrects the target position. The text recognition device then determines a replacement character for the target character based on the intersection of the second candidate character set and the first candidate character set corresponding to the target character, and determines a target recognition result of the image to be recognized based on the replacement character and the initial recognition result. Compared with the related art, in which relying only on the language model (that is, the second preset model) for correction causes mis-correction, the disclosure combines the suggestion of the recognition model (that is, the first preset model) and corrects only the target character, so the mis-correction rate of the second preset model can be reduced and text recognition accuracy can be improved.
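The complete flow recapped above can be condensed into the following Python sketch. The model interfaces (`recognize`, `predict_position`) and the threshold value are assumptions introduced for illustration and are not part of the disclosure.

```python
def recognize_with_correction(image, first_model, second_model,
                              first_threshold=0.81):
    """End-to-end sketch of S201-S205: visual recognition, target-character
    detection, language-model prediction, intersection-based replacement."""
    # S201: initial recognition; per position a candidate set {char: confidence}.
    chars, first_candidates, first_confs = first_model.recognize(image)
    result = list(chars)

    for pos, conf in enumerate(first_confs):
        if conf > first_threshold:          # S202: trusted character, keep it.
            continue
        # S203: the language model predicts candidates for the target position only.
        second_candidates = second_model.predict_position("".join(result), pos)
        # S204: intersect both candidate sets and sum the two confidences.
        common = first_candidates[pos].keys() & second_candidates.keys()
        if not common:
            continue
        best = max(common, key=lambda ch: first_candidates[pos][ch]
                                          + second_candidates[ch])
        # S205: replace only if the chosen character differs from the original.
        if best != result[pos]:
            result[pos] = best
    return "".join(result)
```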
In one design, as shown in fig. 9, in order to determine a replacement character of a target character, the foregoing S204 provided in the embodiment of the present disclosure specifically includes the following S2041 to S2042:
S2041, the text recognition device determines the target confidence of each candidate character in the intersection.
The target confidence of a candidate character is the sum of the first confidence and the second confidence of that candidate character; the second confidence is obtained by the second preset model in the process of determining the second candidate character set.
As a possible implementation manner, for any candidate character in the intersection, the text recognition device obtains the first confidence and the second confidence of the candidate character. Further, the text recognition device determines the sum of the first confidence and the second confidence as the target confidence of the candidate character.
S2042, the text recognition device determines the candidate character with the highest target confidence in the intersection as the replacement character for the target character.
As a possible implementation manner, the text recognition device selects the candidate character with the highest target confidence from the intersection as the replacement character for the target character.
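S2041 and S2042 amount to summing the two confidences per intersection character and taking the maximum; a minimal sketch follows, assuming the confidences are available as dictionaries keyed by character, with illustrative (not disclosed) values in the example.

```python
def pick_replacement_by_confidence(first_candidates, second_candidates):
    """Target confidence = first confidence + second confidence (S2041);
    return the intersection character with the highest target confidence (S2042)."""
    common = first_candidates.keys() & second_candidates.keys()
    if not common:
        return None
    return max(common,
               key=lambda ch: first_candidates[ch] + second_candidates[ch])

# Example with assumed confidences for the "M" vs "N" case from the text:
first = {"M": 0.60, "N": 0.55}               # from the recognition model
second = {"N": 0.90, "C": 0.80, "M": 0.30}   # from the language model
print(pick_replacement_by_confidence(first, second))  # "N"
```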
In one design, as shown in fig. 10, in order to determine a target recognition result of an image to be recognized, the foregoing S205 provided in the embodiment of the present disclosure specifically includes the following S2051 to S2053:
S2051, the text recognition device determines whether the replacement character is the same as the target character.
As a possible implementation, the text recognition device compares the replacement character with the target character to determine whether the replacement character is the same as the target character.
S2052, in the case that the replacement character is different from the target character in the initial recognition result, the text recognition device replaces the target character in the initial recognition result with the replacement character and determines the replaced initial recognition result as the target recognition result of the image to be recognized.
Illustratively, the initial recognition result is "MBA baijia ball", where "M" is the target character. The replacement character is "N", that is, the replacement character is different from the target character in the initial recognition result; therefore, the text recognition device replaces "M" with "N" and outputs "NBA baijia ball".
S2053, in the case that the replacement character is the same as the target character in the initial recognition result, the text recognition device determines the initial recognition result as the target recognition result of the image to be recognized.
Illustratively, the initial recognition result is "NBA baijia ball", where "N" is the target character. The replacement character is also "N", that is, the replacement character is the same as the target character in the initial recognition result; therefore, the text recognition device directly outputs "NBA baijia ball".
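The decision in S2051 to S2053 is a straightforward comparison; a sketch is given below, with the position handling assumed for illustration.

```python
def apply_replacement(initial_result, target_pos, replacement):
    """Replace the target character only when the replacement differs from it
    (S2052); otherwise keep the initial recognition result unchanged (S2053)."""
    result = list(initial_result)
    if replacement is not None and result[target_pos] != replacement:
        result[target_pos] = replacement
    return "".join(result)

print(apply_replacement("MBA baijia ball", 0, "N"))  # "NBA baijia ball"
print(apply_replacement("NBA baijia ball", 0, "N"))  # unchanged
```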
The above embodiments mainly describe the solutions provided by the embodiments of the present disclosure from the perspective of the apparatus (device). It is understood that, in order to implement the above method, the apparatus or device includes hardware structures and/or software modules for executing the respective method flows, and these hardware structures and/or software modules may constitute an electronic device. Those skilled in the art will readily appreciate that the present disclosure can be implemented in hardware or a combination of hardware and computer software in connection with the exemplary algorithm steps described in the embodiments disclosed herein. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and design constraints of the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The present disclosure may perform functional module division on the apparatus or device according to the above method examples, for example, the apparatus or device may divide each functional module corresponding to each function, or may integrate two or more functions into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, the division of the modules in the embodiment of the present disclosure is illustrative, and is only one logical function division, and there may be another division manner in actual implementation.
Fig. 11 is a schematic structural diagram illustrating a text recognition apparatus according to an exemplary embodiment. Referring to fig. 11, a text recognition apparatus 30 provided in the embodiment of the present disclosure includes a processing unit 301 and a determining unit 302.
The processing unit 301 is configured to recognize the text in an image to be recognized based on a first preset model to obtain an initial recognition result; the initial recognition result includes at least one initial character; an initial character is a character in a first candidate character set; and the first candidate character set is obtained by the first preset model recognizing a given position in the image to be recognized. The determining unit 302 is configured to determine a target character in the initial recognition result whose first confidence is less than or equal to a preset threshold; the first confidence is obtained by the first preset model in the process of determining the initial recognition result. The processing unit 301 is further configured to perform semantic analysis on the initial recognition result based on a second preset model and predict the character at a target position in the initial recognition result to obtain a second candidate character set; the target position is the position of the target character in the initial recognition result. The determining unit 302 is further configured to determine a replacement character for the target character based on the intersection of the second candidate character set and the first candidate character set corresponding to the target character, and to determine a target recognition result of the image to be recognized based on the replacement character and the initial recognition result.
Optionally, the processing unit 301 is specifically configured to input the image to be recognized into the first preset model to obtain the initial recognition result.
Optionally, the determining unit 302 is specifically configured to acquire the first confidence of each initial character in the initial recognition result and determine an initial character whose first confidence is less than or equal to the preset threshold as the target character.
Optionally, the processing unit 301 is specifically configured to input the initial recognition result and the target position into the second preset model to obtain the second candidate character set.
Optionally, the determining unit 302 is specifically configured to determine a target confidence of each candidate character in the intersection, where the target confidence of a candidate character is the sum of the first confidence and the second confidence of that candidate character, and the second confidence is obtained by the second preset model in the process of determining the second candidate character set; and to determine the candidate character with the highest target confidence in the intersection as the replacement character for the target character.
Optionally, the determining unit 302 is specifically configured to replace the target character in the initial recognition result with the replacement character and determine the replaced initial recognition result as the target recognition result of the image to be recognized.
Optionally, the determining unit 302 is specifically configured to: in the case that the replacement character is different from the target character in the initial recognition result, replace the target character in the initial recognition result with the replacement character and determine the replaced initial recognition result as the target recognition result of the image to be recognized; and in the case that the replacement character is the same as the target character in the initial recognition result, determine the initial recognition result as the target recognition result of the image to be recognized.
Fig. 12 is a schematic structural diagram of an electronic device provided by the present disclosure. As shown in fig. 12, the electronic device 40 may include at least one processor 401 and a memory 402 for storing processor-executable instructions, wherein the processor 401 is configured to execute the instructions in the memory 402 to implement the text recognition method in the above-described embodiments.
In addition, the electronic device 40 may also include a communication bus 403 and at least one communication interface 404.
The communication bus 403 may include a path that transfers information between the above components.
The communication interface 404 may be any device, such as a transceiver, for communicating with other devices or communication networks, such as an ethernet, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), etc.
The memory 402 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that may store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disk read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be self-contained and coupled to the processor via a bus. The memory may also be integral to the processor.
The memory 402 is used for storing instructions for executing the disclosed solution, and is controlled to be executed by the processor 401. Processor 401 is configured to execute instructions stored in memory 402 to implement the functions of the disclosed text recognition method.
As an example, in conjunction with fig. 11, the processing unit 301 and the determination unit 302 in the text recognition apparatus 30 implement the same functions as the processor 401 in fig. 12.
In particular implementations, processor 401 may include one or more CPUs, such as CPU0 and CPU1 in fig. 12, as one embodiment.
In particular implementations, electronic device 40 may include multiple processors, such as processor 401 and processor 407 in FIG. 12, for example, as an embodiment. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In particular implementations, electronic device 40 may also include an output device 405 and an input device 406, as one embodiment. An output device 405 is in communication with the processor 401 and may display information in a variety of ways. For example, the output device 405 may be a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display device, a Cathode Ray Tube (CRT) display device, a projector (projector), or the like. The input device 406 is in communication with the processor 401 and can accept input from a user object in a variety of ways. For example, the input device 406 may be a mouse, a keyboard, a touch screen device, or a sensing device, among others.
Those skilled in the art will appreciate that the configuration shown in fig. 12 is not intended to be limiting of the electronic device 40 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
In addition, the present disclosure also provides a computer-readable storage medium, wherein when the instructions in the computer-readable storage medium are executed by a processor of the electronic device, the electronic device is enabled to execute the text recognition method provided in the above embodiment.
In addition, the present disclosure also provides a computer program product comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the text recognition method as provided in the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
Claims (10)
1. A method of text recognition, the method comprising:
performing text feature recognition processing on an image to be recognized to obtain an initial recognition result; the initial recognition result comprises at least one initial character; each initial character has a corresponding first candidate character set; each initial character is a character in a first candidate character set corresponding to each initial character, and any first candidate character set is obtained by identifying a preset position in the image to be identified;
determining a target character with a first confidence coefficient smaller than or equal to a first preset threshold value in the initial recognition result; the first confidence coefficient is obtained in the process of determining the initial recognition result;
performing semantic feature extraction processing on the initial recognition result, and predicting a character at a target position in the initial recognition result to obtain a second candidate character set; the target position is the position of the target character in the initial recognition result;
determining a replacement character of the target character based on the intersection of the second candidate character set and the first target candidate character set; the first target candidate character set is a first candidate character set corresponding to the target character;
and determining a target recognition result of the image to be recognized based on the replacement character and the initial recognition result.
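Purely for illustration, and forming no part of the claims, the overall flow of claim 1 can be sketched in Python as follows. The callables recognize_text_features and predict_by_semantics, the threshold values, and the data shapes are assumptions standing in for the first and second preset models described in the embodiments, not interfaces defined by this disclosure.

```python
# Illustrative sketch only: the two callables stand in for the first and
# second preset models; their names, signatures, and the default thresholds
# are assumptions for this example.

def correct_text(image, recognize_text_features, predict_by_semantics,
                 first_threshold=0.9, preset_confidence=1.2):
    # Step 1: text feature recognition on the image to be recognized.
    # initial_chars: list of (character, first_confidence)
    # first_candidate_sets: list of dicts {candidate_char: first_confidence}
    initial_chars, first_candidate_sets = recognize_text_features(image)

    result = [char for char, _ in initial_chars]
    for pos, (char, first_conf) in enumerate(initial_chars):
        # Step 2: only low-confidence characters become target characters.
        if first_conf > first_threshold:
            continue
        # Step 3: semantic prediction of the character at the target position.
        second_candidates = predict_by_semantics(result, pos)  # {char: second_confidence}
        # Step 4: choose a replacement from the intersection of the two
        # candidate sets, scoring each candidate by its summed confidence.
        best_char, best_score = char, float("-inf")
        for cand in set(first_candidate_sets[pos]) & set(second_candidates):
            score = first_candidate_sets[pos][cand] + second_candidates[cand]
            if score >= preset_confidence and score > best_score:
                best_char, best_score = cand, score
        # Step 5: keep the original character unless a qualifying replacement exists.
        result[pos] = best_char
    return "".join(result)
```

In this sketch a position is only revisited when its first confidence is low, which is what keeps the flow from over-correcting characters the recognizer is already sure about.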
2. The method of claim 1, wherein the performing text feature recognition processing on the image to be recognized to obtain the initial recognition result comprises:
inputting the image to be recognized into a first preset model to perform text feature recognition processing, and obtaining the initial recognition result; the first preset model is trained based on text feature recognition of a plurality of sample images.
3. The method of claim 1, wherein the determining the target character with the first confidence coefficient smaller than or equal to the first preset threshold in the initial recognition result comprises:
acquiring the first confidence coefficient of each initial character in the initial recognition result;
and determining the initial character with the first confidence coefficient smaller than or equal to the first preset threshold value as the target character.
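As a minimal sketch of the selection step of claim 3 (the data shapes and the threshold value are assumptions for illustration only):

```python
# Hypothetical sketch: pick as target characters the initial characters whose
# first confidence coefficient does not exceed the first preset threshold.

def find_target_positions(initial_chars, first_threshold=0.9):
    """initial_chars: list of (character, first_confidence) pairs."""
    return [pos for pos, (_, conf) in enumerate(initial_chars)
            if conf <= first_threshold]

# Only position 1 falls at or below the threshold, so it becomes a target.
assert find_target_positions([("天", 0.98), ("汽", 0.42), ("好", 0.95)]) == [1]
```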
4. The text recognition method of claim 1, wherein the performing semantic feature extraction processing on the initial recognition result to predict the character at the target position in the initial recognition result to obtain the second candidate character set comprises:
inputting the initial recognition result and the target position into a second preset model to perform semantic feature extraction processing to obtain a second candidate character set; the second preset model is trained based on text feature recognition of a plurality of sample texts.
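The disclosure does not name a concrete second preset model. Purely as an assumption for illustration, a masked language model queried at the target position behaves in the way claim 4 describes: it reads the initial recognition result as context and proposes candidate characters for the masked position. The sketch below uses the Hugging Face transformers library with a placeholder Chinese BERT checkpoint; none of these choices are mandated by the disclosure, and in practice the model would be loaded once rather than on every call.

```python
# Hedged sketch: a masked language model used as the semantic predictor.
# The checkpoint name is a placeholder, not a model specified by the patent.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def predict_by_semantics(result_chars, target_pos, top_k=5,
                         model_name="bert-base-chinese"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    # Mask the character at the target position and let the language model
    # propose candidates for it from the surrounding context.
    chars = list(result_chars)
    chars[target_pos] = tokenizer.mask_token
    inputs = tokenizer("".join(chars), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    probs = logits[0, mask_index].softmax(dim=-1).squeeze(0)
    top = probs.topk(top_k)
    # Return a second candidate character set with second confidences.
    return {tokenizer.decode([i]).strip(): float(p)
            for i, p in zip(top.indices.tolist(), top.values.tolist())}
```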
5. The text recognition method of any one of claims 1-4, wherein the determining the replacement character of the target character based on the intersection of the second candidate character set and the first target candidate character set comprises:
determining a target confidence of each candidate character in the intersection; the target confidence coefficient of each candidate character is the sum of the first confidence coefficient and the second confidence coefficient corresponding to each candidate character; the second confidence coefficient is obtained by the second preset model in the process of determining the second candidate character set;
and determining the candidate character with the target confidence coefficient larger than or equal to a preset confidence coefficient in the intersection as a replacement character of the target character.
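As an illustrative sketch of claim 5; the preference for the highest target confidence, the fallback to the original character when no candidate qualifies, and the numeric values are all assumptions for the example rather than requirements of the claim:

```python
# Hypothetical sketch of claim 5: the target confidence of each candidate in
# the intersection is the sum of its first and second confidences, and a
# candidate replaces the target character only if that sum reaches the
# preset confidence.

def choose_replacement(first_candidates, second_candidates, target_char,
                       preset_confidence=1.2):
    """Both candidate arguments map candidate characters to confidences."""
    best_char, best_score = target_char, float("-inf")
    for cand in set(first_candidates) & set(second_candidates):
        target_conf = first_candidates[cand] + second_candidates[cand]
        if target_conf >= preset_confidence and target_conf > best_score:
            best_char, best_score = cand, target_conf
    return best_char

# Example: "气" is chosen because 0.35 + 0.90 exceeds the preset confidence.
print(choose_replacement({"汽": 0.42, "气": 0.35}, {"气": 0.90, "起": 0.05}, "汽"))
```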
6. The text recognition method according to any one of claims 1 to 4, wherein the determining the target recognition result of the image to be recognized based on the replacement character and the initial recognition result comprises:
under the condition that the replacement character is different from the target character in the initial recognition result, replacing the target character in the initial recognition result with the replacement character, and determining the replaced initial recognition result as the target recognition result of the image to be recognized;
and under the condition that the replacement character is the same as the target character in the initial recognition result, determining the initial recognition result as the target recognition result of the image to be recognized.
7. The text recognition method according to any one of claims 2 to 4, wherein performing text feature recognition processing on the image to be recognized to obtain an initial recognition result comprises:
performing text feature recognition processing on the preset position in the image to be recognized based on the first preset model to obtain a first candidate character set corresponding to the preset position;
and determining the characters with the first confidence coefficients larger than or equal to a second preset threshold value in the first candidate character set corresponding to the preset position as the initial characters.
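And as a sketch of claim 7's selection of the initial character at one preset position; taking the highest-scoring qualifying candidate and returning None when no candidate reaches the second preset threshold are assumptions made only for this illustration:

```python
# Hypothetical sketch of claim 7: the recognizer produces a first candidate
# character set for a preset position; the candidate whose first confidence
# reaches the second preset threshold is kept as the initial character.

def pick_initial_character(first_candidate_set, second_threshold=0.3):
    """first_candidate_set maps candidate characters to first confidences."""
    char, conf = max(first_candidate_set.items(), key=lambda kv: kv[1])
    return char if conf >= second_threshold else None

# Example: "汽" is kept as the initial character for this position.
assert pick_initial_character({"汽": 0.42, "气": 0.35, "迄": 0.10}) == "汽"
```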
8. A text recognition apparatus, characterized by comprising a processing unit and a determining unit;
the processing unit is configured to execute text feature recognition processing on an image to be recognized to obtain an initial recognition result; the initial recognition result comprises at least one initial character; each initial character has a corresponding first candidate character set; each initial character is a character in a first candidate character set corresponding to each initial character, and any first candidate character set is obtained by identifying a preset position in the image to be identified;
the determining unit is configured to determine a target character of which the first confidence coefficient is smaller than or equal to a first preset threshold value in the initial recognition result; the first confidence coefficient is obtained in the process of determining the initial recognition result;
the processing unit is further configured to perform semantic feature extraction processing on the initial recognition result, predict characters at a target position in the initial recognition result, and obtain a second candidate character set; the target position is the position of the target character in the initial recognition result;
the determining unit is further configured to determine a replacement character of the target character based on an intersection of the second candidate character set and the first target candidate character set, and determine a target recognition result of the image to be recognized based on the replacement character and the initial recognition result; the first target candidate character set is a first candidate character set corresponding to the target character.
9. An electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the text recognition method of any one of claims 1-7.
10. A computer-readable storage medium having instructions stored thereon, wherein the instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the text recognition method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211158135.1A CN115641598A (en) | 2022-09-22 | 2022-09-22 | Text recognition method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115641598A true CN115641598A (en) | 2023-01-24 |
Family
ID=84942308
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211158135.1A Pending CN115641598A (en) | 2022-09-22 | 2022-09-22 | Text recognition method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115641598A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||