CN112507080A

CN112507080A - Character recognition and correction method

Info

Publication number: CN112507080A
Application number: CN202011482957.6A
Authority: CN
Inventors: 吕学强; 游新冬; 董志安
Original assignee: Beijing Information Science and Technology University
Current assignee: Beijing Information Science and Technology University
Priority date: 2020-12-16
Filing date: 2020-12-16
Publication date: 2021-03-16

Abstract

The application discloses a character recognition and correction method, which comprises the following steps: constructing a professional lexicon; constructing a recognition result region matrix; and performing correction. The method introduces a language model and predicts the recognition result that best matches the lexicon according to statistical conditional probabilities. It then performs forward and backward correction using the correspondence between test items, which further improves recognition accuracy. Finally, it matches the best recognition result using a method that fuses the edit distance and the longest common subsequence. Recognition accuracy is thereby improved, and the requirements of practical applications can be well met.

Description

Character recognition and correction method

Technical Field

The application relates to the technical field of computer vision, in particular to a character recognition and correction method.

Background

In recent years, with the continuous development of deep learning and artificial intelligence, computer vision has become a popular research direction and has attracted extensive attention from academia and industry. With its powerful ability to interpret images, computer vision provides technical support for many industries. In the medical field, the construction of intelligent healthcare has been proposed and has made breakthroughs in recent years. Laboratory tests are indispensable for patients seeking medical treatment, and they generate a large number of medical laboratory test sheets, which greatly increases doctors' workload. Since 2005, the open-source Tesseract-OCR engine, continuously maintained by Google, has achieved excellent performance in character recognition, which set off a wave of artificial intelligence research across academia and industry, and a variety of character recognition algorithms have emerged. In the medical field, OCR (Optical Character Recognition) technology can recognize the characters on a laboratory test sheet and, combined with a medical information system, artificial intelligence and big data, produce a preliminary interpretation of the sheet. This allows patients to receive timely treatment, reduces staff workload, and greatly improves diagnostic efficiency. However, the existing post-processing methods for laboratory test sheet character recognition in natural scenes all have certain defects.
For example: the edit distance algorithm performs poorly on short sequences or when characters are missing from or added to the recognition result; the longest common subsequence algorithm can cope with missing or added characters, but when several sequences share the same common subsequence, the correction becomes ambiguous; a language model alone yields multiple prediction results, and an optimal combination path has to be found by constructing a recognition matrix; and so on. These defects result in poor laboratory test sheet recognition in natural scenes.

Disclosure of Invention

The application aims to provide a method for character recognition and correction. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.

According to an aspect of an embodiment of the present application, there is provided a method for correcting character recognition, including:

constructing a professional lexicon;

constructing a recognition result region matrix;

and performing correction.

Further, the correcting comprises:

correcting based on the language model;

performing correction based on the edit distance and the longest common subsequence;

and correcting based on the corresponding relation.

Further, the correcting based on the language model comprises:

the language model counts the occurrence probability of characters through a probability distribution, and the maximum conditional probability is calculated from the statistical results; for the recognition result S₁ of the first detection region, the first three candidate characters given by the CRNN network are selected, and the probability W(S₁) of each candidate is renormalized according to the probability predicted by the network; for the recognition result S₂ of the second detection region, the first three candidate characters given by the CRNN network are selected, and the probability W(S₂) of each candidate is renormalized according to the probability predicted by the network; for the recognition result S₃ of the third detection region, the first three candidate characters given by the CRNN network are selected, and the probability W(S₃) of each candidate is renormalized according to the probability predicted by the network;

based on probabilistic analysis, the conditional probability P(S₂|S₁) is the probability that S₂ follows when S₁ has occurred;

f＝W(S₁)P(S₂|S₁)W(S₂)P(S₃|S₂)W(S₃)

the maximum value of f is the optimal combination mode;

for the predicted sequence S₁，S₂，S₃...S_n, the maximum value of f needs to be calculated, where W(S_i) is renormalized according to the CRNN prediction probability, and the conditional probability P(S_i+1|S_i) is obtained from the lexicon by counting the number of occurrences N(S_i) of S_i and the number of times N(S_i，S_i+1) that S_i and S_i+1 occur consecutively,

f＝W(S₁)P(S₂|S₁)W(S₂)...W(S_n-1)P(S_n|S_n-1)W(S_n)

The conditional probability formula is

P(S_i+1|S_i)＝N(S_i，S_i+1)/N(S_i)

and the optimal combination path problem is solved to obtain the optimal solution.

Further, the correcting based on the edit distance and the longest common subsequence comprises:

performing a weighted summation of the results calculated by the edit distance method and the longest common subsequence method.

Further, the correcting based on the correspondence comprises:

correcting the recognition results of the items that correspond to an already recognized item, according to the correspondence between the items and the recognized item.

Further, constructing the recognition result region matrix comprises: taking the top N recognition results of each region to construct the recognition result region matrix, where N is a positive integer.

According to another aspect of the embodiments of the present application, there is provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method for character recognition and correction described above.

According to another aspect of the embodiments of the present application, there is provided a computer-readable storage medium having a computer program stored thereon, where the computer program is executed by a processor to implement the method for character recognition and correction described above.

The technical scheme provided by one aspect of the embodiment of the application can have the following beneficial effects:

The character recognition and correction method provided by the embodiment of the application introduces a language model and predicts the recognition result that best matches the lexicon according to statistical conditional probabilities. It then performs forward and backward correction using the correspondence between test items, which further improves recognition accuracy. Finally, it matches the best recognition result using a method that fuses the edit distance and the longest common subsequence. Recognition accuracy is thereby improved, and the requirements of practical applications can be well met.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flow chart of a method for character recognition correction according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a recognition result region matrix in an embodiment of the present application;

FIG. 3 is a diagram illustrating a dynamic transfer process in one embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

For character recognition of laboratory test sheets in natural scenes, an embodiment of the application provides a character recognition and correction method. First, a language model is used to obtain a predicted sequence that better matches the actual situation. The sequence is then matched exactly against the constructed medical lexicon. If the match succeeds, forward and backward correction is carried out according to the correspondence between the test items of the laboratory test sheet. If the match fails, the closest term in the medical lexicon is selected for correction using the improved method that fuses the edit distance and the longest common subsequence, and further correction is then carried out according to the correspondence, thereby improving recognition accuracy.

The language-model-fused character recognition and correction method for laboratory test sheets (or other text materials) can effectively resolve recognition errors involving visually similar characters and errors in individual characters of medical terms. The overall flow chart is shown in figure 1. Recognition correction for a laboratory test sheet in a natural scene consists of two parts. The first is preprocessing, in which the medical lexicon and the recognition matrix are constructed. The second is OCR post-processing correction, which mainly comprises correction based on the language model, correction based on correspondence, and correction based on the edit distance and the longest common subsequence.

In the language-model-fused character recognition and correction method for laboratory test sheets, the inputs are the character recognition results and the constructed medical lexicon. First, frequency statistics are computed with the Number function over the recognized region matrix, the probability of each combination path in the region matrix is calculated with the statistical language model, and the path with the maximum probability is selected with the Max function to obtain the recognized sequence. Second, the Relationship function is used to match and correct the sequence against the correspondence between test items. If the match is unsuccessful, a third step is performed: the fused edit distance and longest common subsequence algorithm finds the most similar match, and the recognized sequence is corrected using the matched medical lexicon entry. Finally, the corrected sequence is output.

Recognition result correction preprocessing

First, a professional lexicon is constructed from real data to facilitate the later statistics and correction. Then the top 3 recognition results of each region are taken to construct a recognition result region matrix, as shown in fig. 2. Traditional character recognition keeps only the highest-scoring prediction for each region and obtains the final result by combining the regions. Its biggest problem is that a recognition error in a single region makes the whole result wrong. Building a region matrix from the three most probable predictions of each region lays the foundation for introducing the language model later. The top 3 may be generalized to the top N, where N is a positive integer. The medical lexicon is one kind of professional lexicon. This embodiment describes the method using a medical lexicon as an example; the method may also be applied to character recognition and correction in other industries or fields, in which case a corresponding professional lexicon needs to be constructed.
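The construction of the region matrix can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name `build_region_matrix` and the input format (one dictionary of candidate characters and scores per region) are assumptions, and the renormalization of the kept probabilities anticipates the W(S_i) renormalization described for the language model step.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, float]  # (candidate character, renormalized probability)


def build_region_matrix(
    region_probs: List[Dict[str, float]], n: int = 3
) -> List[List[Candidate]]:
    """Keep the top-n candidates of every region and renormalize them.

    region_probs holds, for each detected region, a hypothetical mapping
    from candidate character to the probability predicted by the recognizer.
    """
    matrix: List[List[Candidate]] = []
    for probs in region_probs:
        # Sort candidates by predicted probability, highest first.
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:n]
        total = sum(p for _, p in top)
        # Renormalize so the kept candidates of a region sum to 1.
        row = [(ch, p / total if total > 0 else 0.0) for ch, p in top]
        matrix.append(row)
    return matrix
```

Each row of the returned matrix corresponds to one region, so N candidates over three regions give N × N × N possible combinations.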

Recognition result correction based on language model

A language model models the probability of sentences: a corpus is collected and analyzed statistically to obtain the regularities of characters in the language. The language model counts the occurrence probability of characters through a probability distribution, so that the recognition result can be given an effective preliminary correction. Specifically, the maximum conditional probability is calculated from the statistical results. Take "aldosterone" as an example. For the recognition result S₁ of the first detection region, the first three candidate characters given by the CRNN network, 'aldehyde', 'phenol' and 'ether', are selected, and the probability W(S₁) of each candidate is renormalized according to the probability predicted by the network. For the recognition result S₂ of the second detection region, the first three candidate characters given by the CRNN network, 'hui', 'gu' and 'tian', are selected, and the probability W(S₂) of each candidate is renormalized in the same way. For the recognition result S₃ of the third detection region, the first three candidate characters given by the CRNN network, 'Tung', 'Cu' and 'Ken', are selected, and the probability W(S₃) of each candidate is renormalized in the same way. The recognition result region matrix therefore contains 3 × 3 × 3 = 27 combinations. Based on probabilistic analysis, the conditional probability P(S₂|S₁) is the probability that S₂ follows when S₁ has occurred. As shown in formula (1), the combination that maximizes f is the optimal combination, and this use of statistics further improves recognition accuracy.

f＝W(S₁)P(S₂|S₁)W(S₂)P(S₃|S₂)W(S₃) (1)

Similarly, for the predicted sequence S₁，S₂，S₃...S_n, the maximum value of formula (2) needs to be calculated, where W(S_i) is renormalized according to the CRNN prediction probability, and the conditional probability P(S_i+1|S_i) is obtained from the medical lexicon built in the preprocessing part by counting the number of occurrences N(S_i) of S_i and the number of times N(S_i，S_i+1) that S_i and S_i+1 occur consecutively. The conditional probability formula is shown in (3).

f＝W(S₁)P(S₂|S₁)W(S₂)...W(S_n-1)P(S_n|S_n-1)W(S_n) (2)

P(S_i+1|S_i)＝N(S_i，S_i+1)/N(S_i) (3)
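The counting that underlies the conditional probability can be sketched as follows, assuming the lexicon is available as a list of strings and that characters are the counted units. The names `count_lexicon` and `conditional_probability` are hypothetical, and returning 0 for an unseen character is an assumption of this sketch (the application does not describe smoothing).

```python
from collections import Counter
from typing import Iterable, Tuple


def count_lexicon(lexicon: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count N(S_i) for single characters and N(S_i, S_i+1) for adjacent pairs."""
    unigram: Counter = Counter()
    bigram: Counter = Counter()
    for term in lexicon:
        for i, ch in enumerate(term):
            unigram[ch] += 1
            if i + 1 < len(term):
                bigram[(ch, term[i + 1])] += 1
    return unigram, bigram


def conditional_probability(
    prev: str, nxt: str, unigram: Counter, bigram: Counter
) -> float:
    """Estimate P(nxt | prev) as N(prev, nxt) / N(prev)."""
    if unigram[prev] == 0:
        return 0.0  # unseen character: no evidence in the lexicon
    return bigram[(prev, nxt)] / unigram[prev]
```

For example, with the toy lexicon `["ab", "ac", "ab"]`, the character `a` occurs three times and is followed by `b` twice, so the estimate of P(b|a) is 2/3.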

Although the conditional probabilities and the prediction probabilities are known, the number of combinations grows exponentially, and it is difficult to obtain the final result in a reasonable time. The Viterbi algorithm is therefore introduced to solve the optimal combination path problem. Finding the optimal path combination can be abstracted as a dynamic programming problem, for which a dynamic transfer equation is first constructed. If the optimal path passes through a node S_i, then the part of that path from the initial node to S_i must itself be the optimal path to S_i. Because each node only affects the conditional probabilities with the nodes immediately before and after it on the path, the problem can be divided into sub-problems: the optimal path through S_i only requires the optimal paths to all candidate points of S_i-1. The dynamic transfer equation is shown in equation (4):

dp[i，j]＝max(dp[i-1，k]+value(k，j)) (4)

In the dynamic transfer equation, k indexes the candidate predictions of the previous region, and value(k, j) represents the conditional probability of moving from the k-th point of the (i-1)-th region to the j-th point of the i-th region. As shown in fig. 3, some line segments in the figure are marked with the letter a, some are marked with the letter b, and the remaining line segments are unmarked. The unmarked line segments represent the brute-force solving process, and the line segments marked with the letter a represent the dynamic programming solving process, in which each node only needs to keep its one optimal arrival path. Therefore, only n calculations are needed in each step, the solutions of the sub-problems are obtained one after another by recursion, and the optimal solution of the whole problem is finally reached step by step.
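The dynamic programming search can be sketched as follows. The sketch works in log space so that the additive form of equation (4) matches the product in formula (2); here the step score combines the log conditional probability with the log of W for the arriving candidate, which is an assumption about how value(k, j) is composed. The function name `viterbi_decode`, the `cond_prob` callback and the `floor` used to avoid log(0) are all hypothetical, and every region is assumed to have at least one candidate.

```python
import math
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (candidate character, renormalized probability W)


def viterbi_decode(
    matrix: List[List[Candidate]],
    cond_prob: Callable[[str, str], float],
    floor: float = 1e-12,
) -> Tuple[str, float]:
    """Return the highest-scoring character sequence and its log-score."""
    if not matrix:
        return "", 0.0

    def log(p: float) -> float:
        # Clamp to a small floor so unseen transitions do not raise an error.
        return math.log(max(p, floor))

    # dp[j]: best log-score of any path ending at candidate j of the current region.
    dp = [log(w) for _, w in matrix[0]]
    backpointers: List[List[int]] = []

    for i in range(1, len(matrix)):
        prev_row = matrix[i - 1]
        new_dp: List[float] = []
        pointers: List[int] = []
        for ch, w in matrix[i]:
            # Score of arriving at this candidate from each candidate k of region i-1.
            scores = [
                dp[k] + log(cond_prob(prev_ch, ch))
                for k, (prev_ch, _) in enumerate(prev_row)
            ]
            best_k = max(range(len(scores)), key=scores.__getitem__)
            new_dp.append(scores[best_k] + log(w))
            pointers.append(best_k)
        dp = new_dp
        backpointers.append(pointers)

    # Trace the best path back from the best final candidate.
    last = max(range(len(dp)), key=dp.__getitem__)
    best_score = dp[last]
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    text = "".join(matrix[i][j][0] for i, j in enumerate(path))
    return text, best_score
```

With N candidates per region, each step compares N predecessors for each of N candidates, so the cost grows linearly with the number of regions instead of exponentially as in brute-force enumeration.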

Recognition result correction fusing edit distance and longest common subsequence

The Levenshtein distance, also called the edit distance, is a common measure of the similarity between two sequences. The algorithm converts one sequence into the other through insertion, deletion and replacement operations and counts the minimum number of steps required. In this embodiment, the edit distance between the OCR recognition result and each entry of the constructed medical lexicon is calculated, so that the medical term in the lexicon closest to the recognition result can be found. The edit distance algorithm applies the idea of dynamic programming. The recognition result and a medical term from the medical lexicon are input as sequences s and e respectively, and dp[i, j] represents the edit distance between the first i elements of sequence s and the first j elements of sequence e. The dynamic transfer equation is shown in formula (5). If the i-th element of s is equal to the j-th element of e, no edit operation is needed and dp[i, j] = dp[i-1, j-1]. If the i-th element of s is not equal to the j-th element of e, three operations are possible: the first deletes the i-th element of s, the second deletes the j-th element of e, and the third replaces one element so that the i-th element equals the j-th element. The minimum over the three options is taken, plus one for the operation performed.

dp[i，j]＝dp[i-1，j-1], if s_i＝e_j; dp[i，j]＝min(dp[i-1，j]，dp[i，j-1]，dp[i-1，j-1])+1, otherwise (5)
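The recurrence described above can be sketched as a standard table-filling routine. The function name `edit_distance` is chosen for illustration; the table dp follows the indexing in the text, with row and column 0 standing for the empty prefix.

```python
def edit_distance(s: str, e: str) -> int:
    """Levenshtein distance between s and e via dynamic programming."""
    n, m = len(s), len(e)
    # dp[i][j]: edit distance between the first i elements of s
    # and the first j elements of e.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i  # delete all i elements of s
    for j in range(m + 1):
        dp[0][j] = j  # delete all j elements of e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == e[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # elements match, no operation
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j],      # delete the i-th element of s
                    dp[i][j - 1],      # delete the j-th element of e
                    dp[i - 1][j - 1],  # replace one element with the other
                )
    return dp[n][m]
```

For instance, `edit_distance("kitten", "sitting")` evaluates to 3 (two replacements and one insertion).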

The longest common subsequence algorithm also uses dynamic programming. It is mainly used to measure the similarity between two sequences and can also be used to correct character recognition. The input sequences x and y represent the recognition result and a medical term from the medical lexicon respectively, and C[i, j] represents the length of the longest common subsequence of the first i elements of x and the first j elements of y. The dynamic transfer equation is shown in formula (6). If the i-th element of x and the j-th element of y are equal, the length of the longest common subsequence increases by one and C[i, j] = C[i-1, j-1] + 1. If they are not equal, C[i, j] takes the maximum of C[i-1, j] and C[i, j-1], which corresponds to deleting either the i-th element of x or the j-th element of y and keeping whichever deletion leaves the longer common subsequence.

C[i，j]＝C[i-1，j-1]+1, if x_i＝y_j; C[i，j]＝max(C[i-1，j]，C[i，j-1]), otherwise (6)
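The longest common subsequence recurrence can be sketched in the same table-filling style. The function name `lcs_length` is chosen for illustration, and only the length is computed because that is the quantity the later similarity formula uses.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y."""
    n, m = len(x), len(y)
    # c[i][j]: LCS length of the first i elements of x and the first j of y.
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1  # extend the common subsequence
            else:
                # Drop the i-th element of x or the j-th element of y,
                # whichever keeps the longer common subsequence.
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[n][m]
```

For instance, `lcs_length("ABCBDAB", "BDCABA")` evaluates to 4.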

When characters are missing or extra characters are present, the edit distance method has certain limitations, while the longest common subsequence algorithm becomes ambiguous for strings of different lengths that share a common subsequence. This embodiment therefore fuses the edit distance method and the longest common subsequence method. For a recognized sequence of length n and a medical term of length m from the medical lexicon, Q[n, m] represents the similarity of the two sequences. Based on the distribution of the data found through statistical analysis, the two methods are combined with the proportional coefficient α set to 0.3 and the proportional coefficient β set to 0.7. The solving formula is shown in formula (7).

Q[n，m]＝α·dp[n，m]+β·(min(n，m)-C[n，m]) (7)
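Formula (7) can be sketched as follows with α = 0.3 and β = 0.7. Both terms shrink as the sequences become more alike, so this sketch treats the lexicon entry with the smallest Q as the closest match; that selection rule, the function names `fused_score` and `closest_term`, and the compact rolling-row helpers are assumptions for illustration.

```python
from typing import Iterable


def edit_distance(s: str, e: str) -> int:
    """Levenshtein distance, keeping only the previous row of the table."""
    prev = list(range(len(e) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(e, 1):
            same = prev[j - 1]
            cur.append(same if a == b else 1 + min(prev[j], cur[j - 1], same))
        prev = cur
    return prev[-1]


def lcs_length(x: str, y: str) -> int:
    """Longest common subsequence length, keeping only the previous row."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def fused_score(
    recognized: str, term: str, alpha: float = 0.3, beta: float = 0.7
) -> float:
    """Q[n, m] = alpha * dp[n, m] + beta * (min(n, m) - C[n, m])."""
    n, m = len(recognized), len(term)
    unmatched = min(n, m) - lcs_length(recognized, term)
    return alpha * edit_distance(recognized, term) + beta * unmatched


def closest_term(recognized: str, lexicon: Iterable[str]) -> str:
    """Pick the lexicon entry with the smallest fused score (lexicon must be non-empty)."""
    return min(lexicon, key=lambda term: fused_score(recognized, term))
```

As a small check, for the recognized string `"abd"` and the lexicon entry `"abc"`, the edit distance is 1 and the longest common subsequence has length 2, so Q = 0.3 × 1 + 0.7 × (3 − 2) = 1.0.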

Recognition result correction based on correspondence

According to actual test sheet data, each test sheet contains at most seven columns: serial number, item name, abbreviation, test result, unit, reference value range and prompt. The item name, abbreviation, unit and reference value range have a fixed correspondence, so if the item name and abbreviation of a test item are correctly recognized, the recognition results of the other columns can be corrected forward and backward according to that item name and abbreviation. For test sheets that place several test items in one row, the two test items in the row are detected and split according to the recognized column names and test item text, and each is corrected separately.

With this method, the accuracy, recall and F1 value of the recognition results are high, and text recognition with the added post-processing is more accurate. The post-processing can correct some errors involving visually similar characters and can accurately correct some medical terms. It also further normalizes the recognition results, which facilitates subsequent processing. Qualitative analysis shows that introducing post-processing improves the accuracy of text recognition.
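Correction through the fixed correspondence can be sketched as a table lookup. Everything named here is hypothetical: the `Row` and `Reference` types, the `correct_row` function and the placeholder values are illustrative only. The sketch anchors on a row whose item name or abbreviation matches a reference entry exactly, which is one possible reading of the rule described above.

```python
from dataclasses import dataclass, replace
from typing import Iterable


@dataclass(frozen=True)
class Reference:
    """One known test item with its fixed corresponding fields."""
    name: str
    abbreviation: str
    unit: str
    reference_range: str


@dataclass(frozen=True)
class Row:
    """One recognized row of a test sheet (only the relevant columns)."""
    name: str
    abbreviation: str
    result: str
    unit: str
    reference_range: str


def correct_row(row: Row, references: Iterable[Reference]) -> Row:
    """Overwrite the fixed columns of a row from the matching reference entry."""
    refs = list(references)
    by_name = {r.name: r for r in refs}
    by_abbreviation = {r.abbreviation: r for r in refs}
    # Anchor on whichever of the two key columns was recognized correctly.
    ref = by_name.get(row.name) or by_abbreviation.get(row.abbreviation)
    if ref is None:
        return row  # no anchor found; leave the row for similarity matching
    # The test result is a measured value and is never overwritten.
    return replace(
        row,
        name=ref.name,
        abbreviation=ref.abbreviation,
        unit=ref.unit,
        reference_range=ref.reference_range,
    )
```

A row whose name matches but whose unit was misread would come back with the unit and reference range taken from the reference entry, while its result column stays untouched.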

In another embodiment of the present application, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the program to implement the method for character recognition and correction described above.

In another embodiment of the present application, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program is executed by a processor to implement the method for character recognition correction described above.

The embodiment of the application provides a language-model-fused method for recognizing and correcting the characters of laboratory test sheets in natural scenes. The method introduces a statistical language model, computes conditional probability statistics over the recognition region matrix, and predicts the optimal recognition result that matches the medical lexicon. It then performs forward and backward correction according to the correspondence between test items, and finally corrects the recognition result with the method that fuses the edit distance and the longest common subsequence. Compared with the recognition result without post-processing, the post-processing method introduced in this embodiment improves the accuracy, recall and F1 value on two batches of medical laboratory test sheets by 2%, 3%, 2%, 5% and 4%, respectively. The comparison experiment shows that the post-processing correction method provided by this embodiment further improves the recognition precision of text box characters and lays a foundation for later interpretation of laboratory test sheets.

The embodiment of the application provides a character recognition and correction method that fuses a language model. The method introduces a language model and uses statistical conditional probabilities to predict the recognition result that best matches the medical lexicon, performs forward and backward correction through the correspondence between test items to further improve recognition accuracy, and finally matches the most suitable recognition result using the method that fuses the edit distance and the longest common subsequence. The comparison experiment shows that the recognition accuracy for laboratory test sheets improves after the correction method provided by the application is introduced, which demonstrates the effectiveness of the method.

It should be noted that:

the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated in the present embodiment, the steps are not restricted to a strict order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

The above-mentioned embodiments only express the embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims

1. A method for character recognition correction, comprising:

constructing a professional lexicon;

constructing a recognition result region matrix;

and performing correction.

2. The method of claim 1, wherein the correcting comprises:

correcting based on the language model;

performing correction based on the edit distance and the longest common subsequence;

and correcting based on the corresponding relation.

3. The method of claim 2, wherein performing the correction based on the language model comprises:

the language model counts the occurrence probability of characters through a probability distribution, and the maximum conditional probability is calculated from the statistical results; for the recognition result S₁ of the first detection region, the first three candidate characters given by the CRNN network are selected, and the probability W(S₁) of each candidate is renormalized according to the probability predicted by the network; for the recognition result S₂ of the second detection region, the first three candidate characters given by the CRNN network are selected, and the probability W(S₂) of each candidate is renormalized according to the probability predicted by the network; for the recognition result S₃ of the third detection region, the first three candidate characters given by the CRNN network are selected, and the probability W(S₃) of each candidate is renormalized according to the probability predicted by the network;

f＝W(S₁)P(S₂|S₁)W(S₂)P(S₃|S₂)W(S₃)

the maximum value of f is the optimal combination mode;

f＝W(S₁)P(S₂|S₁)W(S₂)...W(S_n-1)P(S_n|S_n-1)W(S_n)

The conditional probability formula is

P(S_i+1|S_i)＝N(S_i，S_i+1)/N(S_i)

and the optimal combination path problem is solved to obtain the optimal solution.

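4. The method of claim 2, wherein performing the correction based on the edit distance and the longest common subsequence comprises:

performing a weighted summation of the results calculated by the edit distance method and the longest common subsequence method.

5. The method of claim 1, wherein the performing the correction based on the correspondence comprises:

correcting the recognition results of the items that correspond to an already recognized item, according to the correspondence between the items and the recognized item.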

6. The method of claim 1, wherein constructing the recognition result region matrix comprises: taking the top N recognition results of each region to construct the recognition result region matrix, where N is a positive integer.

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the method of any one of claims 1-6.

8. A computer-readable storage medium, on which a computer program is stored, characterized in that the program is executed by a processor to implement the method according to any of claims 1-6.