CN103996021A - Fusion method of multiple character identification results - Google Patents
Fusion method of multiple character identification results
- Publication number: CN103996021A
- Application number: CN201410191507.XA
- Authority: CN (China)
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a fusion method of multiple character identification results. The fusion method comprises the steps that at least two strings are obtained from at least two character identifiers, and each string comprises multiple characters; identical characters in the two strings are aligned via an optimal alignment algorithm based on a minimal editing distance; all the strings are aligned according to the identical characters, i.e., the identical characters in the multiple strings are aligned; segmentation is carried out according to the aligned identical characters in the multiple strings to obtain segmental aligned links; and an optimal link path is selected from the segmental aligned links to obtain a fusion result. The method determines a result which is most probable to be correct in multiple different identification parts by utilizing a statistical model based on characters, thereby selecting the optimal link path, and achieving a good effect.
Description
Technical Field
The invention relates to a character recognition technology, in particular to a fusion method of multi-character recognition results.
Background
Automatic mail sorting is an important component of postal automation, wherein, one automatic mail sorting technology is to collect mail images, segment the postal code area and address area of mail receivers, identify numbers and Chinese characters of the segmentation results, and realize automatic sorting according to the identification results. Therefore, a correct identification of the mail addresses is an important basis for a correct sorting.
In practice, problems such as an unclear address area on the mail often introduce many errors into the output of a character recognizer. These errors are of two main types: in the first, the characters of the address block are segmented correctly, but the single-character recognition accuracy is not high enough; in the second, the characters of the address block are segmented incorrectly, which makes the recognition result wrong. Fusing the results of multiple character recognizers reduces the influence of the errors of any single recognizer, so that the recognition accuracy of the final result is greatly improved.
Recognition error correction for a Chinese character recognizer belongs to the post-processing part of a recognition system: the erroneous output of the character recognizer is corrected using the semantics and word senses of natural language. In the prior art, post-processing is mainly performed on the recognition results of a single character recognizer, and error correction falls into two main approaches, statistics-based and rule-based. The rule-based approach uses a rule set and exact dictionary information; statistics-based methods typically use a language model built from linguistic knowledge and from the analysis of a corpus. For the results of a single character recognizer, an error caused by incorrect character segmentation is difficult to correct with either the rule-based or the statistics-based approach.
Disclosure of Invention
The invention overcomes the defects of the prior art and provides a method for fusing multi-character recognition results.
The invention provides a method for fusing multi-character recognition results, which comprises the following steps:
step one: obtaining at least 2 character strings from at least 2 character recognizers; each character string comprises a plurality of characters;
step two: aligning the same character in the two character strings by using an optimized alignment algorithm based on the minimum editing distance;
step three: aligning all character strings according to the same character to realize the same character alignment of multiple character strings;
step four: segmenting according to the same aligned characters in the multi-character string to obtain segment aligned links;
step five: and selecting the optimal link path in the segment alignment link to obtain a fusion result.
In the method for fusing the multi-character recognition results provided by the invention, the second step comprises the following steps:
step a: calculating the minimum editing distance between the two character strings to generate an editing distance matrix;
step b: obtaining the units that can be reached by the minimum edit distance backtracking path in the edit distance matrix, and calculating the attribute tuple of each unit;
step c: acquiring an optimal alignment mode from the unit according to the attribute tuple;
step d: and repeating the steps a to c until the two character strings are aligned.
In the method for fusing the multi-character recognition results, the minimum edit distance between the character strings is expressed by the following formula:
$$
\mathrm{distance}[i,j] = \min
\begin{cases}
\mathrm{distance}[i-1,j] + \text{del-cost}(A_i) \\
\mathrm{distance}[i,j-1] + \text{ins-cost}(B_j) \\
\mathrm{distance}[i-1,j-1] + \text{subst-cost}(A_i, B_j)
\end{cases}
$$
wherein
$$
\begin{cases}
\text{ins-cost}(B_j) = 1 \\
\text{del-cost}(A_i) = 1 \\
\text{subst-cost}(A_i, B_j) =
\begin{cases}
2, & A_i \neq B_j \\
0, & A_i = B_j
\end{cases}
\end{cases}
$$
wherein distance[i, j] represents the minimum edit distance, i represents the index of a character in the target string, m represents the total number of characters in the target string, j represents the index of a character in the source string, n represents the total number of characters in the source string, ins-cost($B_j$) represents the distance cost of adding a character, del-cost($A_i$) represents the distance cost of deleting a character, and subst-cost($A_i$, $B_j$) represents the distance cost of replacing a character.
In the method for fusing the multi-character recognition results, the optimal path comprises the following steps:
step b1: for two address strings with lengths m and n respectively, constructing an edit distance matrix with m+1 rows and n+1 columns, selecting cells [m, n] and [0, 0] of the edit distance matrix as the starting point and the end point respectively, and taking the direction from the starting point to the end point as the path direction;
step b2: establishing a tuple characterizing the attributes of each cell in the edit distance matrix, the tuple comprising:
element $target_{ij}$, characterizing the maximum number of identical characters from the starting point to the cell;
element $tag_{ij}$, which, if true, indicates that the character in the i-th row is the same as the character in the j-th column;
element $sub_{ij}$, characterizing the maximum number of replacement operations from the end point to the cell;
element $left_{ij}$, which, if true, indicates that the cell has a horizontal cell;
element $down_{ij}$, which, if true, indicates that the cell has a vertical cell;
element $oblique_{ij}$, which, if true, indicates that the cell has a diagonal cell;
step b3: according to the tuples, if the horizontal cell of the starting point exists and its maximum number of replacement operations equals that of the starting point, the path goes from the starting point to the horizontal cell; otherwise, if the vertical cell of the starting point exists and its maximum number of replacement operations equals that of the starting point, the path goes from the starting point to the vertical cell; otherwise, if the diagonal cell of the starting point exists, the path goes from the starting point to the diagonal cell; after each update, the path continues to be extended according to the tuples until it reaches the end point;
step b4: obtaining from the path the cells whose $tag_{ij}$ is true, thereby obtaining the identical characters of the two character strings, and aligning the two character strings according to the identical characters.
In the method for fusing the multi-character recognition results, the characters of the character strings aligned in step four are grouped by position, the probability values of going from a character in one group to a character in the next group are calculated one by one, and the path formed by the characters with the maximum probability value is marked as the path of correct characters.
In the method for fusing multi-character recognition results, the probability is expressed by the following formula:
$$
\begin{cases}
\max\big(r_{k1}\log pr(a_k \mid a_{k+1}),\ r_{k2}\log pr(b_k \mid b_{k+1}),\ r_{k3}\log pr(c_k \mid c_{k+1})\big), & m = n = p \\
\max\big(\log pr(L_A),\ \log pr(L_B),\ \log pr(L_C)\big), & \text{otherwise}
\end{cases}
$$
in the formula, $r_{k1}$, $r_{k2}$, $r_{k3}$ represent weights; $pr(a_k \mid a_{k+1})$ is the probability that character $a_k$ occurs given that character $a_{k+1}$ has already occurred; $pr(b_k \mid b_{k+1})$ is the probability that character $b_k$ occurs given that character $b_{k+1}$ has already occurred; $pr(c_k \mid c_{k+1})$ is the probability that character $c_k$ occurs given that character $c_{k+1}$ has already occurred; $pr(L_A)$ is the probability of occurrence of the whole string of segment $L_A = \{a_1, a_2, \ldots, a_m\}$; $pr(L_B)$ is the probability of occurrence of the whole string of $L_B = \{b_1, b_2, \ldots, b_n\}$; and $pr(L_C)$ is the probability of occurrence of the whole string of $L_C = \{c_1, c_2, \ldots, c_p\}$.
The beneficial effects of the invention include the following. The invention adopts an optimal alignment method based on the minimum edit distance: by calculating, for each path cell, the maximum number of identical characters and the maximum number of replacement operations, it selects a path on which the number of replacement operations is the largest while the number of aligned identical characters is the largest, thereby maximizing the expected alignment. To choose among the parts where the recognizers differ, a character-based statistical model is used to determine the most probably correct result, so that the optimal link path is selected and a good effect is achieved.
Drawings
FIG. 1 is a flow chart of a method for fusing multiple character recognition results according to the present invention.
FIG. 2 is a diagram of an edit distance matrix in one embodiment.
Fig. 3 is a diagram of a segment aligned link in one embodiment.
FIG. 4 is a flow diagram of tagging element attribute tuples.
FIG. 5 is a flowchart of a method for obtaining an optimal alignment of two address strings according to a path element attribute tuple.
Fig. 6 is a flow chart of a selection probability calculation method.
Detailed Description
The present invention will be described in further detail with reference to the following specific examples and the accompanying drawings. The procedures, conditions, experimental methods and the like for carrying out the present invention are general knowledge and common general knowledge in the art except for the contents specifically mentioned below, and the present invention is not particularly limited.
The invention determines the most probably correct characters in the differing parts of multiple character strings through a statistical model of the character recognition results, thereby selecting the optimal link path and achieving a good recognition effect. The invention is suitable for recognizing Chinese and English characters in images, and is particularly suitable for recognizing Chinese addresses containing Chinese and English characters and numbers. As shown in fig. 1, the method for fusing multi-character recognition results of the present invention comprises the following steps:
step one: obtaining at least 2 character strings from at least 2 character recognizers; each character string includes a plurality of characters;
step two: aligning the same character in the two character strings by using an optimized alignment algorithm based on the minimum editing distance;
step three: aligning all character strings according to the same character to realize the alignment of the same character of multiple character strings;
step four: segmenting according to the same aligned characters in the multi-character string to obtain segment aligned links;
step five: and selecting the optimal link path in the segment alignment link to obtain a fusion result.
The invention can fuse the results of a plurality of character recognizers and can effectively improve the performance of a recognition system.
The following example uses the character strings output by three character recognizers, each of which contains recognition errors. The character strings were generated by the character recognizers from a target image containing the correct address "Xinmin Shopping of Xinmin Network, 41st building, No. 755 Weihai Road, Shanghai".
OCR A: the squashed canoe minor of the New people 11, chorea No. 755, of New people
OCR B: xinminjiu Jun of Xinmin network, Wei Hai Lu 755 # 41G Ba
OCR C: new people shopping with Weihailu No. 7S5 straight-building new people network
Since different character recognizers have great difference in address string segmentation and recognition, there are many segmentation or recognition errors in their character strings. It is possible for a character segmentation error to divide a plurality of characters into 1 character or 1 character into a plurality of characters. This makes the length and position of the recognition address strings output by different character recognizers not necessarily the same. Therefore, when fusing the results of multiple character recognizers, it is necessary to align the same characters that are correctly recognized. The invention adopts an alignment method based on the editing distance, and can effectively select the expected optimal path.
The edit distance of two strings represents the minimum cost required to convert from one string to another through the following three editing operations. The editing operation includes three types: add (I), delete (D) and replace (S), each with a different cost value.
To align the output address strings of the multiple character recognizers, the method uses address string alignment based on the edit distance, which mainly comprises the following steps:
step a: calculating the minimum editing distance between the two character strings to generate an editing distance matrix;
step b: obtaining the units that can be reached by the minimum edit distance backtracking path in the edit distance matrix, and calculating the attribute tuple of each unit;
step c: acquiring an optimal alignment mode from the unit according to the attribute tuple;
step d: and repeating the steps a to c until the two character strings are aligned.
The edit distance is calculated for two character strings, where $A = \{a_1, a_2, \ldots, a_m\}$ is the target string and $B = \{b_1, b_2, \ldots, b_n\}$ is the source string. distance[i, j] denotes the minimum edit distance between the character string $\{a_1, a_2, \ldots, a_i\}$ ($1 \le i \le m$) and the character string $\{b_1, b_2, \ldots, b_j\}$ ($1 \le j \le n$). The value of each cell of the edit distance matrix is the minimum of the costs of the three possible paths that reach the cell. The calculation is as follows:
$$
\mathrm{distance}[i,j] = \min
\begin{cases}
\mathrm{distance}[i-1,j] + \text{del-cost}(A_i) \\
\mathrm{distance}[i,j-1] + \text{ins-cost}(B_j) \\
\mathrm{distance}[i-1,j-1] + \text{subst-cost}(A_i, B_j)
\end{cases}
$$
wherein
$$
\begin{cases}
\text{ins-cost}(B_j) = 1 \\
\text{del-cost}(A_i) = 1 \\
\text{subst-cost}(A_i, B_j) =
\begin{cases}
2, & A_i \neq B_j \\
0, & A_i = B_j
\end{cases}
\end{cases}
$$
wherein distance[i, j] represents the minimum edit distance, i represents the index of a character in the target string, m represents the total number of characters in the target string, j represents the index of a character in the source string, n represents the total number of characters in the source string, ins-cost($B_j$) represents the distance cost of adding a character, del-cost($A_i$) represents the distance cost of deleting a character, and subst-cost($A_i$, $B_j$) represents the distance cost of replacing a character.
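The matrix construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `edit_distance_matrix` and its parameters are hypothetical, and the two example strings are written as lists of romanized syllables ("Zhong Shan Bei Yi Lu", "Bao Shan Tie Shan Lu"), which is an assumed reading of the translated example strings.

```python
def edit_distance_matrix(target, source, ins_cost=1, del_cost=1, sub_cost=2):
    """Return the (m+1) x (n+1) matrix whose cell [i][j] is the minimum
    edit distance between target[:i] and source[:j]."""
    m, n = len(target), len(source)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    # First column: only deletions; first row: only insertions.
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # A match costs 0, a replacement costs sub_cost.
            subst = 0 if target[i - 1] == source[j - 1] else sub_cost
            dist[i][j] = min(dist[i - 1][j] + del_cost,      # delete A_i
                             dist[i][j - 1] + ins_cost,      # add B_j
                             dist[i - 1][j - 1] + subst)     # replace or match
    return dist


# Assumed romanization of the two example address strings.
target = ["Zhong", "Shan", "Bei", "Yi", "Lu"]
source = ["Bao", "Shan", "Tie", "Shan", "Lu"]
matrix = edit_distance_matrix(target, source)
```

Under these assumptions the last cell, `matrix[5][5]`, holds the minimum edit distance of the two strings, and each row of `matrix` corresponds to one character of the target string.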
For example, the above algorithm is used to calculate the edit distance matrix of the character strings "Zhongshan Bei Yi Lu" and "Baoshan Tie Shan Lu", as shown in table 1:
TABLE 1 edit distance matrix for the character strings "Zhongshan Bei Yi Lu" and "Baoshan Tie Shan Lu"
| Lu | 5 | 6 | 5 | 6 | 7 | 6 |
| Yi | 4 | 5 | 4 | 5 | 6 | 7 |
| Bei | 3 | 4 | 3 | 4 | 5 | 6 |
| Shan | 2 | 3 | 2 | 3 | 4 | 5 |
| Zhong | 1 | 2 | 3 | 4 | 5 | 6 |
| # | 0 | 1 | 2 | 3 | 4 | 5 |
|  | # | Bao | Shan | Tie | Shan | Lu |
The optimal path selected by backtracking on the minimum edit distance is not unique, and different paths differ markedly in the resulting alignment. Table 2 shows two alignments that have the same minimum edit distance and the same maximum number of identical characters but different numbers of replacement operations. The first alignment is more likely to occur than the second. The invention therefore provides a method for selecting an optimal path that not only satisfies the minimum edit distance but also, among the paths with the maximum number of identical characters, maximizes the number of replacement operations; this improvement markedly improves the accuracy of alignment.
TABLE 2 alignment of the character strings "Zhongshan Bei Yi Lu" and "Baoshan Tie Shan Lu"
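A short count shows why alignments with the same edit distance can still differ in the number of replacements. The derivation below is a sketch that assumes the cost values stated earlier (1 for adding, 1 for deleting, 2 for replacing, 0 for a match); the symbols $k$, $s$, $d$ and $a$ (numbers of matched, replaced, deleted and added characters in one alignment) are introduced here for illustration only.

```latex
\begin{aligned}
m &= k + s + d, \qquad n = k + s + a \\
\text{cost} &= a + d + 2s \\
            &= (n - k - s) + (m - k - s) + 2s \\
            &= m + n - 2k
\end{aligned}
```

Under these assumptions the cost depends only on the number of identical characters $k$, so every minimum-distance alignment has the same maximal $k$, while $s$ remains free. For two five-character strings with $k = 2$, the cost is $5 + 5 - 2 \cdot 2 = 6$.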
When selecting the optimal path, searching every path has a time complexity of $O(3^n)$, where n is the larger of the source string length and the target string length. To avoid this excessive complexity of searching for the target path, the following path searching method is provided.
Each cell [i, j] of the edit distance matrix uses an attribute tuple $(target_{ij}, sub_{ij}, tag_{ij}, left_{ij}, down_{ij}, oblique_{ij})$ to represent the properties of the cell. The attribute tuple includes:
element $target_{ij}$, characterizing the maximum number of identical characters from the starting point to the cell;
element $tag_{ij}$, which, if true, indicates that the character in the i-th row is the same as the character in the j-th column;
element $sub_{ij}$, characterizing the maximum number of replacement operations from the end point to the cell;
element $left_{ij}$, which, if true, indicates that the cell has a horizontal cell;
element $down_{ij}$, which, if true, indicates that the cell has a vertical cell;
element $oblique_{ij}$, which, if true, indicates that the cell has a diagonal cell.
For each cell of the edit distance matrix, there are 3 possible path directions: horizontal, vertical and diagonal. The direction attributes of the path cells are marked according to the minimum edit distance backtracking path.
Referring to FIG. 4, the edit distance matrix is a matrix of i rows and j columns, with cell [0, 0] as the end point and cell [i, j] as the starting point. For cell [i, j]: if the left cell [i, j-1] exists and distance[i, j-1] < distance[i, j], then $left_{ij}$ = true; if the lower cell [i-1, j] exists and distance[i-1, j] < distance[i, j], then $down_{ij}$ = true; if the diagonally lower cell [i-1, j-1] exists and either distance[i-1, j-1] < distance[i, j], or distance[i-1, j-1] == distance[i, j] and $tag_{ij}$ = true, then $oblique_{ij}$ = true. As shown in FIG. 4, the invention also covers searching for the optimal path with cell [0, 0] as the starting point and cell [i, j] as the end point, in which case the directions of the cells are the reverse of those in the above process.
For a path cell [i, j], $target_{ij}$ represents the largest number of identical characters encountered from cell [m, n] to cell [i, j]. If the characters corresponding to cell [i, j] are equal, $tag_{ij}$ = true. Cell [i, j] can only be reached from 3 directions: from above, from diagonally above and from the right. Therefore, $target_{ij} = \max(target_{i+1,j},\ target_{i,j+1},\ target_{i+1,j+1} + tag_{i+1,j+1})$.
For a path cell [i, j], $sub_{ij}$ represents the maximum number of replacement operations performed from cell [0, 0] to cell [i, j]. Cell [i, j] can be reached from any of 3 directions, from the left, from below and from diagonally below, and the maximum number of replacement operations on reaching cell [i, j] is selected: $sub_{ij} = \max(sub_{i,j-1},\ sub_{i-1,j},\ sub_{i-1,j-1} + 1)$.
According to the attribute tuples of the cells, a path from cell [m, n] to cell [0, 0] is selected that performs the most replacement operations given the most identical characters. For cell [i, j]: if $left_{ij}$ = true and $sub_{i,j-1} = sub_{ij}$, go to cell [i, j-1]; otherwise, if $down_{ij}$ = true and $sub_{i-1,j} = sub_{ij}$, go to cell [i-1, j]; otherwise, go to cell [i-1, j-1]. These steps are repeated until the path has gone from the starting point to the end point, which gives the optimal path. From the optimal path, the cells whose $tag_{ij}$ is true are obtained together with the identical characters corresponding to those cells, and the two character strings are aligned according to the identical characters to obtain the optimal alignment.
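The objective of this path search (stay on a minimum-edit-distance path, and among those prefer the one with the most replacement operations) can be illustrated with a small dynamic program. The sketch below is a simplified stand-in, not the attribute-tuple procedure of the text: it keeps one table of replacement counts over the backtracking graph and breaks ties in the order horizontal, vertical, diagonal. The function name `align_max_substitutions` and the romanized example strings are assumptions made for illustration.

```python
def align_max_substitutions(target, source):
    """Among minimum-edit-distance alignments (add = delete = 1, replace = 2),
    return (matched index pairs, number of replacements) for one alignment
    with the largest number of replacements."""
    m, n = len(target), len(source)
    dist = [[i + j if i == 0 or j == 0 else 0 for j in range(n + 1)]
            for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            step = 0 if target[i - 1] == source[j - 1] else 2
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + step)

    def predecessors(i, j):
        """Yield (prev_i, prev_j, replacements_added, is_match) for every
        move into [i, j] that lies on some minimum-cost path."""
        if j > 0 and dist[i][j - 1] + 1 == dist[i][j]:
            yield i, j - 1, 0, False            # horizontal move
        if i > 0 and dist[i - 1][j] + 1 == dist[i][j]:
            yield i - 1, j, 0, False            # vertical move
        if i > 0 and j > 0:
            same = target[i - 1] == source[j - 1]
            if dist[i - 1][j - 1] + (0 if same else 2) == dist[i][j]:
                yield i - 1, j - 1, 0 if same else 1, same  # diagonal move

    # subs[i][j]: most replacements on any minimum-cost path [0,0] -> [i,j].
    subs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i or j:
                subs[i][j] = max(subs[pi][pj] + add
                                 for pi, pj, add, _ in predecessors(i, j))

    # Walk back from [m, n] to [0, 0], always keeping the best count.
    matches, i, j = [], m, n
    while i or j:
        for pi, pj, add, same in predecessors(i, j):
            if subs[pi][pj] + add == subs[i][j]:
                if same:
                    matches.append((pi, pj))    # 0-based character indices
                i, j = pi, pj
                break
    matches.reverse()
    return matches, subs[m][n]


pairs, replacements = align_max_substitutions(
    ["Zhong", "Shan", "Bei", "Yi", "Lu"], ["Bao", "Shan", "Tie", "Shan", "Lu"])
```

For the example strings, this sketch aligns the first "Shan" of the source rather than the second, which keeps the remaining characters paired as replacements instead of splitting them into additions and deletions.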
Table 3 shows the attribute representation of a cell, and fig. 2 shows the cell attribute tuples generated when aligning the two character strings "Zhongshan Bei Yi Lu" and "Baoshan Tie Shan Lu", together with the selected optimal path.
TABLE 3 attribute representation of cells
After the character strings are aligned pairwise by the above method, the identical characters of each pairwise address alignment are obtained. The pairwise alignments are then combined into a multi-address alignment by searching for and matching identical subscripts, which yields the identical characters aligned across multiple addresses. Referring to fig. 5, the steps for aligning three addresses (OCR A, OCR B, OCR C) are:
1. Take a pair of aligned identical characters from the alignment of OCR A and OCR B, and mark their subscripts as i and j respectively.
2. In the alignment of OCR A and OCR C, if the i-th character of OCR A is aligned with the k-th character of OCR C, go to step 3; otherwise, return to step 1.
3. In the alignment of OCR B and OCR C, if the j-th character of OCR B is aligned with the k-th character of OCR C, the character is a multi-address aligned character. After the k-th character is recorded, return to step 1 to search for the next identical character with multi-address alignment, until all such characters are obtained.
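The subscript matching in the three steps above can be sketched as a lookup over the three pairwise alignments. This is a minimal illustration, assuming each pairwise alignment is given as a list of index pairs of identical characters; the function name `merge_pairwise_alignments` and the sample index pairs are hypothetical.

```python
def merge_pairwise_alignments(ab, ac, bc):
    """Combine pairwise alignments into three-way aligned characters.

    ab, ac, bc: lists of (index, index) pairs of identical characters for
    the string pairs (A, B), (A, C) and (B, C).
    Returns the (i, j, k) triples on which all three alignments agree."""
    a_to_c = dict(ac)
    b_to_c = dict(bc)
    triples = []
    for i, j in ab:                      # step 1: a pair aligned in A and B
        k = a_to_c.get(i)                # step 2: is A[i] aligned in C?
        if k is not None and b_to_c.get(j) == k:   # step 3: B[j] -> same C[k]
            triples.append((i, j, k))
    return triples


# Hypothetical pairwise alignments, for illustration only.
aligned = merge_pairwise_alignments(
    ab=[(0, 0), (1, 1), (4, 3)],
    ac=[(0, 0), (2, 2), (4, 4)],
    bc=[(0, 0), (1, 2), (3, 4)],
)
```

In this made-up input, the pair (1, 1) is dropped because character 1 of A has no partner in C, so only characters confirmed by all three pairwise alignments are kept.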
Fusing the character strings of multiple character recognizers is an optimal path selection problem. The character strings of the recognizers are aligned, segmented at the aligned identical characters to form segment aligned links, and finally the optimal link path is selected using a character-based statistical language model. In selecting the optimal link path, path selection based on dictionary matching and rules is difficult to use, because written Chinese has no spaces between words and the recognition results contain errors. Using a character-based statistical model to determine the most probably correct result in the differing parts therefore works well.
The character strings are segmented at the identical characters, so that the outputs of the multiple character recognizers form aligned segments, which together make up the segment aligned link. In the segment aligned link, each group of aligned identical characters can be regarded as merged into one character, while the differing characters form multiple candidate paths; the optimal link path is the path formed by the identical characters and the differing characters with the highest probability value. Paths are selected within the aligned segments. If the segments do not have the same length, the segment with the highest probability is selected; in this case there are only paths within a segment and no paths between segments. If the segments have the same length, the character with the highest probability among the corresponding single characters of the segments is selected, proceeding in reverse order; this case includes both intra-segment and inter-segment paths. FIG. 3 is a diagram of the segment aligned link formed after string alignment.
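One way to picture the segment aligned link is as a list that alternates between merged identical characters and groups of differing candidate segments. The sketch below assumes the multi-string aligned characters are already known as index tuples; the function name `split_into_segments`, the labels `"anchor"` and `"candidates"`, and the placeholder strings are all hypothetical.

```python
def split_into_segments(strings, anchors):
    """Cut aligned strings at their common aligned characters.

    strings: one sequence per recognizer.
    anchors: increasing index tuples, one index per string, marking the
             characters that all recognizers agree on.
    Returns a list of ("anchor", char) and ("candidates", [segments])."""
    link = []
    previous = [-1] * len(strings)
    # A sentinel tuple of string lengths closes the last segment.
    sentinel = tuple(len(s) for s in strings)
    for anchor in list(anchors) + [sentinel]:
        differing = [s[p + 1:a] for s, p, a in zip(strings, previous, anchor)]
        if any(differing):               # at least one recognizer has text here
            link.append(("candidates", differing))
        if anchor[0] < len(strings[0]):  # a real anchor, not the sentinel
            link.append(("anchor", strings[0][anchor[0]]))
        previous = list(anchor)
    return link


# Placeholder strings: "a", "b" and "e" are the agreed characters.
link = split_into_segments(["ab12e", "ab7e", "abQ2e"],
                           [(0, 0, 0), (1, 1, 1), (4, 3, 4)])
```

Each `"candidates"` entry then holds the competing segments among which the probability-based selection has to choose.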
The maximum-probability path is selected based on a character statistical language model. A statistical language model uses probabilistic and statistical methods to reveal the regularities of natural language; in effect it is a probability distribution that gives the probability of every possible character string in the language. Any string is acceptable to a statistical language model, only with a different degree of acceptability. For example, for a string $w_1, w_2, \ldots, w_i$ (i represents the string length), the probability of occurrence is:
$$pr(w_1, w_2, \ldots, w_i) = pr(w_1)\, pr(w_2 \mid w_1) \cdots pr(w_i \mid w_1, w_2, \ldots, w_{i-1})$$
where $pr(w_1)$, $pr(w_2 \mid w_1)$, ..., $pr(w_i \mid w_1, w_2, \ldots, w_{i-1})$ are calculated from corpus statistics. However, because the corpus is not complete enough, the calculation of $pr(w_i \mid w_1, w_2, \ldots, w_{i-1})$ easily runs into the sparsity problem, so a Markov chain assumption, namely an N-gram model, is usually made. For example, unigram: $pr(w_i \mid w_1, w_2, \ldots, w_{i-1}) = pr(w_i)$; bigram: $pr(w_i \mid w_1, w_2, \ldots, w_{i-1}) = pr(w_i \mid w_{i-1})$. The invention uses the bigram model for probability calculation.
$pr(a_k \mid a_{k+1})$ of the bigram model is estimated by maximum likelihood estimation (MLE):
$$pr(a_k \mid a_{k+1}) = \frac{\#(a_k, a_{k+1})}{\#(a_{k+1})}$$
where $\#(a_{k+1})$ denotes the number of occurrences of $a_{k+1}$ in the corpus and $\#(a_k, a_{k+1})$ denotes the number of occurrences of $(a_k, a_{k+1})$ in the corpus.
Because the corpus is incomplete, many entries do not exist in the corpus or occur only rarely. For the probability of nonexistent entries, a simple Laplace smoothing algorithm is used. For the low-probability problem of the bigram model, interpolation is used as an improvement: $pr(a_k \mid a_{k+1}) = \lambda_1\, pr(a_k \mid a_{k+1}) + \lambda_2\, pr(a_k)$, where $\sum_i \lambda_i = 1$.
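A character bigram model of this kind can be sketched with two counters. The example below is an illustration under stated assumptions rather than the model used in the invention: the class name `ReverseBigramModel`, the weights 0.7 and 0.3, the add-one form of the Laplace smoothing on the unigram term, and the toy corpus are all choices made here for the sake of a runnable example.

```python
from collections import Counter


class ReverseBigramModel:
    """Character bigram model conditioned on the following character,
    pr(a_k | a_{k+1}), with linear interpolation against the unigram."""

    def __init__(self, corpus, lambda1=0.7, lambda2=0.3):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for line in corpus:
            self.unigrams.update(line)                   # #(a)
            self.bigrams.update(zip(line, line[1:]))     # #(a_k, a_{k+1})
        self.total = sum(self.unigrams.values())
        self.vocab = len(self.unigrams)
        self.lambda1, self.lambda2 = lambda1, lambda2    # assumed weights

    def unigram(self, ch):
        # Add-one (Laplace) smoothing so unseen characters keep some mass.
        return (self.unigrams[ch] + 1) / (self.total + self.vocab)

    def bigram(self, ch, next_ch):
        following = self.unigrams[next_ch]
        # Maximum likelihood estimate #(a_k, a_{k+1}) / #(a_{k+1}).
        mle = self.bigrams[(ch, next_ch)] / following if following else 0.0
        return self.lambda1 * mle + self.lambda2 * self.unigram(ch)


# Toy corpus, for illustration only.
model = ReverseBigramModel(["ab", "ab", "cb"])
```

With interpolation, a pair that never occurs in the corpus still receives a small non-zero probability through the unigram term, which is what keeps the later log-probabilities finite.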
Calculating the probability of characters on the path has two directions of left to right and right to left. More information can be obtained by selecting right to left, and the right result can be better selected by adding some rules in the calculation of probability, wherein the rules are as follows:
1. in the Chinese address identification, when keywords such as 'Chao', 'number', 'building', 'layer', 'building', 'multi-span', 'room' and the like appear, the probability of the appearance of the numbers in the front is much greater than that of the Chinese characters;
2. in the Chinese address recognition, the probability of many keywords is much higher than that of a general word; for example, keywords such as "province", "city", "district", "county", "town", "road", "village", "fidu", "number", "building", "layer", "building", "room", etc. are more likely to be selected than a general word. Therefore, a weight $r_k$ can be added when calculating the probability: when condition 1 or 2 is satisfied, $r_k$ takes a value other than 1; otherwise $r_k = 1$;
3. For the probability of a single number, because the occurrence frequency of the number in the training sample is very high, and the number only exists between 0 and 9, the probability of the number is large during calculation, and therefore, a limit value N (for example, the value is 50) is given to the occurrence frequency of the single number;
4. for the probability of 2 consecutive numbers, a limited number of occurrences M (e.g., 1000) is also given for the same reason as rule 3.
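Rules 3 and 4 amount to capping raw corpus counts before probabilities are computed. A minimal sketch follows, assuming the example limits N = 50 and M = 1000 mentioned above; the function name `capped_count` and the token representation (a one-character string, or a tuple of two characters) are hypothetical.

```python
DIGITS = set("0123456789")


def capped_count(token, raw_count, single_digit_cap=50, digit_pair_cap=1000):
    """Limit the corpus count of digits so they do not dominate.

    token: a single character, or a tuple of two consecutive characters.
    """
    if isinstance(token, str) and token in DIGITS:
        return min(raw_count, single_digit_cap)          # rule 3
    if (isinstance(token, tuple) and len(token) == 2
            and all(ch in DIGITS for ch in token)):
        return min(raw_count, digit_pair_cap)            # rule 4
    return raw_count                                     # other tokens unchanged
```

Counts of non-digit characters and of mixed pairs pass through unchanged, so only the digit statistics are flattened.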
Referring to fig. 6, if the address strings OCRA, OCRB, OCRC are strings of three character recognizers, L is respectively set for three segments after character segmentationA={a1,a2,...,am},LB={b1,b2,...,bn},LC={c1,c2,...,cp}. If the number m, n, p of the characters after segmentation is equal, i.e. m ═ n ═ p, then the character max (r) with the highest probability is selected in sequencek1log(pr(ak|ak+1)),rk2log(pr(bk|bk+1)),rk3log(pr(ck|ck+1) K is not less than 1 and not more than m); otherwise, if not, selecting the segment max (log (pr (L)) with the maximum probabilityA)),log(pr(LB)),log(pr(LC)))。
Here, $\log pr(L_A)$ is calculated as follows:
$$pr(L_A) = pr(a_1, a_2, \ldots, a_m) = pr(a_m) \cdot pr(a_{m-1} \mid a_m) \cdots pr(a_1 \mid a_2)$$
For convenience of calculation, the logarithm of both sides is taken:
$$\log pr(L_A) = \log pr(a_m) + \sum_{k=1}^{m-1} \log pr(a_k \mid a_{k+1})$$
Because the lengths of $L_A$, $L_B$, $L_C$ may differ, the log-probability is averaged over the segment length to avoid the bias caused by the different lengths, and the rule weights are added.
Similarly, $\log pr(L_B)$ and $\log pr(L_C)$ can be calculated.
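The segment score described above can be sketched as follows. This is a sketch under stated assumptions: the exact averaged formula is not reproduced in this text, so dividing by the segment length and applying the weight only to the conditional terms are assumptions, and `unigram`, `bigram`, and `weight` are hypothetical callables supplied by the caller.

```python
import math
from typing import Callable, Sequence


def segment_log_prob(segment: Sequence[str],
                     unigram: Callable[[str], float],
                     bigram: Callable[[str, str], float],
                     weight: Callable[[str, str], float] = lambda cur, nxt: 1.0
                     ) -> float:
    """Length-normalised, right-to-left log-probability of one segment.

    unigram(c)       -> pr(c)
    bigram(cur, nxt) -> pr(cur | nxt), i.e. conditioned on the character
                        to the right
    weight(cur, nxt) -> rule weight r_k (1.0 when no rule applies)
    """
    if not segment:
        raise ValueError("segment must not be empty")
    # Start from the rightmost character: log pr(a_m).
    total = math.log(unigram(segment[-1]))
    # Add r_k * log pr(a_k | a_{k+1}) for every adjacent pair.
    for cur, nxt in zip(segment[:-1], segment[1:]):
        total += weight(cur, nxt) * math.log(bigram(cur, nxt))
    # Average over the length so longer segments are not penalised.
    return total / len(segment)


def pick_segment(candidates: Sequence[Sequence[str]],
                 unigram: Callable[[str], float],
                 bigram: Callable[[str, str], float]) -> Sequence[str]:
    """Return the candidate segment with the highest normalised score."""
    return max(candidates, key=lambda s: segment_log_prob(s, unigram, bigram))
```

`pick_segment` corresponds to the unequal-length case, where whole segments are compared rather than individual characters.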
The following example uses the character strings output by three character recognizers for a character image whose "correct address" is: No. 8, Lane 898, Harley Road, Zhangjiang.
OCR A: Zhang Jiang Ha Le 898, 8;
OCR B: Zhangjiang thunderbolt 898 neon 8;
OCR C: Zhang Jiang Harley 8 in 8 No. 8;
When the optimal link path is selected for these three address strings, both cases arise: segments of equal length and segments of different lengths.
For the case where the segment lengths are not the same, with $L_A$ = {98 do}, $L_B$ = {98 neon} and $L_C$, the segment probabilities are calculated and compared.
Since $\log pr(L_A) > \log pr(L_C) > \log pr(L_B)$, the selected segment is "98 do".
For the case where the segment lengths are the same, with $L_A$ = {(Lei) Lei}, $L_B$ = {Thunderway} and $L_C$, the link probabilities are calculated and the characters selected one at a time.
Since $r(\text{chore}, 8) \log pr(\text{chore} \mid 8) < r(\text{way}, 8) \log pr(\text{way} \mid 8)$, the selected character is "way".
For the next character, the character that has already appeared is the one selected in the previous step, i.e. "way"; the probabilities are calculated and compared as follows:
$r(\text{thunder}, \text{way}) \log pr(\text{thunder} \mid \text{way}) = -10.6176$
$r(\text{thunderbolt}, \text{way}) \log pr(\text{thunderbolt} \mid \text{way}) = -38.0009$
Since $r(\text{thunder}, \text{way}) \log pr(\text{thunder} \mid \text{way}) > r(\text{thunderbolt}, \text{way}) \log pr(\text{thunderbolt} \mid \text{way})$, the selected character is "thunder". The characters selected for the whole segment are therefore "thunder way".
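The right-to-left, character-by-character selection used in the equal-length case can be sketched as follows. This is an illustrative sketch only: the function name and the way candidates are grouped are hypothetical, the English tokens stand in for single Chinese characters, and the two scores conditioned on "8" are placeholders because the source states only their ordering.

```python
from typing import Callable, List, Sequence


def pick_characters(segments: Sequence[Sequence[str]],
                    right_context: str,
                    score: Callable[[str, str], float]) -> List[str]:
    """Choose one character per position, moving from right to left.

    score(current, following) returns r * log pr(current | following);
    `following` is the character already fixed to the right.
    """
    length = len(segments[0])
    if any(len(seg) != length for seg in segments):
        raise ValueError("all candidate segments must have the same length")
    chosen: List[str] = []
    following = right_context
    for pos in range(length - 1, -1, -1):
        # Distinct candidates offered by the recognizers at this position.
        options = sorted({seg[pos] for seg in segments})
        best = max(options, key=lambda c: score(c, following))
        chosen.append(best)
        following = best  # the winner conditions the next step
    chosen.reverse()
    return chosen


# The two "way"-conditioned values come from the worked example; the two
# "8"-conditioned values are placeholders that only preserve the ordering.
EXAMPLE_SCORES = {
    ("thunder", "way"): -10.6176,
    ("thunderbolt", "way"): -38.0009,
    ("way", "8"): -1.0,     # placeholder
    ("chore", "8"): -2.0,   # placeholder
}

result = pick_characters(
    [["thunder", "chore"], ["thunderbolt", "way"]],
    right_context="8",
    score=lambda cur, nxt: EXAMPLE_SCORES[(cur, nxt)],
)
print(result)  # ['thunder', 'way']
```

Each selected character becomes the conditioning character for the position to its left, which mirrors the two comparison steps of the example.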
The protection of the present invention is not limited to the above embodiments. Variations and advantages that may occur to those skilled in the art without departing from the spirit and scope of the inventive concept are included in the invention, and the scope of protection is defined by the appended claims.
Claims (6)
1. A method for fusing multiple character recognition results, characterized by comprising the following steps:
step one: obtaining at least two character strings from at least two character recognizers, each character string comprising a plurality of characters;
step two: aligning the identical characters in two character strings by using an optimized alignment algorithm based on the minimum edit distance;
step three: aligning all character strings according to the identical characters, so that identical characters are aligned across the multiple character strings;
step four: segmenting according to the aligned identical characters in the multiple character strings to obtain segment-aligned links;
step five: selecting the optimal link path in the segment-aligned links to obtain a fusion result.
2. The method for fusing multiple character recognition results according to claim 1, wherein step two comprises the following steps:
step a: calculating the minimum edit distance between the two character strings to generate an edit distance matrix;
step b: obtaining the units that can be reached by the minimum edit distance backtracking path in the edit distance matrix, and calculating the attribute tuple of each unit;
step c: obtaining the optimal alignment from the units according to the attribute tuples;
step d: repeating steps a to c until the two character strings are aligned.
3. The method for fusing multiple character recognition results according to claim 2, wherein the minimum edit distance between characters is expressed by the following formula:
wherein,
$$
\begin{cases}
\text{ins-cost}(B_j) = 1 \\
\text{del-cost}(A_i) = 1 \\
\text{subst-cost}(A_i, A_j) = \begin{cases} 2, & A_i \neq A_j \\ 0, & A_i = A_j \end{cases}
\end{cases}
$$
wherein distance[i, j] represents the minimum edit distance; i represents the ordinal position of a character in the target character string and m the total number of characters in the target character string; j represents the ordinal position of a character in the source character string and n the total number of characters in the source character string; ins-cost($B_j$) represents the distance cost of adding a character, del-cost($A_i$) represents the distance cost of deleting a character, and subst-cost($A_i$, $A_j$) represents the distance cost of replacing a character.
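The cost definitions above can be exercised with a short sketch. The recurrence itself is not reproduced in this text, so the code assumes the standard dynamic-programming form, in which each cell takes the minimum over a deletion, an insertion, and a substitution from its three neighbors; the function and variable names are hypothetical.

```python
from typing import List, Sequence


def edit_distance_matrix(target: Sequence[str],
                         source: Sequence[str]) -> List[List[int]]:
    """Build the (m+1) x (n+1) matrix with costs ins = 1, del = 1,
    subst = 2 for different characters and 0 for identical ones."""
    m, n = len(target), len(source)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + 1          # deletions only
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + 1          # insertions only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = 0 if target[i - 1] == source[j - 1] else 2
            dist[i][j] = min(
                dist[i - 1][j] + 1,              # delete target[i-1]
                dist[i][j - 1] + 1,              # insert source[j-1]
                dist[i - 1][j - 1] + subst,      # replace or keep
            )
    return dist


matrix = edit_distance_matrix("intention", "execution")
print(matrix[-1][-1])  # 8
```

With a substitution cost of 2, replacing a character costs the same as deleting it and inserting another, so the distance equals m + n minus twice the length of the longest common subsequence.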
4. The method for fusing multiple character recognition results according to claim 2, wherein obtaining the optimal path comprises the following steps:
step b1: for two address strings with lengths m and n respectively, constructing an edit distance matrix with m+1 rows and n+1 columns, selecting the units [m, n] and [0, 0] of the edit distance matrix as the starting point and the end point respectively, and taking the direction from the starting point to the end point as the path direction;
step b2: establishing a tuple characterizing the attributes of each unit in the edit distance matrix, the tuple comprising:
element $target_{ij}$, characterizing the maximum number of identical characters from the starting point to the unit;
element $tag_{ij}$, which, if true, indicates that the character in the i-th row is the same as the character in the j-th column;
element $sub_{ij}$, characterizing the maximum number of replacement operations from the end point to the unit;
element $left_{ij}$, which, if true, indicates that the unit has a transverse unit;
element $down_{ij}$, which, if true, indicates that the unit has a longitudinal unit;
element $oblique_{ij}$, which, if true, indicates that the unit has an oblique unit;
step b3: according to the tuple, if the transverse unit of the starting point exists and its maximum number of replacement operations equals that of the starting point, the path goes from the starting point to the transverse unit; otherwise, if the longitudinal unit of the starting point exists and its maximum number of replacement operations equals that of the starting point, the path goes from the starting point to the longitudinal unit; otherwise, if the oblique unit of the starting point exists, the path goes from the starting point to the oblique unit; after the path is updated, the path direction continues to be updated according to the tuple until the path extends from the starting point to the end point;
step b4: obtaining from the path the units whose tuple element $tag_{ij}$ is true, thereby obtaining the identical characters between the two character strings, and aligning the two character strings according to the identical characters.
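The idea of walking back through the matrix and collecting the identical characters can be sketched as follows. This is a deliberately simplified illustration: it uses a plain minimum-edit-distance backtrace with a fixed tie-breaking order, not the attribute-tuple procedure of steps b2 and b3, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def align_identical(a: Sequence[str], b: Sequence[str]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with a[i] == b[j] that lie on one
    minimum-cost path from cell [m, n] back to cell [0, 0]."""
    m, n = len(a), len(b)
    # Border cells hold pure deletion or insertion costs; the rest start at 0.
    dist = [[i + j if i == 0 or j == 0 else 0 for j in range(n + 1)]
            for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = 0 if a[i - 1] == b[j - 1] else 2
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + subst)
    pairs: List[Tuple[int, int]] = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))    # identical character: anchor
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1                           # deletion
        elif dist[i][j] == dist[i][j - 1] + 1:
            j -= 1                           # insertion
        else:
            i, j = i - 1, j - 1              # substitution
    pairs.reverse()
    return pairs


print(align_identical("abcd", "abxd"))  # [(0, 0), (1, 1), (3, 3)]
```

The returned index pairs play the role of the units whose $tag_{ij}$ is true: they mark the characters on which the two strings agree and around which the strings can be segmented.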
5. The method for fusing multiple character recognition results according to claim 1, wherein in step four the characters are grouped by position according to the aligned character strings, the probability values of the characters between the groups are calculated one by one starting from one character, and the path formed by the characters with the maximum probability values is marked as the path of the correct characters.
6. The method for fusing multiple character recognition results according to claim 5, wherein the probability is expressed by the following formula:
in the formula, $r_{k1}$, $r_{k2}$, $r_{k3}$ respectively represent the weights; $pr(a_k \mid a_{k+1})$ represents the probability that character $a_k$ occurs given that character $a_{k+1}$ has already occurred; $pr(b_k \mid b_{k+1})$ represents the probability that character $b_k$ occurs given that character $b_{k+1}$ has already occurred; $pr(c_k \mid c_{k+1})$ represents the probability that character $c_k$ occurs given that character $c_{k+1}$ has already occurred; $pr(L_A)$ represents the probability of occurrence of the entire string of the segment $L_A = \{a_1, a_2, \ldots, a_m\}$; $pr(L_B)$ represents the probability of occurrence of the entire string of $L_B = \{b_1, b_2, \ldots, b_n\}$; and $pr(L_C)$ represents the probability of occurrence of the entire string of $L_C = \{c_1, c_2, \ldots, c_p\}$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410191507.XA CN103996021A (en) | 2014-05-08 | 2014-05-08 | Fusion method of multiple character identification results |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103996021A true CN103996021A (en) | 2014-08-20 |
Family
ID=51310182
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410191507.XA Pending CN103996021A (en) | 2014-05-08 | 2014-05-08 | Fusion method of multiple character identification results |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103996021A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2309487A1 (en) * | 2009-09-11 | 2011-04-13 | Honda Research Institute Europe GmbH | Automatic speech recognition system integrating multiple sequence alignment for model bootstrapping |
CN101950560A (en) * | 2010-09-10 | 2011-01-19 | 中国科学院声学研究所 | Continuous voice tone identification method |
CN103680499A (en) * | 2013-11-29 | 2014-03-26 | 北京中科模识科技有限公司 | High-precision recognition method and high-precision recognition system on basis of voice and subtitle synchronization |
Non-Patent Citations (1)
Title |
---|
全中华等: "基于串匹配的特殊点匹配和自由伪造签名快速排除法", 《应用科学学报》 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105653517A (en) * | 2015-11-05 | 2016-06-08 | 乐视致新电子科技(天津)有限公司 | Recognition rate determining method and apparatus |
CN107220639A (en) * | 2017-04-14 | 2017-09-29 | 北京捷通华声科技股份有限公司 | The correcting method and device of OCR recognition results |
CN107609592B (en) * | 2017-09-15 | 2020-10-23 | 桂林电子科技大学 | Graph editing distance method for letter recognition |
CN107609592A (en) * | 2017-09-15 | 2018-01-19 | 桂林电子科技大学 | A kind of figure edit distance approach towards Letter identification |
CN107967303A (en) * | 2017-11-10 | 2018-04-27 | 传神语联网网络科技股份有限公司 | The method and device that language material is shown |
CN107967303B (en) * | 2017-11-10 | 2021-03-26 | 传神语联网网络科技股份有限公司 | Corpus display method and apparatus |
CN108052609A (en) * | 2017-12-13 | 2018-05-18 | 武汉烽火普天信息技术有限公司 | A kind of address matching method based on dictionary and machine learning |
CN108647319A (en) * | 2018-05-10 | 2018-10-12 | 思派(北京)网络科技有限公司 | A kind of labeling system and its method based on short text clustering |
CN108647319B (en) * | 2018-05-10 | 2021-07-06 | 思派(北京)网络科技有限公司 | Labeling system and method based on short text clustering |
CN111832554A (en) * | 2019-04-15 | 2020-10-27 | 顺丰科技有限公司 | Image detection method, device and storage medium |
CN112257703A (en) * | 2020-12-24 | 2021-01-22 | 北京世纪好未来教育科技有限公司 | Image recognition method, device, equipment and readable storage medium |
CN112784125A (en) * | 2021-01-14 | 2021-05-11 | 辽宁工程技术大学 | Mode identification method and device for input information |
CN112784125B (en) * | 2021-01-14 | 2024-07-05 | 辽宁工程技术大学 | Method and device for identifying mode of input information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140820 |