CN105095826A - Character recognition method and character recognition device - Google Patents

Character recognition method and character recognition device

Info

Publication number
CN105095826A
Authority
CN
China
Prior art keywords
word
identified
alternative
posterior probability
special
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410156083.3A
Other languages
Chinese (zh)
Other versions
CN105095826B (en)
Inventor
张宇
杜志军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201410156083.3A
Publication of CN105095826A
Application granted
Publication of CN105095826B
Legal status: Active


Landscapes

  • Character Discrimination (AREA)

Abstract

The invention discloses a character recognition method and device, which are used to solve the prior-art problem that character recognition accuracy is low in special application scenarios. The method comprises: determining candidate characters for a character to be recognized; for each candidate character, using a special correction model to determine the special posterior probability that the character to be recognized is that candidate character; and recognizing the character to be recognized according to the special posterior probabilities of the candidate characters. Because the special correction model is obtained in advance from the word frequencies of words appearing in the special application scenario, characters that fit the special application scenario can be recognized accurately with the special correction model, thereby improving character recognition accuracy in special application scenarios.

Description

Character recognition method and device
Technical field
The present application relates to the field of computer technology, and in particular to a character recognition method and device.
Background art
With the development of computer technology, character recognition technology has emerged. With this technology, a device can extract the text in an image, and applying character recognition to the entry of non-digitized information can significantly improve the efficiency of entering such information. The conventional approach is to capture an image of the non-digitized information and then use character recognition technology to recognize the text in the image, so that the information can be obtained and entered. Clearly, when character recognition is used to enter non-digitized information, the accuracy of text recognition is a key factor determining the accuracy of the entered information.
The core idea of recognizing the text in an image with character recognition technology is as follows: the text to be recognized in the image is input into an optical character recognition (OCR) engine; the OCR engine extracts features of the input character to be recognized and compares the extracted features with the features of each standard character stored in advance in a template library, so as to determine the similarity between the extracted features and the features of each standard character; the standard character with the highest similarity is taken as the recognized character.
However, in practical scenarios there are many characters whose features are quite similar, for example the characters glossed here as "district" (区) and "fierce" (凶), or the pair glossed "chopping" and "stopping". Affected by the sharpness, tilt and so on of the captured image, such visually similar characters are often misrecognized. For example, text that should read "Address: Chaoyang District (朝阳区)" may well be misrecognized as "Address: Chaoyang 凶" because of image tilt. Therefore, to improve recognition accuracy, the prior art identifies the character to be recognized from the several standard characters whose features are most similar to it, in combination with a preset correction model.
Specifically, for the i-th character to be recognized in a text line, the candidate characters of this i-th character are determined. For each candidate character, according to the already recognized (i−1)-th character (i.e., the character preceding the i-th character) and the preset correction model, the posterior probability that the i-th character is this candidate character, given the (i−1)-th character, is determined, and the candidate character with the largest posterior probability is taken as the recognized i-th character.
For example, suppose the actual text in the text line extracted from the image is "Chaoyang District". The three characters in this line can be recognized in order from left to right. Suppose the first two characters ("Chao" and "yang") have already been recognized. When recognizing the third character, the standard characters most similar to its features may be determined to be "district" and "fierce", so these two characters are taken as the candidates for the third character. Since the second recognized character is "yang", the posterior probabilities P(c_{3,district} | c_{2,yang}) and P(c_{3,fierce} | c_{2,yang}) can be determined from the recognized second character "yang" and the preset correction model, where P(c_{3,district} | c_{2,yang}) denotes the posterior probability that the third character is "district" given that the second character is "yang", and P(c_{3,fierce} | c_{2,yang}) denotes the posterior probability that the third character is "fierce" given that the second character is "yang". If, according to the correction model, P(c_{3,district} | c_{2,yang}) is greater than P(c_{3,fierce} | c_{2,yang}), the candidate "district" is taken as the recognized third character.
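The following minimal sketch (an illustration, not the patent's reference implementation) shows this prior-art selection step: given the previously recognized character and a correction model expressed as a table of bigram posteriors, the candidate with the largest posterior is chosen. The romanized character names and the probability values are hypothetical stand-ins.

```python
# Hypothetical bigram posteriors P(current | previous), as would be derived
# from everyday word-frequency statistics.
BIGRAM_POSTERIOR = {
    ("yang", "qu"): 0.92,     # "yang" followed by "qu" ("district") is common
    ("yang", "xiong"): 0.08,  # "yang" followed by "xiong" ("fierce") is rare
}

def pick_candidate(prev_char, candidates):
    """Return the candidate with the highest posterior given the previous character."""
    return max(candidates, key=lambda c: BIGRAM_POSTERIOR.get((prev_char, c), 0.0))

print(pick_candidate("yang", ["qu", "xiong"]))  # -> "qu"
```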
In the prior art, however, the preset correction model is obtained by counting the word frequencies of the various words that appear in everyday life: for a given word, the higher its frequency in everyday life, the larger the posterior probability of its second character given its first character. This preset correction model may therefore be called a general correction model. For some special application scenarios, the general correction model is not suitable.
For example, suppose the (i−1)-th character in a text line has been recognized as 应 (glossed "answering" in this translation). When the i-th character is recognized, its candidates are determined to be 该 ("should") and 收 ("receive"). Because the general correction model is obtained by counting word frequencies from everyday life, where the word 应该 ("should") occurs far more often than 应收 ("receivable"), the general correction model will conclude that, given that the (i−1)-th character is 应, the posterior probability of 该 is greater than that of 收, and the i-th character will therefore be recognized as 该.
In the example above, if the text line was extracted from an image of a newspaper or other publication, the recognition result can generally be considered correct. But if the text line was extracted from an image of a receipt or similar document, such as a shopping receipt, the correct result is obviously much more likely to be 应收 ("receivable").
It can be seen that, in special application scenarios, the general correction model cannot accurately recognize characters that fit the scenario, so text recognition accuracy is low.
Summary of the invention
The embodiments of the present application provide a character recognition method and device, to solve the prior-art problem that character recognition accuracy is low in special application scenarios.
A character recognition method provided by an embodiment of the present application comprises:
determining candidate characters of a character to be recognized according to features of the character to be recognized;
for each candidate character, using a special correction model to determine, according to the already recognized character preceding the character to be recognized, the special posterior probability that the character to be recognized is that candidate character, wherein the special correction model is obtained in advance from the counted word frequencies of words appearing in a special application scenario; and
recognizing the character to be recognized according to the special posterior probabilities of the candidate characters.
A character recognition device provided by an embodiment of the present application comprises:
a candidate determination module, configured to determine candidate characters of a character to be recognized according to features of the character to be recognized;
a probability determination module, configured to, for each candidate character, use a special correction model to determine, according to the already recognized character preceding the character to be recognized, the special posterior probability that the character to be recognized is that candidate character, wherein the special correction model is obtained in advance from the counted word frequencies of words appearing in a special application scenario; and
a recognition module, configured to recognize the character to be recognized according to the special posterior probabilities of the candidate characters.
The embodiments of the present application provide a character recognition method and device. The method determines the candidate characters of a character to be recognized, uses a special correction model to determine, for each candidate character, the special posterior probability that the character to be recognized is that candidate character, and then recognizes the character according to the special posterior probabilities of the candidate characters. Because the special correction model is obtained in advance from the counted word frequencies of words appearing in a special application scenario, using it makes it possible to accurately recognize characters that fit the special application scenario, thereby improving recognition accuracy in such scenarios.
Brief description of the drawings
The accompanying drawings described herein are provided for further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their description are used to explain the application and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is the text recognition process provided by an embodiment of the present application;
Fig. 2 is a schematic structural diagram of the character recognition device provided by an embodiment of the present application.
Detailed description of the embodiments
Because the general correction model cannot accurately recognize characters that fit a special application scenario, in the embodiments of the present application the word frequencies of the words appearing in that special application scenario are counted in advance and a special correction model is obtained from them. When a character to be recognized is recognized, this special correction model is used, so as to improve recognition accuracy in that special application scenario.
To make the objectives, technical solutions and advantages of the present application clearer, the technical solutions are described clearly and completely below with reference to specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
Fig. 1 shows the text recognition process provided by an embodiment of the present application, which specifically comprises the following steps:
S101: determine the candidate characters of a character to be recognized according to its features.
In the embodiment of the present application, a recognition device may input the character to be recognized into an OCR engine. The OCR engine extracts the features of this character and compares them with the features of each standard character stored in advance in a template library, so as to determine the similarity between the features of the character to be recognized and those of each standard character; the several standard characters with the largest similarity are then taken as the candidate characters of the character to be recognized. The features of the character to be recognized described in this embodiment include, but are not limited to, its stroke features.
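A minimal sketch of this candidate-selection idea, under the assumption that the OCR engine exposes the extracted feature vector of the character and that the template library maps each standard character to a stored feature vector; cosine similarity and the top_n cutoff are illustrative choices, not taken from the patent.

```python
import numpy as np

def candidates_by_similarity(char_features, template_library, top_n=5):
    """Return the standard characters whose stored features are most similar."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = sorted(template_library.items(),
                    key=lambda item: cosine(char_features, item[1]),
                    reverse=True)
    return [char for char, _ in scored[:top_n]]
```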
Specifically, the recognition device may first extract a text line from the image and determine the character blocks in the text line (each character block contains one character). Then, following the writing order of the characters (e.g., left to right), it takes the character in each character block in turn as the character to be recognized (that is, the character contained in the current character block is the character to be recognized) and inputs it into the OCR engine for recognition.
For example, suppose the actual text contained in the extracted text line is 应收 ("receivable"). The recognition device then determines two character blocks in this text line: one containing the character 应 and one containing the character 收.
Because the usual writing order is left to right, the recognition device first recognizes the character in the block containing 应 and then recognizes the character in the block containing 收.
S102: for each candidate character, use the special correction model to determine, according to the already recognized character preceding the character to be recognized, the special posterior probability that the character to be recognized is that candidate character.
In the embodiment of the present application, the special correction model is obtained in advance from the counted word frequencies of the words appearing in the special application scenario. For example, if the special application scenario is the document (receipt) application scenario, the words appearing in that scenario can be collected in advance from a large number of shopping receipts and similar documents, the frequency of each word appearing in the scenario can be counted from them, and the special correction model can then be obtained from these frequencies. It should be noted that, even for the same word, the frequency with which it appears in ordinary scenarios may differ greatly from its frequency in the special application scenario, so the special correction model of this embodiment can differ greatly from the general correction model. Taking a word AB composed of character A followed by character B as an example: the larger the frequency of AB in the special application scenario, the larger the special posterior probability, determined with the special correction model, that the character to be recognized is B given that its preceding character is A.
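A minimal sketch of how such a special correction model could be estimated, assuming it is a character-bigram model with add-one smoothing trained on text collected from the special application scenario; the toy corpus, the smoothing choice and the function names are assumptions for illustration only.

```python
from collections import Counter

def train_special_model(domain_lines):
    """Estimate P2(current | previous) from text lines of the special scenario."""
    pair_counts = Counter()   # (previous character, current character) counts
    prev_counts = Counter()   # previous-character counts
    vocab = set()
    for line in domain_lines:
        for prev, cur in zip(line, line[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
            vocab.update((prev, cur))
    vocab_size = max(len(vocab), 1)

    def special_posterior(cur, prev):
        # Add-one smoothing keeps unseen pairs at a small nonzero probability.
        return (pair_counts[(prev, cur)] + 1) / (prev_counts[prev] + vocab_size)

    return special_posterior

# Hypothetical domain corpus drawn from receipts
p2 = train_special_model(["total amount receivable", "amount receivable: 35.00"])
```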
After the candidate characters of the character to be recognized have been determined in step S101, the special correction model is used to determine, given the already recognized preceding character, the special posterior probability that the character to be recognized is each candidate character.
Continuing the example in which the actual text in the extracted text line is 应收 ("receivable"), suppose that, by the method shown in Fig. 1, the character in the block containing 应 has been recognized as 应. For the block containing 收, the character in that block is taken as the character to be recognized and input into the OCR engine. According to the features of the input character and the stored features of the standard characters, the OCR engine determines that the standard characters most similar to the input character are 该 and 收, so the candidates of the character to be recognized are 该 and 收. The recognition device then uses the special correction model obtained from the word frequencies in the document application scenario to determine, respectively, the special posterior probabilities that the current character is 该 and 收, given that the preceding character is 应.
S103: recognize the character to be recognized according to the special posterior probabilities of the candidate characters.
Specifically, the recognition device may take the candidate character with the largest determined special posterior probability as the character to be recognized.
Continuing the example: because the special correction model is obtained from the word frequencies in the document application scenario, where 应该 ("should") occurs less often than 应收 ("receivable"), the special posterior probability that the character is 收, given the preceding character 应, is greater than the special posterior probability that it is 该. The recognition device therefore takes 收 as the recognized character.
It can be seen from the above method that, because the special correction model is obtained in advance from the counted word frequencies of the words appearing in the special application scenario, using it makes it possible to accurately recognize characters that fit the special application scenario, thereby improving recognition accuracy in such scenarios.
In the embodiment of the present application, to further improve recognition accuracy, after the candidate characters of the character to be recognized have been determined, the general posterior probability of each candidate character may be determined in addition to its special posterior probability, and the character may be recognized according to both; that is, the frequencies with which a word appears in the special application scenario and in the general application scenario are both taken into account. Specifically, before recognizing the character according to the special posterior probabilities of the candidates (i.e., before step S103 in Fig. 1), the recognition device may, for each candidate character, use the general correction model to determine, according to the already recognized preceding character, the general posterior probability that the character to be recognized is that candidate. When recognizing the character, it then does so according to both the general and the special posterior probabilities of each candidate; for example, the candidate whose product of general and special posterior probabilities is largest may be taken as the recognized character. The general correction model here is obtained by counting, in advance, the word frequencies of the various words appearing in the general application scenario.
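A short sketch of this combination step. Here p1 and p2 are assumed to be callables returning the general and special posterior of a candidate given the preceding character (for instance, the special_posterior function sketched earlier); the names are illustrative.

```python
def recognize_char(prev_char, candidates, p1, p2):
    """Pick the candidate maximizing the product of general and special posteriors."""
    return max(candidates, key=lambda c: p1(c, prev_char) * p2(c, prev_char))
```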
Again taking 应收 as the actual text in the extracted text line, suppose that, by the method shown in Fig. 1, the character in the block containing 应 has been recognized as 应. For the block containing 收, the character in that block is taken as the character to be recognized and input into the OCR engine, which, according to the features of the input character and the stored features of the standard characters, determines the candidates to be 该 and 收. The recognition device then uses the special correction model obtained from word frequencies in the document application scenario to determine the special posterior probabilities that the current character is 该 and 收 given the preceding character 应, and uses the general correction model obtained from word frequencies in the general application scenario to determine the corresponding general posterior probabilities. Finally, the candidate whose product of general and special posterior probabilities is largest is taken as the recognized character.
In the embodiment of the present application, to further improve recognition accuracy, in addition to the special posterior probability for the special application scenario and the general posterior probability for the general application scenario, the confidence with which the OCR engine recognizes the character to be recognized as a candidate character, and the reliability weight of the clause formed by all the characters recognized so far together with the candidate, may also be considered. Specifically, for each candidate character of the character to be recognized, the recognition device may determine, from the reliability weight of the clause formed by all characters already recognized before the character to be recognized, the confidence with which the character is recognized as this candidate, and the general and special posterior probabilities of this candidate, the reliability weight of the clause formed by the already recognized characters together with this candidate, given that the character to be recognized is this candidate; the candidate with the largest resulting reliability weight is then taken as the recognized character.
Specifically, suppose the character to be recognized is the t-th character and the (t−1)-th character is its preceding character. Then, for the k-th candidate determined for the character to be recognized, the reliability weight Q(t, k) of the clause formed by all characters recognized before this character together with this k-th candidate, given that the character is this k-th candidate, may be determined with the following formula (a code sketch of this recurrence follows the symbol definitions below):
Q(t, k) = log P1(c_{t,k}) + log P2(c_{t,k}) + log CF(c_{t,k}),  for t = 1
Q(t, k) = Q(t−1, j) + log P1(c_{t,k} | c_{t−1,j}) + log P2(c_{t,k} | c_{t−1,j}) + log CF(c_{t,k}),  for t > 1
In the formula above:
j indicates that the already recognized character preceding the character to be recognized (i.e., the (t−1)-th character) is the j-th candidate determined when that preceding character was recognized;
P1(c_{t,k}) denotes the general posterior probability that the character to be recognized is the k-th candidate, used when the character to be recognized is the first character;
P2(c_{t,k}) denotes the special posterior probability that the character to be recognized is the k-th candidate, used when the character to be recognized is the first character;
CF(c_{t,k}) denotes the confidence with which the character to be recognized is recognized as the k-th candidate, that is, the confidence with which the OCR engine, according to the features of the character to be recognized, recognizes it as the k-th candidate;
Q(t−1, j) denotes the reliability weight of the clause formed by all characters recognized before the character to be recognized;
P1(c_{t,k} | c_{t−1,j}) denotes the general posterior probability that the character to be recognized is the k-th candidate, used when it is not the first character;
P2(c_{t,k} | c_{t−1,j}) denotes the special posterior probability that the character to be recognized is the k-th candidate, used when it is not the first character.
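A sketch of the Q(t, k) recurrence as a greedy left-to-right pass over one text line, matching the worked example that follows. It assumes p1, p2 and cf are callables supplying the general posterior, special posterior and OCR confidence, and that p1 and p2 return the unconditional probability when the preceding character is None; all names are illustrative assumptions.

```python
import math

def decode_line(candidate_lists, p1, p2, cf):
    """candidate_lists[t] holds the candidate characters of the t-th character block."""
    recognized = []
    q_prev = 0.0        # reliability weight of the clause recognized so far
    prev_char = None
    for t, candidates in enumerate(candidate_lists):
        best_char, best_q = None, -math.inf
        for c in candidates:
            q = ((0.0 if t == 0 else q_prev)        # t = 1 branch vs. t > 1 branch
                 + math.log(p1(c, prev_char))
                 + math.log(p2(c, prev_char))
                 + math.log(cf(c)))
            if q > best_q:
                best_char, best_q = c, q
        recognized.append(best_char)
        prev_char, q_prev = best_char, best_q
    return "".join(recognized), q_prev   # recognized text and its reliability weight
```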
For example, suppose the actual text in the extracted text line is 共计应收金额 ("total amount receivable"), and six character blocks are determined in this line, containing 共, 计, 应, 收, 金 and 额 respectively. Following the normal writing order, i.e., left to right, the character in the block containing 共 is first taken as the character to be recognized. The recognition device first determines its candidates; suppose they are 共 and 兴. Because the current character is the first character, the recognition device uses the general correction model and the special correction model to determine the general posterior probability P1(c_{1,1}) and the special posterior probability P2(c_{1,1}) of the candidate 共, and the confidence CF(c_{1,1}) with which the character is recognized as 共.
It should be noted that, because the current character is the first character, the general posterior probability P1(c_{1,1}) actually means the probability, in the general application scenario, of a word containing the candidate 共 occurring; similarly, the special posterior probability P2(c_{1,1}) means the probability, in the special application scenario, of a word containing the candidate 共 occurring.
After the general posterior probability P1(c_{1,1}), the special posterior probability P2(c_{1,1}) and the confidence CF(c_{1,1}) have been determined for the candidate 共, the reliability weight Q(1, 1) of the clause formed by the candidate 共 alone can be determined with the formula Q(1, 1) = log P1(c_{1,1}) + log P2(c_{1,1}) + log CF(c_{1,1}).
Similarly, for the other candidate 兴, the reliability weight Q(1, 2) of the clause formed by 兴 alone is determined. Then, according to Q(1, 1) and Q(1, 2), the candidate corresponding to the larger reliability weight is selected as the recognized character. Supposing Q(1, 1) is greater than Q(1, 2), the candidate 共 is taken as the current character.
After the first character has been recognized, recognition proceeds to the second character block (the block containing 计); that is, the current character to be recognized is the second character. Specifically, suppose its candidates are determined to be 计 and 什. The general correction model and the special correction model are used to determine the general posterior probability P1(c_{2,1} | c_{1,1}) and the special posterior probability P2(c_{2,1} | c_{1,1}) of the candidate 计, and the confidence CF(c_{2,1}) with which the character is recognized as 计. The reliability weight of the clause formed by the second character and the already recognized first character 共, given that the second character is 计, is then Q(2, 1) = Q(1, 1) + log P1(c_{2,1} | c_{1,1}) + log P2(c_{2,1} | c_{1,1}) + log CF(c_{2,1}). Similarly, for the second candidate 什, the reliability weight of the clause formed by the second character and the recognized first character 共, given that the second character is 什, is Q(2, 2) = Q(1, 1) + log P1(c_{2,2} | c_{1,1}) + log P2(c_{2,2} | c_{1,1}) + log CF(c_{2,2}). The second character is then recognized according to which of Q(2, 1) and Q(2, 2) is larger. By analogy, the six characters 共计应收金额 in the text line are recognized in turn.
It can be seen from the above process that, after the last character in a text line has been recognized, the reliability weight of the clause formed by all recognized characters in that line is obtained. For the same text line, different character-block determination methods may yield different character blocks. Therefore, text recognition can be carried out as above for the character blocks determined by each character-block determination method, yielding, for each method, the reliability weight of the clause formed by all recognized characters in the line; the clause with the largest reliability weight is then selected as the recognized text line.
For example, for the text line whose actual text is 共计应收金额, one character-block determination method yields six blocks, containing 共, 计, 应, 收, 金 and 额 respectively, while another method may yield seven blocks, containing 共, 讠, 十, 应, 收, 金 and 额 respectively; that is, with the second method the character 计 is mistakenly split into two characters (讠 and 十). In this case, the characters in the line can be recognized using the six blocks determined by the first method, finally yielding the reliability weight Q(6, k) of the clause formed by the six recognized characters, and again using the seven blocks determined by the second method, finally yielding the reliability weight Q(7, k) of the clause formed by the seven recognized characters. Q(6, k) and Q(7, k) are then compared, and the clause corresponding to the larger reliability weight is selected as the recognized text of the line. Supposing Q(6, k) is greater than Q(7, k), the result obtained by recognizing with the six blocks of the first method is taken as the recognized text of this line.
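Building on the decode_line sketch above, comparing segmentations then amounts to decoding the candidate lists produced by each character-block determination method and keeping the result with the largest final reliability weight; the inputs are hypothetical.

```python
def best_segmentation(segmentations, p1, p2, cf):
    """segmentations: one candidate_lists per character-block determination method."""
    decoded = [decode_line(cands, p1, p2, cf) for cands in segmentations]
    return max(decoded, key=lambda result: result[1])   # result = (text, weight)
```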
In addition, the reason the reliability weight of a clause is determined by summing logarithms in the example above is as follows: for a candidate whose special or general posterior probability is very small, multiplying that probability into the reliability weight of the clause formed by the previously recognized characters would greatly reduce the clause's reliability weight, whereas summing logarithms avoids this problem, makes the recognition result closer to reality, and can further improve recognition accuracy.
For example, when recognizing the character 应 in 共计应收金额 above, the previously recognized character is 计, and the word 计应 has a very small frequency both in the general application scenario and in the document application scenario, so both the general and the special posterior probabilities that the current character is 应 given the preceding character 计 are very small. If the general and special posterior probabilities of 应 were directly multiplied into the reliability weight of the preceding clause 共计, the result would be very small, which clearly does not match the actual situation. Therefore, the present application determines the reliability weight of a clause by summing logarithms, which makes the calculated result more realistic and improves recognition accuracy.
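As a side note not stated in the patent, working with sums of logarithms also avoids floating-point underflow when many small probabilities would otherwise be multiplied together; the probability values below are arbitrary.

```python
import math

probs = [1e-6] * 60
product = 1.0
for p in probs:
    product *= p
print(product)                            # 0.0 after floating-point underflow
print(sum(math.log(p) for p in probs))    # about -829.0, still usable for comparison
```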
In addition, considering that many easily confusable characters exist in practical scenarios, such as the digit "2" and the letter "Z", the digit "5" and the letter "S", and the digit "0" and the letter "O", these confusable characters and confusable-character sets may also be preset in the recognition device in the embodiment of the present application, to further improve recognition accuracy.
Specifically, the confusable characters may be the digit "2", the letter "Z", the digit "5", the letter "S", the digit "0" and the letter "O", and the confusable-character sets may be the set formed by the digit "2" and the letter "Z", the set formed by the digit "5" and the letter "S", and the set formed by the digit "0" and the letter "O".
That is, each character in a confusable-character set is a confusable character, and for a given confusable character, the other characters in its confusable-character set are characters that look similar to it but have a different character type. The character types described in this application include, but are not limited to, digits, English letters and Chinese characters.
After recognizing all the characters to be recognized, the recognition device may judge whether a preset confusable character exists among the recognized characters. If a confusable character exists, it determines the confusable-character set to which that character belongs, selects from the character types the type that meets a specified condition, and adjusts the confusable character to the character in its confusable-character set that belongs to the type meeting the specified condition. Here, for a candidate character type, if the number of recognized characters belonging to that type is the largest, that type is the character type meeting the specified condition.
For example, suppose all the recognized characters are "1234S678". The recognition device determines that "S" among the recognized characters is a confusable character, determines that the confusable-character set containing "S" is {5, S}, and finds that seven of the recognized characters have the character type "digit", i.e., the number of recognized characters whose type is "digit" is the largest, so "digit" is the character type meeting the specified condition. The confusable character "S" is therefore adjusted to the character of type "digit" in its confusable-character set, namely the digit "5", and the recognized text after adjustment is "12345678".
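A sketch of this confusable-character adjustment. The confusable sets follow the examples in the text, while the character-type classifier and helper names are assumptions for illustration.

```python
from collections import Counter

CONFUSABLE_SETS = [{"2", "Z"}, {"5", "S"}, {"0", "O"}]

def char_type(ch):
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "other"

def adjust_confusables(recognized):
    # The "specified condition": the character type occurring most often among the
    # recognized characters; each confusable character is then mapped to the member
    # of its set that has this majority type.
    majority_type = Counter(char_type(c) for c in recognized).most_common(1)[0][0]
    out = []
    for ch in recognized:
        target = ch
        for group in CONFUSABLE_SETS:
            if ch in group and char_type(ch) != majority_type:
                target = next((c for c in group if char_type(c) == majority_type), ch)
        out.append(target)
    return "".join(out)

print(adjust_confusables("1234S678"))  # -> "12345678"
```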
Further, considering that the recognition device may store multiple special correction models, a correspondence between each special correction model and specific words may be preset. When recognizing the characters in an image, because text lines are normally ordered from top to bottom, the recognition device may, after extracting the text lines from the image, first recognize the characters in the topmost text line according to the general correction model, then select the special correction model corresponding to a word formed by the recognized characters, and use the selected special correction model when recognizing the characters in the subsequent text lines.
For example, suppose the specific words corresponding to the preset special correction model for the document application scenario are "document", "receipt", "bill" and the like. When recognizing the characters in an image, the characters in the topmost text line of the image are first recognized with the general correction model, and the special correction model corresponding to a word formed by the recognized characters is determined. If the recognized characters form the word "document", the recognition device selects the stored special correction model corresponding to "document" (i.e., the special correction model for the document application scenario) to recognize the characters in the subsequent text lines of the image.
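A sketch of selecting a special correction model from the topmost recognized text line; the keyword-to-model mapping and the placeholder model names are assumptions used only to illustrate the lookup.

```python
SPECIAL_MODELS = {
    "receipt": "receipt_correction_model",     # placeholders for trained models
    "bill": "bill_correction_model",
    "document": "document_correction_model",
}

def select_special_model(top_line_text, default_model=None):
    for keyword, model in SPECIAL_MODELS.items():
        if keyword in top_line_text:
            return model
    return default_model   # e.g. keep using only the general correction model

print(select_special_model("cash receipt no. 0517"))  # -> "receipt_correction_model"
```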
The above is the character recognition method provided by the embodiments of the present application. Based on the same idea, the embodiments of the present application also provide a character recognition device, as shown in Fig. 2.
Fig. 2 is a schematic structural diagram of the character recognition device provided by an embodiment of the present application, which specifically comprises:
a candidate determination module 201, configured to determine candidate characters of a character to be recognized according to features of the character to be recognized;
a probability determination module 202, configured to, for each candidate character, use a special correction model to determine, according to the already recognized character preceding the character to be recognized, the special posterior probability that the character to be recognized is that candidate character, wherein the special correction model is obtained in advance from the counted word frequencies of words appearing in a special application scenario; and
a recognition module 203, configured to recognize the character to be recognized according to the special posterior probabilities of the candidate characters.
The probability determination module 202 is further configured to, before the recognition module 203 recognizes the character to be recognized according to the special posterior probabilities of the candidates, use a general correction model to determine, for each candidate character and according to the already recognized preceding character, the general posterior probability that the character to be recognized is that candidate.
The recognition module 203 is specifically configured to recognize the character to be recognized according to the general posterior probability and the special posterior probability of each candidate character.
The recognition module 203 is specifically configured to, for each candidate character, determine, from the reliability weight of the clause formed by all characters already recognized before the character to be recognized, the confidence with which the character is recognized as this candidate, and the general and special posterior probabilities of this candidate, the reliability weight of the clause formed by the already recognized characters together with this candidate, given that the character to be recognized is this candidate, and to take the candidate with the largest determined reliability weight as the recognized character.
The recognition module 203 is specifically configured to determine, with the following formula, the reliability weight Q(t, k) of the clause formed by all characters recognized before the character to be recognized together with this candidate, given that the character to be recognized is this candidate:
Q(t, k) = log P1(c_{t,k}) + log P2(c_{t,k}) + log CF(c_{t,k}),  for t = 1
Q(t, k) = Q(t−1, j) + log P1(c_{t,k} | c_{t−1,j}) + log P2(c_{t,k} | c_{t−1,j}) + log CF(c_{t,k}),  for t > 1
wherein t indicates that the character to be recognized is the t-th character;
t−1 indicates that the preceding character of the character to be recognized is the (t−1)-th character;
k indicates the k-th candidate determined for the character to be recognized;
j indicates that the already recognized preceding character of the character to be recognized is the j-th candidate determined when that preceding character was recognized;
P1(c_{t,k}) denotes the general posterior probability that the character to be recognized is the k-th candidate, used when the character to be recognized is the first character;
P2(c_{t,k}) denotes the special posterior probability that the character to be recognized is the k-th candidate, used when the character to be recognized is the first character;
CF(c_{t,k}) denotes the confidence with which the character to be recognized is recognized as the k-th candidate;
Q(t−1, j) denotes the reliability weight of the clause formed by all characters recognized before the character to be recognized;
P1(c_{t,k} | c_{t−1,j}) denotes the general posterior probability that the character to be recognized is the k-th candidate, used when it is not the first character;
P2(c_{t,k} | c_{t−1,j}) denotes the special posterior probability that the character to be recognized is the k-th candidate, used when it is not the first character.
The device further comprises:
a correction module 204, configured to, when a preset confusable character exists among the recognized characters, determine the confusable-character set to which the confusable character belongs, wherein the character types of the confusable characters in the set are different from one another; to select, from the character types, the character type meeting a specified condition, wherein, for a candidate character type, if the number of recognized characters belonging to that type is the largest, that type is the character type meeting the specified condition; and to adjust the confusable character to the character in the confusable-character set that belongs to the character type meeting the specified condition.
The embodiments of the present application provide a character recognition method and device. The method determines the candidate characters of a character to be recognized, uses a special correction model to determine, for each candidate character, the special posterior probability that the character to be recognized is that candidate character, and then recognizes the character according to the special posterior probabilities of the candidate characters. Because the special correction model is obtained in advance from the counted word frequencies of words appearing in a special application scenario, using it makes it possible to accurately recognize characters that fit the special application scenario, thereby improving recognition accuracy in such scenarios.
In a typical configuration, a computing device comprises one or more processors (CPUs), input/output interfaces, network interfaces and memory.
The memory may include volatile memory in a computer-readable medium, in the form of random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device comprising that element.
Those skilled in the art will understand that the embodiments of the present application may be provided as a method, a system or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM and optical storage) containing computer-usable program code.
The above are only embodiments of the present application and are not intended to limit it. Various modifications and variations of the present application are possible for those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present application shall be included within the scope of the claims of the present application.

Claims (10)

1. A character recognition method, comprising:
determining candidate characters of a character to be recognized according to features of the character to be recognized;
for each candidate character, using a special correction model to determine, according to an already recognized character preceding the character to be recognized, a special posterior probability that the character to be recognized is the candidate character, wherein the special correction model is obtained in advance from counted word frequencies of words appearing in a special application scenario; and
recognizing the character to be recognized according to the special posterior probabilities of the candidate characters.
2. The method according to claim 1, wherein, before the character to be recognized is recognized according to the special posterior probabilities of the candidate characters, the method further comprises:
for each candidate character, using a general correction model to determine, according to the already recognized character preceding the character to be recognized, a general posterior probability that the character to be recognized is the candidate character;
and wherein recognizing the character to be recognized according to the special posterior probabilities of the candidate characters specifically comprises:
recognizing the character to be recognized according to the general posterior probability and the special posterior probability of each candidate character.
3. The method according to claim 2, wherein recognizing the character to be recognized according to the general posterior probability and the special posterior probability of each candidate character specifically comprises:
for each candidate character, determining, according to a reliability weight of a clause formed by all characters recognized before the character to be recognized, a confidence with which the character to be recognized is recognized as the candidate character, and the general posterior probability and the special posterior probability of the candidate character, a reliability weight of a clause formed by the characters recognized before the character to be recognized together with the candidate character, given that the character to be recognized is the candidate character; and
taking the candidate character with the largest determined reliability weight as the recognized character.
4. The method according to claim 3, wherein the reliability weight Q(t, k) of the clause formed by all characters recognized before the character to be recognized together with the candidate character, given that the character to be recognized is the candidate character, is determined with the formula:
Q(t, k) = log P1(c_{t,k}) + log P2(c_{t,k}) + log CF(c_{t,k}),  for t = 1
Q(t, k) = Q(t−1, j) + log P1(c_{t,k} | c_{t−1,j}) + log P2(c_{t,k} | c_{t−1,j}) + log CF(c_{t,k}),  for t > 1
wherein t indicates that the character to be recognized is the t-th character;
t−1 indicates that the preceding character of the character to be recognized is the (t−1)-th character;
k indicates the k-th candidate character determined for the character to be recognized;
j indicates that the already recognized preceding character of the character to be recognized is the j-th candidate character determined when that preceding character was recognized;
P1(c_{t,k}) denotes the general posterior probability that the character to be recognized is the k-th candidate character, used when the character to be recognized is the first character;
P2(c_{t,k}) denotes the special posterior probability that the character to be recognized is the k-th candidate character, used when the character to be recognized is the first character;
CF(c_{t,k}) denotes the confidence with which the character to be recognized is recognized as the k-th candidate character;
Q(t−1, j) denotes the reliability weight of the clause formed by all characters recognized before the character to be recognized;
P1(c_{t,k} | c_{t−1,j}) denotes the general posterior probability that the character to be recognized is the k-th candidate character, used when the character to be recognized is not the first character;
P2(c_{t,k} | c_{t−1,j}) denotes the special posterior probability that the character to be recognized is the k-th candidate character, used when the character to be recognized is not the first character.
5. The method according to claim 1, wherein the method further comprises:
when a preset confusable character exists among the recognized characters, determining the confusable-character set to which the confusable character belongs, wherein the character types of the confusable characters in the set are different from one another;
selecting, from the character types, a character type meeting a specified condition, wherein, for a candidate character type, if the number of recognized characters belonging to that character type is the largest, that character type is the character type meeting the specified condition; and
adjusting the confusable character to the character in the confusable-character set that belongs to the character type meeting the specified condition.
6. A character recognition device, comprising:
a candidate determination module, configured to determine candidate characters of a character to be recognized according to features of the character to be recognized;
a probability determination module, configured to, for each candidate character, use a special correction model to determine, according to an already recognized character preceding the character to be recognized, a special posterior probability that the character to be recognized is the candidate character, wherein the special correction model is obtained in advance from counted word frequencies of words appearing in a special application scenario; and
a recognition module, configured to recognize the character to be recognized according to the special posterior probabilities of the candidate characters.
7. The device according to claim 6, wherein the probability determination module is further configured to, before the recognition module recognizes the character to be recognized according to the special posterior probabilities of the candidate characters, use a general correction model to determine, for each candidate character and according to the already recognized character preceding the character to be recognized, a general posterior probability that the character to be recognized is the candidate character;
and the recognition module is specifically configured to recognize the character to be recognized according to the general posterior probability and the special posterior probability of each candidate character.
8. The device according to claim 7, wherein the recognition module is specifically configured to, for each candidate character, determine, according to a reliability weight of a clause formed by all characters recognized before the character to be recognized, a confidence with which the character to be recognized is recognized as the candidate character, and the general posterior probability and the special posterior probability of the candidate character, a reliability weight of a clause formed by the characters recognized before the character to be recognized together with the candidate character, given that the character to be recognized is the candidate character, and to take the candidate character with the largest determined reliability weight as the recognized character.
9. device as claimed in claim 8, is characterized in that, described identification module specifically for, adopt formula Q ( t , k ) = log P 1 ( c t , k ) + log P 2 ( c t , k ) + log CF ( c t , k ) ; t = 1 Q ( t - 1 , j ) + log P 1 ( c t , k | c t - 1 , j ) + log P 2 ( c t , k | c t - 1 , j ) + log CF ( c t , k ) ; t > 1 Determine under described word to be identified is the condition of this alternative word, be positioned at all reliability weight Q (t, k) having identified the clause that word and this alternative word are formed before described word to be identified;
wherein t denotes that the word to be identified is the t-th word;
t-1 denotes that the word preceding the word to be identified is the (t-1)-th word;
k denotes the k-th alternative word determined for the word to be identified;
j denotes that the identified word preceding the word to be identified is the j-th alternative word determined for that preceding word when it was identified;
$P_1(c_{t,k})$ denotes the general posterior probability that the word to be identified is the k-th alternative word, when the word to be identified is the first word;
$P_2(c_{t,k})$ denotes the special posterior probability that the word to be identified is the k-th alternative word, when the word to be identified is the first word;
$CF(c_{t,k})$ denotes the confidence that the word to be identified is recognized as the k-th alternative word;
$Q(t-1,j)$ denotes the reliability weight of the clause formed by all identified words located before the word to be identified;
$P_1(c_{t,k}\mid c_{t-1,j})$ denotes the general posterior probability that the word to be identified is the k-th alternative word, when the word to be identified is not the first word;
$P_2(c_{t,k}\mid c_{t-1,j})$ denotes the special posterior probability that the word to be identified is the k-th alternative word, when the word to be identified is not the first word.
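Read as an algorithm, claims 8 and 9 describe a greedy, left-to-right decode: at each position, the candidate that maximizes Q(t, k) (the log general posterior, log special posterior, and log confidence, plus the weight of the already-decoded prefix clause) is kept as the identified word. The Python sketch below mirrors that recurrence; the log floor, the callback signatures, and the function name recognize are assumptions made for illustration rather than the patent's implementation.

```python
import math

def _log(p):
    # Floor avoids log(0) for unseen candidates; an implementation choice,
    # not something specified by the claim.
    return math.log(max(p, 1e-12))

def recognize(candidates, confidence, p_general, p_special):
    """Greedy left-to-right decoding of the recurrence Q(t, k) in claim 9.

    candidates[t]      -- candidate characters for position t
    confidence(t, c)   -- recognizer confidence CF for candidate c at position t
    p_general(c, prev) -- general posterior P1 (prev is None at the first position)
    p_special(c, prev) -- special posterior P2 (prev is None at the first position)
    """
    result, q_prev, prev_char = [], 0.0, None
    for t, cand_list in enumerate(candidates):
        best_char, best_q = None, -math.inf
        for c in cand_list:
            q = (_log(p_general(c, prev_char))
                 + _log(p_special(c, prev_char))
                 + _log(confidence(t, c)))
            if t > 0:
                q += q_prev      # Q(t-1, j) of the already-decoded prefix clause
            if q > best_q:
                best_char, best_q = c, q
        result.append(best_char)
        q_prev, prev_char = best_q, best_char
    return "".join(result)
```

Because Q(t-1, j) is the weight of a prefix that has already been fixed, it is identical for every candidate at position t, so the choice at each step is driven by the two posteriors and the recognizer confidence, exactly as in the t = 1 case of the formula.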
10. The device according to claim 6, characterized in that the device further comprises:
a correction module, configured to: when a preset easy gibberish appears among the identified words, determine the easy gibberish set to which the easy gibberish belongs, wherein the literal types of the easy gibberish in the easy gibberish set are different from one another; select, from the literal types, the literal type meeting the specified requirements, wherein, for a literal type to be determined, if the number of words belonging to this literal type among the identified words is the largest, this literal type is the literal type meeting the specified requirements; and adjust the easy gibberish to the word in the easy gibberish set that belongs to the literal type meeting the specified requirements.
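Claim 10's correction module can be pictured as a small post-processing pass: "easy gibberish" in the translation corresponds to an easily-confused character, its "easy gibberish set" to a group of look-alike characters of different literal types (e.g. letter, digit, Chinese character), and the "specified requirements" to picking the literal type that dominates the already-identified words. The following sketch is one possible illustration under those assumptions; the confusion sets, the char_type classifier, and the function name correct are hypothetical.

```python
from collections import Counter

# Hypothetical confusion sets: each group contains look-alike characters
# of different literal types (letter vs. digit).
CONFUSION_SETS = [{"O", "0"}, {"l", "1"}, {"B", "8"}]

def char_type(ch):
    """Crude literal-type classifier: digit, Chinese character, or other/letter."""
    if ch.isdigit():
        return "digit"
    if "\u4e00" <= ch <= "\u9fff":
        return "chinese"
    return "letter"

def correct(recognized):
    """Replace easily-confused characters with the member of their
    confusion set whose literal type matches the dominant type of the text."""
    dominant_type, _ = Counter(char_type(c) for c in recognized).most_common(1)[0]
    out = []
    for ch in recognized:
        replacement = ch
        for group in CONFUSION_SETS:
            if ch in group:
                match = [c for c in group if char_type(c) == dominant_type]
                if match:
                    replacement = match[0]
                break
        out.append(replacement)
    return "".join(out)

print(correct("2O14"))   # -> "2014"
```

In the example, digits dominate the recognized string "2O14", so the letter "O" is swapped for the digit "0" from its confusion set.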
CN201410156083.3A 2014-04-17 2014-04-17 A kind of character recognition method and device Active CN105095826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410156083.3A CN105095826B (en) 2014-04-17 2014-04-17 A kind of character recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410156083.3A CN105095826B (en) 2014-04-17 2014-04-17 A kind of character recognition method and device

Publications (2)

Publication Number Publication Date
CN105095826A true CN105095826A (en) 2015-11-25
CN105095826B CN105095826B (en) 2019-10-01

Family

ID=54576222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410156083.3A Active CN105095826B (en) 2014-04-17 2014-04-17 A kind of character recognition method and device

Country Status (1)

Country Link
CN (1) CN105095826B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710855A (en) * 2018-05-22 2018-10-26 山西同方知网数字出版技术有限公司 A kind of Text region editing method
CN111144100A (en) * 2019-12-24 2020-05-12 五八有限公司 Question text recognition method and device, electronic equipment and storage medium
CN111444906A (en) * 2020-03-24 2020-07-24 腾讯科技(深圳)有限公司 Image recognition method based on artificial intelligence and related device
CN114078254A (en) * 2022-01-07 2022-02-22 华中科技大学同济医学院附属协和医院 Intelligent data acquisition system based on robot

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1236458A (en) * 1997-06-06 1999-11-24 微软公司 Reducing handwriting recognizer errors using decision trees
CN101802812A (en) * 2007-08-01 2010-08-11 金格软件有限公司 Automatic context sensitive language correction and enhancement using an internet corpus
CN101149806B (en) * 2006-09-19 2012-09-05 北京三星通信技术研究有限公司 Method and device for hand writing identification post treatment using context information
CN102663454A (en) * 2012-04-18 2012-09-12 安徽科大讯飞信息科技股份有限公司 Method and device for evaluating character writing standard degree
CN102890783A (en) * 2011-07-20 2013-01-23 富士通株式会社 Method and device for recognizing direction of character in image block

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1236458A (en) * 1997-06-06 1999-11-24 微软公司 Reducing handwriting recognizer errors using decision trees
CN101149806B (en) * 2006-09-19 2012-09-05 北京三星通信技术研究有限公司 Method and device for hand writing identification post treatment using context information
CN101802812A (en) * 2007-08-01 2010-08-11 金格软件有限公司 Automatic context sensitive language correction and enhancement using an internet corpus
CN102890783A (en) * 2011-07-20 2013-01-23 富士通株式会社 Method and device for recognizing direction of character in image block
CN102663454A (en) * 2012-04-18 2012-09-12 安徽科大讯飞信息科技股份有限公司 Method and device for evaluating character writing standard degree

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ding Xiaoqing: "A Review of Research on Chinese Character Recognition", Acta Electronica Sinica *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710855A (en) * 2018-05-22 2018-10-26 山西同方知网数字出版技术有限公司 A kind of Text region editing method
CN111144100A (en) * 2019-12-24 2020-05-12 五八有限公司 Question text recognition method and device, electronic equipment and storage medium
CN111144100B (en) * 2019-12-24 2023-08-18 五八有限公司 Question text recognition method and device, electronic equipment and storage medium
CN111444906A (en) * 2020-03-24 2020-07-24 腾讯科技(深圳)有限公司 Image recognition method based on artificial intelligence and related device
CN111444906B (en) * 2020-03-24 2023-09-29 腾讯科技(深圳)有限公司 Image recognition method and related device based on artificial intelligence
CN114078254A (en) * 2022-01-07 2022-02-22 华中科技大学同济医学院附属协和医院 Intelligent data acquisition system based on robot

Also Published As

Publication number Publication date
CN105095826B (en) 2019-10-01

Similar Documents

Publication Publication Date Title
US11062043B2 (en) Database entity sensitivity classification
CN110765770A (en) Automatic contract generation method and device
US11776248B2 (en) Systems and methods for automated document image orientation correction
US20190392038A1 (en) Methods, devices and systems for data augmentation to improve fraud detection
US11567976B2 (en) Detecting relationships across data columns
CN109597983B (en) Spelling error correction method and device
CN109299269A (en) A kind of file classification method and device
CN103577989A (en) Method and system for information classification based on product identification
CN103942223A (en) Method and system for conducting online error correction on language model
US11663206B2 (en) Detecting relationships across data columns
CN105095826A (en) Character recognition method and character recognition device
US11328001B2 (en) Efficient matching of data fields in response to database requests
US20230138491A1 (en) Continuous learning for document processing and analysis
US20230061731A1 (en) Significance-based prediction from unstructured text
US11163963B2 (en) Natural language processing using hybrid document embedding
CN107861950A (en) The detection method and device of abnormal text
CN112765976A (en) Text similarity calculation method, device and equipment and storage medium
US20230385557A1 (en) Natural language processing techniques for machine-learning-guided summarization using hybrid class templates
US20230062114A1 (en) Machine learning techniques for efficient data pattern recognition across databases
EP4085343A1 (en) Domain based text extraction
CN103870822A (en) Word identification method and device
US11948378B2 (en) Machine learning techniques for determining predicted similarity scores for input sequences
US20240062003A1 (en) Machine learning techniques for generating semantic table representations using a token-wise entity type classification mechanism
US11714789B2 (en) Performing cross-dataset field integration
US20230066906A1 (en) Techniques for digital document analysis using document image fingerprinting

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20191211

Address after: P.O. Box 31119, Grand Pavilion, Hibiscus Way, 802 West Bay Road, Grand Cayman, KY1-1205, Cayman Islands

Patentee after: Advanced New Technologies Co., Ltd.

Address before: Fourth Floor, One Capital Place, P.O. Box 847, Grand Cayman, Cayman Islands

Patentee before: Alibaba Group Holding Co., Ltd.