CN100550040C - Optical character recognition method and equipment and character recognition method and equipment - Google Patents

Info

Publication number
CN100550040C
Authority
CN
China
Prior art keywords
speech
font
candidate
information
centering
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2005100228818A
Other languages
Chinese (zh)
Other versions
CN1979529A (en)
Inventor
伊晓晶
谢文俊
李献
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc

Landscapes

  • Character Discrimination (AREA)

Abstract

The present invention relates to an optical character recognition method and apparatus. In one aspect, the invention provides an optical font recognition method and apparatus that divide the words of an input text line image into word pairs, recognize the longer word and the shorter word of each word pair respectively, adjust the font information of the words through a fine adjustment step based on the font information of adjacent words and through a coarse adjustment step that adjusts the font information of the line according to the distribution of font information in the line, and additionally recognize the font size of the words in the line. The invention also provides a font recognition method and apparatus, and a method and apparatus that discriminate the interval type by combining a projection method with a connected-component method and that calculate the x-height based on a classification process.

Description

Optical character recognition method and equipment and character recognition method and equipment
Technical field
The present invention relates to an optical character recognition method and apparatus, and more particularly to a method and apparatus for recognizing the font type of characters in an optical character recognition system.
Background art
Optical character recognition (OCR) systems are widely used. Font information such as typeface, slope, weight and font size has been used in conventional OCR systems to improve OCR performance; font information also helps document structure analysis and information recovery.
Two approaches to font recognition are currently available:
- Extracting global features from text entities (words, lines, paragraphs). This approach suits a priori font recognition, in which the font is recognized without any knowledge of the letter classes.
- Extracting local features from individual letters. This approach can generally benefit from knowledge of the letter classes.
In particular, US 6,337,924 and US 6,496,600 disclose a posteriori font recognition. The character class (code) is known first, so local features can be extracted to determine the font of the given character. These features are based on letter details, such as the shape of the serifs (arched, square, triangular, etc.) and the alternative forms of particular letters such as g and a, or the character image can be compared directly with character image templates of different fonts. This is, however, a quite different kind of font recognition, one that can use letter-class knowledge. Such a method can generally benefit from knowledge of the letter classes and can therefore achieve higher accuracy, but it does not help in predicting font information for omni-font OCR.
CN 1271140A (Chinese application No. 99105851.8) and "Optical Font Recognition Using Typographical Features" by Abdelwahab Zramdini and Rolf Ingold (IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 8, August 1998) disclose English font recognition based on text blocks or lines. Global features are extracted to determine the font information of a whole text block or line. These features are ones generally detected by non-experts in typography (such as text density, size, orientation and spacing of the letters, serifs, etc.). However, different words in one line can be in different fonts (typeface, size, weight, slope, etc.), for example in headings and emphasized text. This approach generally extracts features from a large body of text (a text block, or at least one line) and cannot recognize the font information of a single word.
"Optical font recognition from projection profiles" (Electronic Publishing, Vol. 6(3), 249-260, September 1993) and "Script identification in printed bilingual documents" by D. Dhanya, A. G. Ramakrishnan and Peeta Basa Pati (Sadhana, Vol. 27, Part 1, February 2002, pp. 73-82, printed in India) disclose word-based English font recognition. The font features are extracted from each word image, so the different font information of each word in a text line can be recognized. However, other English word OFR processes usually do not use x-height or interval type information in font recognition, although in an ideal English word OFR the word x-height is very useful information. A word image can be normalized by its x-height and the font features extracted from the normalized image, which ensures that the font features of different words are extracted at the same scale and avoids the influence of font size. The word interval type is also information needed for font recognition. In general, English words of different interval types can have different font features, because different characters (strokes) are combined in different intervals. Word x-height and interval type are also used for font size recognition. Other font recognition processes generally have no mature scheme that ensures accurate calculation of the x-height and the word interval type; they usually obtain line information by a simple projection histogram method.
In addition, the font may be changed at particular points in a document for emphasis or to catch the reader's attention. This can be done by choosing another typeface, by changing the style (such as weight or slope), or by choosing a different size of the same typeface. In the earlier font recognition methods described above, font features are extracted either from text blocks or lines, or from single words. Clearly, for a text line or block in a single font, the first approach obtains the font information easily, but it cannot distinguish the font information of single words, although different words in one line can have different fonts (typeface, font size, weight, slope, etc.). The second approach can distinguish the different font information of each word in a text line, but it cannot use contextual font information, although the fonts of neighboring words are usually not entirely unrelated. Before any OFR processing is performed, special consideration must be given to the size of the region to be processed. If the region is too small, the information it contains may be insufficient for classification; if it is too large, the chance that different types are mixed in the same region is too high.
Summary of the invention
In view of the above defects of the prior art, there is a need for a novel font recognition method that eliminates those defects and can accurately discriminate the font information (typeface, font size, serif, weight, slope and spacing) of each word in an English text line.
There is also a need for a novel font recognition method that can recognize far more fonts than common OCR software with document layout restoration functions (such as Omnipage, FineReader, etc.).
According to a first aspect, the invention provides an optical font recognition method comprising the following steps:
a dividing step of dividing the words of an input text line image into word pairs;
a font recognition step of recognizing the font information of the longer word in each word pair;
a font discrimination step of discriminating the font information of the shorter word in each word pair, based on the font information of the longer word in the word pair containing the shorter word and the font information of the longer word in the word pair adjacent to the shorter word.
Preferably, the method further comprises a fine adjustment step of adjusting the font information of a word according to the font information of adjacent words.
The optical font recognition method further comprises a recognition step of recognizing the font size of the words in the line.
This optical font recognition method adopts a word-pair mechanism in the OFR processing of a line image. The mechanism is based on a font classification method for English words and also takes into account the characteristics of the font distribution in real English text. It uses a two-stage result adjustment technique based on contextual font information, which achieves higher accuracy and a more uniform output within a text line.
Preferably, the font discrimination step of discriminating the font information of the shorter word in each word pair comprises:
comparing the font information of the longer word in the word pair adjacent to the shorter word to be discriminated with the font information of the longer word in the word pair containing that shorter word; and
if the two are identical, determining that the shorter word has the same font information as the longer word of its own word pair and the longer word of the adjacent word pair.
Preferably, if the font information of the longer word in the word pair adjacent to the shorter word differs from the font information of the longer word in the word pair containing the shorter word, the font information of the shorter word is itself recognized.
Preferably, the font discrimination step further comprises recognizing the font information of the shorter word itself if the shorter word of a word pair is the first or last word in the line.
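The word-pair mechanism can be illustrated with a minimal Python sketch. Several points here are assumptions rather than details given in the text: words are paired consecutively (0-1, 2-3, ...), "longer" is measured by character count, the "adjacent pair" is the neighboring pair on the shorter word's side, and `recognize` is a hypothetical stand-in for the full font classifier, which would really operate on word images.

```python
from typing import Callable, List, Optional


def pair_words(count: int) -> List[List[int]]:
    """Group word indices 0..count-1 into consecutive pairs (assumed pairing rule)."""
    return [list(range(i, min(i + 2, count))) for i in range(0, count, 2)]


def recognize_line_fonts(words: List[str],
                         recognize: Callable[[str], str]) -> List[Optional[str]]:
    """Recognize longer words first, then infer or recognize the shorter ones."""
    pairs = pair_words(len(words))
    fonts: List[Optional[str]] = [None] * len(words)
    longer = []  # index of the longer word in each pair
    for pair in pairs:
        idx = max(pair, key=lambda i: len(words[i]))
        longer.append(idx)
        fonts[idx] = recognize(words[idx])

    last = len(words) - 1
    for p, pair in enumerate(pairs):
        for i in pair:
            if i == longer[p]:
                continue
            # Neighboring pair on the side where the shorter word sits.
            q = p - 1 if i < longer[p] else p + 1
            if i in (0, last) or not 0 <= q < len(pairs):
                # First or last word in the line: recognize it directly.
                fonts[i] = recognize(words[i])
            elif fonts[longer[q]] == fonts[longer[p]]:
                # Both surrounding longer words agree: inherit their font.
                fonts[i] = fonts[longer[p]]
            else:
                fonts[i] = recognize(words[i])
    return fonts
```

When the longer words on both sides of a shorter word agree, the classifier is not called for the shorter word at all, which is where the saving in this sketch comes from.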
Preferably, the fine adjustment step of adjusting font information comprises: determining whether the first candidate font of each word in the line is available; if the first candidate font of the word is determined to be unavailable, determining whether the second candidate font of the word is available; and if the second candidate font of the word is determined to be available, exchanging the second candidate font of the word with its first candidate font.
Preferably, the fine adjustment step further comprises: if the second candidate font of the word is determined to be unavailable, determining whether the third candidate font of the word is available; and if the third candidate font of the word is determined to be available, exchanging the third candidate font of the word with its first candidate font.
Preferably, if the third candidate font of the word is determined to be unavailable, it is judged whether the three candidate fonts of the word are reliable; and if all three candidate fonts of the word are determined to be unreliable, the first candidate font of the word is set to the font of an adjacent word.
Preferably, the fine adjustment step comprises the following preprocessing: comparing the font information of each word in the line with a dictionary containing the font information of a plurality of known fonts; and obtaining three candidate fonts of the word, in order of similarity, together with three corresponding distance values. The condition for judging all three candidate fonts of the word unreliable is that all three distance values are greater than a predetermined threshold.
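The candidate cascade described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `WordCandidates`, `fine_adjust` and the `is_available` callback are hypothetical names, the availability test is passed in as a parameter because its exact conditions depend on the word's position in the line, and swapping the distance values along with the fonts is a choice made here to keep the two lists aligned.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WordCandidates:
    fonts: List[str]        # three candidate fonts, most similar first
    distances: List[float]  # matching distance of each candidate


def fine_adjust(word: WordCandidates,
                is_available: Callable[[WordCandidates, int], bool],
                neighbor_font: str,
                threshold: float) -> str:
    """Promote the first available candidate; fall back to the neighbor's font."""
    if not is_available(word, 0):
        for rank in (1, 2):
            if is_available(word, rank):
                # Exchange the available candidate with the first candidate.
                word.fonts[0], word.fonts[rank] = word.fonts[rank], word.fonts[0]
                word.distances[0], word.distances[rank] = (
                    word.distances[rank], word.distances[0])
                break
        else:
            # No candidate is available. If all three are also unreliable
            # (every distance above the threshold), use the neighbor's font.
            if all(d > threshold for d in word.distances):
                word.fonts[0] = neighbor_font
    return word.fonts[0]
```

A caller might pass, for example, `lambda w, r: w.fonts[r] == main_font and w.distances[r] < threshold` as a simplified availability test, where `main_font` is the dominant font of the line.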
Preferably, the fine adjustment step comprises the following preprocessing: comparing the font information of each word in the line with a dictionary containing the font information of a plurality of known fonts; obtaining at least first and second candidate fonts of the word, in order of similarity, together with at least two corresponding distance values; and counting the distribution of detailed fonts in the line according to the first candidate fonts of the words.
Preferably, for the first or last word in the line, the condition for determining that the first candidate font of the word is available comprises: the detailed fonts in the line are roughly consistent, or the first candidate font of the word matches the main font of the line.
Preferably, for the first or last word in the line, the condition for determining that the second candidate font of the word is available comprises: the second candidate font of the word matches the main font of the line and the corresponding distance value is less than a predetermined threshold.
Preferably, the candidate fonts comprise a first, a second and a third candidate font, and for the first or last word in the line, the condition for determining that the third candidate font of the word is available comprises: the third candidate font of the word matches the main font of the line and the corresponding distance value is less than a predetermined threshold.
Preferably, for each word in the line other than the first and last words, the condition for determining that the first candidate font of the word is available comprises: the detailed fonts in the line are roughly consistent and the first candidate font of the current word matches the main font of the line; or the first candidate font of the current word differs from the first candidate fonts of the two adjacent words in the line; or the detailed fonts of the two adjacent words are identical.
Preferably, for each word in the line other than the first and last words, the condition for determining that the second candidate font of the word is available comprises: the second candidate font of the word matches the fonts of the two adjacent words and the corresponding distance value is less than a predetermined threshold.
Preferably, the candidate fonts comprise a first, a second and a third candidate font, and for each word in the line other than the first or last word, the condition for determining that the third candidate font of the word is available comprises: the third candidate font of the word matches the main font of the two adjacent words and the corresponding distance value is less than a predetermined threshold.
Preferably, the method further comprises a coarse adjustment step of adjusting the font information of the line according to the distribution of font information in the line. The coarse adjustment step comprises: counting the distribution of the detailed fonts and of at least one of serif and spacing in the line; determining whether said at least one of serif and spacing is uniform and whether the detailed fonts in the line are roughly consistent; and, if the distribution satisfies these conditions, setting the first candidate font of every word in the line to the main detailed font and using the first candidate font of each word as the recognized font result.
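A minimal Python sketch of the coarse adjustment follows. The text does not quantify "roughly consistent", so the `min_share` parameter (the fraction of words that must share the main font) is an assumption, as are the function name and the use of a single serif flag per word.

```python
from collections import Counter
from typing import List


def coarse_adjust(first_candidates: List[str],
                  serif_flags: List[bool],
                  min_share: float = 0.8) -> List[str]:
    """Unify a line's fonts when serif is uniform and one font dominates."""
    if not first_candidates:
        return []
    serif_uniform = len(set(serif_flags)) == 1
    main_font, count = Counter(first_candidates).most_common(1)[0]
    roughly_consistent = count / len(first_candidates) >= min_share
    if serif_uniform and roughly_consistent:
        # Every word in the line takes the main detailed font.
        return [main_font] * len(first_candidates)
    # Otherwise leave the per-word results untouched.
    return list(first_candidates)
```

With four words recognized as one font and a fifth as another, and all five marked as serif, the whole line is set to the dominant font; if the serif flags disagree, nothing is changed.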
Preferably, the recognition step of recognizing the font size of the words in the line comprises: judging whether the input x-height of the word is available; if the input word x-height is available, querying, with the recognized font of the known word and the input image resolution, an a priori font size dictionary containing an "image resolution / font / font size / x-height" table, to obtain a list of x-heights for the different font sizes; matching the x-height of the word against the x-heights in the list; and taking the corresponding font size as the recognized font size.
Preferably, the recognition step of recognizing the font size of the words in the line comprises: querying, with the recognized font of the known word and the input image resolution, an a priori font size dictionary containing an "image resolution / font / font size / word height" table, to obtain a list of word heights for the different font sizes; matching the height of the word against the word heights in the list; and taking the corresponding font size as the recognized font size.
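The table lookup can be sketched in Python as below. The table contents are invented placeholder values, not measurements from the patent, and "matching" is interpreted here as choosing the entry with the nearest x-height, which is an assumption.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical a priori table: (resolution in dpi, font) -> [(font size, x-height in pixels)].
# The numbers are illustrative placeholders only.
SIZE_TABLE: Dict[Tuple[int, str], List[Tuple[int, int]]] = {
    (300, "Times New Roman"): [(8, 15), (10, 19), (12, 23), (14, 27)],
}


def recognize_font_size(table: Dict[Tuple[int, str], List[Tuple[int, int]]],
                        dpi: int, font: str, x_height: int) -> Optional[int]:
    """Return the font size whose tabulated x-height is closest to the measured one."""
    entries = table.get((dpi, font))
    if not entries:
        return None  # no table for this resolution/font combination
    size, _ = min(entries, key=lambda entry: abs(entry[1] - x_height))
    return size
```

The same function works for the word-height variant of the table if the second element of each entry holds a word height instead of an x-height.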
According to a second aspect, the invention provides a font recognition method comprising:
a normalization step of normalizing a word of an input image to a predetermined height;
a feature extraction step of extracting features from the normalized word;
a judging step of judging the interval type of the word;
a classification step; and
a recognition step of recognizing the font information of the word based on the result of the classification step.
This font recognition method normalizes the height of the word image to a given height, so that the features of word images with the same interval type are extracted at the same scale; this is because the dictionaries used in font recognition are classified by word interval type. In font classification, the word interval type is considered first in order to select the dictionary for that interval type. When the resulting distance is greater than a given threshold, or when the word interval type is unknown, multi-dictionary recognition in an a priori order is adopted. This embodiment therefore makes full use of the interval type information, and the whole OCR process becomes more efficient.
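To make the normalization step concrete, here is a minimal Python sketch that rescales a binary word image to a fixed height while preserving its aspect ratio. Nearest-neighbor sampling and the list-of-rows image representation are assumptions chosen for brevity; the text does not say which interpolation is used.

```python
from typing import List


def normalize_height(image: List[List[int]], target_height: int) -> List[List[int]]:
    """Rescale a non-empty binary image (list of pixel rows) to target_height rows."""
    src_h, src_w = len(image), len(image[0])
    scale = target_height / src_h
    target_w = max(1, round(src_w * scale))  # keep the aspect ratio
    return [
        [
            # Nearest-neighbor: map each output pixel back to a source pixel.
            image[min(src_h - 1, int(y / scale))][min(src_w - 1, int(x / scale))]
            for x in range(target_w)
        ]
        for y in range(target_height)
    ]
```

After this step, two word images of the same interval type set in different font sizes have the same pixel height, so features extracted from them are comparable.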
Preferably, the input image is a word image from a document image on which line segmentation and word segmentation have been performed.
Preferably, the judging step includes a step of judging whether the interval type of the word is known from outside.
Preferably, the classification step further comprises comparing the extracted features of the normalized word with the features of the candidate fonts in the dictionary of the judged word interval type and obtaining a distance value from the comparison, and the method further comprises a verification step of checking this distance value against a predetermined threshold before the recognition step. A Bayes classifier is preferably used for the classification step.
Preferably, the dictionary is selected from at least a dictionary of x-height words, a dictionary of all-capital words, a dictionary of ascender words, a dictionary of descender words and a dictionary of full-height words.
Preferably, the dictionary for each interval type has at least 40 types of candidate fonts.
Preferably, if the interval type of the word cannot be obtained in the judging step, or if the distance value is less than the predetermined threshold in the verification step, the method further comprises a sub-classification step using a plurality of dictionaries.
Preferably, the sub-classification step comprises: comparing the extracted features of the normalized word with the features of the candidate fonts in the dictionary of at least one word interval type other than the judged word interval type; obtaining a distance value from each such comparison; and checking the distance value against another predetermined threshold, so that the font information of the word is recognized based on the result of the check.
Preferably, the dictionaries of the at least one word interval type to be compared are taken in the following order: the dictionary of x-height words, the dictionary of all-capital words, the dictionaries of ascender words and descender words, and the dictionary of full-height words.
Preferably, the font information of the word to be recognized comprises at least typeface, serif, weight, slope and spacing.
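The dictionary selection and multi-dictionary fallback can be sketched in Python as follows. This is a simplified illustration, not the patented classifier: a nearest-template search with Euclidean distance is swapped in for the Bayes classifier, the dictionary keys and function names are hypothetical, and the fallback is triggered when the distance exceeds the threshold, following the explanatory paragraph of the second aspect.

```python
import math
from typing import Dict, List, Optional, Tuple

FontDictionary = Dict[str, List[float]]  # font name -> template feature vector

# A priori order in which the other dictionaries are tried.
FALLBACK_ORDER = ["x_height", "all_caps", "ascender", "descender", "full_height"]


def nearest_font(features: List[float], dictionary: FontDictionary) -> Tuple[str, float]:
    """Return the closest template font and its Euclidean distance."""
    return min(((font, math.dist(features, template))
                for font, template in dictionary.items()),
               key=lambda pair: pair[1])


def classify_word(features: List[float],
                  interval_type: Optional[str],
                  dictionaries: Dict[str, FontDictionary],
                  threshold: float,
                  fallback_threshold: float) -> Optional[str]:
    """Try the dictionary of the judged interval type, then the others in order."""
    if interval_type in dictionaries:
        font, distance = nearest_font(features, dictionaries[interval_type])
        if distance <= threshold:
            return font
    for name in FALLBACK_ORDER:
        if name == interval_type or name not in dictionaries:
            continue
        font, distance = nearest_font(features, dictionaries[name])
        if distance <= fallback_threshold:
            return font
    return None  # no dictionary produced an acceptable match
```

Passing `interval_type=None` models the case where the interval type could not be judged, so every dictionary is tried in the a priori order.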
Alternatively, the invention also provides another font recognition method comprising:
a normalization step of normalizing a word of an input image to a predetermined height;
a feature extraction step of extracting features from the normalized word;
a sub-classification step using a plurality of dictionaries; and
a recognition step of recognizing the font information of the word based on the result of the sub-classification step.
Similarly, the input image is a word image from a document image that has undergone line segmentation and word segmentation, a Gabor filter is used for the feature extraction step, and a Bayes classifier is used for the sub-classification step. In addition, the sub-classification step comprises: comparing the extracted features of the normalized word with the features of the candidate fonts in the dictionary of at least one word interval type; obtaining a distance value from each such comparison; and checking the distance value against a predetermined threshold, so that the font information of the word is recognized based on the result of the check. The dictionaries of the at least one word interval type to be compared are taken in the following order: the dictionary of x-height words, the dictionary of all-capital words, the dictionaries of ascender words and descender words, and the dictionary of full-height words.
Preferably, the font information of the word to be recognized comprises at least typeface, serif, weight, slope and spacing.
According to a third aspect, the invention provides a method of discriminating the interval type of an input text line, comprising:
a line information calculation step of calculating the line information of the input text line using a projection method;
a word information calculation step of calculating the word information of a selected word in the text line using the projection method;
a reliability judging step of judging whether the word information is reliable; and
an interval type marking step of marking the interval type of the word according to the line information and the word information;
wherein, if the calculated word information is judged unreliable in the reliability judging step, a connected-component method is used to calculate the baseline and upper line of the selected word.
This method of discriminating the interval type of an input text line combines the two approaches to improve the accuracy of the x-height calculation. Because the projection method is much faster than the connected-component method, the projection method is used first to calculate the word upper line, and the connected-component method is used when reliable word line information cannot be obtained by the projection method. Strict checks and several verifications are performed so that the calculated word upper line becomes increasingly accurate. Finally, a reliable word x-height value and interval type are obtained.
Preferably, the line information calculation step comprises: performing a vertical projection over the whole line region to obtain the baseline and upper line of the text line. The line information calculation step further comprises checking the reliability of the obtained line baseline. Checking the reliability of the obtained line baseline comprises: if the height of the lower interval, between the line baseline obtained in the line information calculation step and the bottom line of the line, is not less than the height of the middle interval, between the line upper line obtained in the line information calculation step and the line baseline, performing the vertical projection again in the interval below the rough baseline to obtain a new line baseline.
Preferably, the word information calculation step comprises: using a vertical projection histogram to calculate the word upper line within the first word interval; and calculating the word baseline based on the relation between the height of the word and the y coordinate of the line baseline, wherein, if the line baseline is close to the bottom of the word, the word baseline is directly determined to be equal to the bottom of the word.
Preferably, the reliability judging step judges whether the word lines are reliable by determining the relation between the line information and the word information.
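As a rough illustration of the projection method, the Python sketch below counts ink pixels per pixel row and takes the first and last "dense" rows as the upper line and baseline. The density rule (rows with at least `ratio` times the peak count) is a common heuristic assumed here, not the criterion used in the patent, and the reliability checks described above are omitted.

```python
from typing import List, Optional, Tuple


def row_profile(image: List[List[int]]) -> List[int]:
    """Project ink pixels onto the vertical axis: one count per pixel row."""
    return [sum(row) for row in image]


def upper_line_and_baseline(image: List[List[int]],
                            ratio: float = 0.5) -> Optional[Tuple[int, int]]:
    """Return (upper line row, baseline row) of the dense x-height band."""
    profile = row_profile(image)
    peak = max(profile, default=0)
    if peak == 0:
        return None  # blank image: nothing to measure
    dense = [y for y, count in enumerate(profile) if count >= ratio * peak]
    return dense[0], dense[-1]
```

In this sketch, rows occupied only by ascenders or descenders carry few ink pixels, so they fall below the density cut and the remaining band spans the x-height.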
Preferably, the interval type marking step comprises a step of preliminarily marking the interval type of the selected word using classification conditions based on the line information and the word information. The classification conditions comprise at least one feature selected from the following group:
z1/(z1+z2);z3/(z3+z2);nzw;th;|ul_w-ul_l|/(z1+z2);
Here, z1 is the height of the upper interval of the word; z2 is the height of the middle interval of the word; z3 is the height of the lower interval of the word; ul_w is the y coordinate of the word upper line (in the word coordinate system); ul_l is the y coordinate of the line upper line (in the word coordinate system); nzw is the ratio of w1 to w2; w1 is the width of the non-zero region of the horizontal projection histogram in the upper interval of the word; w2 is the width of the word; th is the ratio of max1 to max2; max1 is the maximum of Pv(i) for i near the word top line; max2 is the maximum of Pv(i) for i near the word upper line. Preferably, the feature nzw is used in combination with the feature th.
The interval types with which a selected word can be marked include: x-height word; ascender or full word; descender or full word; word taller than the x-height whose upper line is close to the line upper line; word taller than the x-height; and unknown word.
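Three of the listed features can be computed directly from the word's top, upper line, baseline and bottom. The Python sketch below assumes y coordinates that grow downward and uses hypothetical function and key names; the `toy_label` rule and its cut-off value are purely illustrative and are not the classification conditions of the patent, which are not given numerically in the text.

```python
from typing import Dict


def interval_features(top: int, upper: int, base: int, bottom: int,
                      line_upper: int) -> Dict[str, float]:
    """Compute z1/(z1+z2), z3/(z3+z2) and |ul_w - ul_l|/(z1+z2)."""
    z1 = upper - top      # height of the upper interval
    z2 = base - upper     # height of the middle interval (x-height)
    z3 = bottom - base    # height of the lower interval
    if z2 <= 0:
        raise ValueError("baseline must lie below the upper line")
    return {
        "upper_ratio": z1 / (z1 + z2),
        "lower_ratio": z3 / (z3 + z2),
        "upper_offset": abs(upper - line_upper) / (z1 + z2),
    }


def toy_label(features: Dict[str, float], cut: float = 0.15) -> str:
    """Illustrative rule only: label by which outer intervals are occupied."""
    has_upper = features["upper_ratio"] > cut
    has_lower = features["lower_ratio"] > cut
    if has_upper and has_lower:
        return "full"
    if has_upper:
        return "ascender"
    if has_lower:
        return "descender"
    return "x-height"
```

The remaining features, nzw and th, need the projection histograms themselves and are left out of this sketch.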
Preferably, the method of discriminating the interval type of an input text line may further comprise a step of judging whether the selected word in the text line is a short word and, for a word judged to be short, calculating its word information by applying the projection method within a designated area of the word interval. The method may accordingly further comprise a step of verifying the word upper line calculated in the word information calculation step if the reliability judging step judges the calculated word information to be reliable, the step of judging whether the word in the text line is a short word being performed after the word upper line has been verified.
Preferably, the connected-component method comprises at least one of the following steps:
deleting connected components smaller than a predetermined small size, since these connected components are regarded as noise;
deleting connected components below the word center line; and
deleting connected components above the word center line obtained in the word information calculation step.
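The filtering steps listed above can be sketched in Python with a small breadth-first component labeler. The choice of 8-connectivity, the pixel-count noise criterion, and the interpretation of "below" and "above" as lying entirely on one side of the center row are assumptions; the function names are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Pixel = Tuple[int, int]  # (row, column)


def connected_components(image: List[List[int]]) -> List[List[Pixel]]:
    """Label ink pixels into 8-connected components using breadth-first search."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, component = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny in range(cy - 1, cy + 2):
                    for nx in range(cx - 1, cx + 2):
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(component)
    return components


def filter_components(components: List[List[Pixel]], min_pixels: int,
                      center_row: int, drop: str = "below") -> List[List[Pixel]]:
    """Remove noise and components lying entirely on one side of the center row."""
    kept = []
    for component in components:
        if len(component) < min_pixels:
            continue  # too small: treated as noise
        top = min(y for y, _ in component)
        bottom = max(y for y, _ in component)
        if drop == "below" and top > center_row:
            continue
        if drop == "above" and bottom < center_row:
            continue
        kept.append(component)
    return kept
```

The bounding rows of the surviving components could then be used to estimate the word upper line or baseline when the projection result is not reliable.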
Preferably, the interval type marking step further comprises a step of judging whether to correct the word line information according to the marked interval type.
Based on the interval type obtained by the above method, the x-height of any word in the input text line is easily calculated.
In addition, the invention further provides an optical font recognition apparatus comprising:
a dividing device that divides the words of an input text line image into word pairs;
a font recognition device that recognizes the font information of the longer word in each word pair;
a font discrimination device that discriminates the font information of the shorter word in each word pair, based on the font information of the longer word in the word pair containing the shorter word and the font information of the longer word in the word pair adjacent to the shorter word;
a fine adjustment device that adjusts the font information of a word according to the font information of adjacent words;
a coarse adjustment device that adjusts the font information of the line according to the distribution of font information in the line; and
a recognition device that recognizes the font size of the words in the line.
Correspondingly, the invention further provides a font recognition apparatus comprising:
a normalization device that normalizes a word of an input image to a predetermined height;
a feature extraction device that extracts features from the normalized word;
a judging device that judges the interval type of the word;
a classification device; and
a recognition device that recognizes the font information of the word based on the result obtained by the classification device.
Alternatively, the above font recognition apparatus can be modified to comprise: a normalization device that normalizes an input image to a predetermined height; a feature extraction device that extracts features from the normalized word; a sub-classification device that directly uses a plurality of dictionaries; and a recognition device that recognizes the font information of the word based on the result obtained by the sub-classification device.
Therefore, the present invention further provides a kind of equipment of differentiating the Interval Type that input text is capable, comprising:
Use sciagraphy to calculate the capable information calculations device of the capable capable information of input text;
Use sciagraphy to calculate the first word information calculation element of the word information of selected speech in the line of text;
Reliably whether grammatical term for the character information the reliability judgment means;
Be judged as the base line of reliable selected speech and the second word information calculation element of reaching the standard grade by using the connected unit method not calculate by the reliability judgment means; With
The Interval Type of institute's predicate is carried out the Interval Type labelling apparatus of mark according to row information and word information;
Accordingly, the present invention further provides an apparatus that calculates the X height of a word in an input text line using the Interval Type obtained by the above apparatus for discriminating the Interval Type of an input text line.
The present invention also provides a program for executing the steps of optical font recognition, the steps comprising:
A division step of dividing the words of an input text line image into word pairs;
A font recognition step of recognizing the font information of the longer word in each word pair;
A font discrimination step of discriminating the font information of the shorter word in each word pair based on the font information of the longer word in the word pair containing the shorter word and the font information of the longer word in the word pair adjacent to the shorter word;
A fine adjustment step of adjusting the font information of a word according to the font information of adjacent words;
A coarse adjustment step of adjusting the font information of a line according to the distribution of font information within the line; and
A recognition step of recognizing the font size of the words in the line.
The present invention also provides a program for executing the steps of font recognition, the steps comprising:
A normalization step of normalizing a word of an input image to a predetermined height;
A feature extraction step of extracting features from the normalized word;
A judgment step of judging the Interval Type of the word;
A classification step; and
A recognition step of recognizing the font information of the word based on the result obtained by the classification step.
Alternatively, the present invention also provides a program for executing the steps of font recognition, the steps comprising:
A normalization step of normalizing an input image to a predetermined height;
A feature extraction step of extracting features from the normalized word;
A fine classification step that directly uses a plurality of dictionaries; and
A recognition step of recognizing the font information of the word based on the result obtained by the fine classification step.
The present invention also provides a program for executing the following steps of discriminating the Interval Type of an input text line, the steps comprising:
A line information calculation step of using the projection method to calculate the line information of the input text line;
A first word information calculation step of using the projection method to calculate the word information of a selected word in the text line;
A reliability judgment step of judging whether the word information is reliable; and
An Interval Type labeling step of labeling the Interval Type of the word according to the line information and the word information;
Wherein, if the calculated word information is judged unreliable in the reliability judgment step, the connected component method is used to calculate the baseline and upper line of the selected word.
The present invention also provides a program for executing the following steps of calculating the X height of a word in an input text line, the steps comprising:
A line information calculation step of using the projection method to calculate the line information of the input text line;
A first word information calculation step of using the projection method to calculate the word information of a selected word in the text line;
A reliability judgment step of judging whether the word information is reliable;
If the calculated word information is judged unreliable in the reliability judgment step, using the connected component method to calculate the baseline and upper line of the selected word;
An Interval Type labeling step of labeling the Interval Type of the word according to the line information and the word information; and
Using the obtained Interval Type to calculate the X height of the word.
In another aspect, the present invention also provides a medium storing a program for executing the following steps of optical font recognition, the steps comprising:
A division step of dividing the words of an input text line image into word pairs;
A font recognition step of recognizing the font information of the longer word in each word pair;
A font discrimination step of discriminating the font information of the shorter word in each word pair based on the font information of the longer word in the word pair containing the shorter word and the font information of the longer word in the word pair adjacent to the shorter word;
A fine adjustment step of adjusting the font information of a word according to the font information of adjacent words; and
A coarse adjustment step of adjusting the font information of a line according to the distribution of font information within the line.
In another aspect, the present invention also provides a medium storing a program for executing the following steps of discriminating the Interval Type of an input text line, the steps comprising:
A line information calculation step of using the projection method to calculate the line information of the input text line;
A first word information calculation step of using the projection method to calculate the word information of a selected word in the text line;
A reliability judgment step of judging whether the word information is reliable; and
An Interval Type labeling step of labeling the Interval Type of the word according to the line information and the word information;
Wherein, if the calculated word information is judged unreliable in the reliability judgment step, the connected component method is used to calculate the baseline and upper line of the selected word.
In another aspect, the present invention also provides a medium storing a program for executing the following steps of calculating the X height of a word in an input text line, the steps comprising:
A line information calculation step of using the projection method to calculate the line information of the input text line;
A first word information calculation step of using the projection method to calculate the word information of a selected word in the text line;
A reliability judgment step of judging whether the word information is reliable;
If the calculated word information is judged unreliable in the reliability judgment step, using the connected component method to calculate the baseline and upper line of the selected word;
An Interval Type labeling step of labeling the Interval Type of the word according to the line information and the word information; and
Using the obtained Interval Type to calculate the X height of the word.
Other features and advantages of the present invention will become apparent from the following description taken in conjunction with the accompanying drawings.
Description of drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
Fig. 1 is an overall flowchart of X height value calculation and Interval Type discrimination according to the first embodiment.
Fig. 2 shows details of the vertical projection method used for X height value calculation.
Figs. 3 and 4 show line information results of the projection method and illustrate the use of the two features nzw and th, respectively, in labeling the detailed word Interval Type.
Fig. 5 shows the upper lines of the short word "The" obtained with the "normal projection method" and with "projection in the specified region for short words", respectively.
Fig. 6 shows a text line containing the short word "He", which belongs to short-word type 2.
Fig. 7 shows the short word "He" processed with the vertical projection histogram.
Fig. 8 is an expanded flowchart of step 600 in Fig. 1.
Fig. 9 shows an application of the Interval Type discriminated according to the first embodiment of the present invention.
Fig. 10 shows an application of the X height value of English words calculated according to the first embodiment of the present invention.
Fig. 11 is a main flowchart of the second embodiment of the present invention.
Fig. 12 is an expanded flowchart of step 400 of Fig. 11.
Fig. 13 is an expanded flowchart of step 600 of Fig. 11.
Fig. 14 is a flowchart of an application of the second embodiment of the present invention.
Fig. 15 is a main flowchart of the third embodiment of the present invention.
Fig. 16 is a detailed main flowchart of step 400 of Fig. 15 according to the third embodiment of the present invention.
Fig. 17 is a detailed flowchart of step 500 of Fig. 15 according to the third embodiment of the present invention.
Fig. 18 is a detailed flowchart of step 600 of Fig. 15 according to the third embodiment of the present invention.
Fig. 19 is a detailed flowchart of step 700 of Fig. 15 according to the third embodiment of the present invention.
Fig. 20 is a flowchart of an application of the third embodiment of the present invention.
Fig. 21 is a flowchart of optical character recognition according to the fourth embodiment of the present invention.
Fig. 22 shows an apparatus implementing the optical font recognition method according to the third embodiment of the present invention.
Fig. 23 shows an apparatus implementing the character recognition method according to the second embodiment of the present invention.
Fig. 24 shows an apparatus implementing the character recognition method according to the second embodiment of the present invention.
Fig. 25 shows an apparatus implementing the method of discriminating the Interval Type of an input text line according to the first embodiment of the present invention.
Embodiment
The optical character recognition method and apparatus according to the present invention are described below with reference to the drawings.
The terms used in the present invention are explained as follows:
OFR: optical font recognition.
OCR: optical character recognition.
ICR: single character recognition (OCR of a single character).
X height: the height of the part of a letter that is as tall as the lowercase letter x in the same line.
Ascender: the part of a lowercase letter that rises above the X height.
Descender: the part of a lowercase letter that extends below the baseline.
Font: a stylistic variant within a typeface family, such as roman, italic, bold, ultra, condensed, expanded, outline, contour, and so on.
Weight: the weight of a character is determined by the ratio of its stroke thickness to its overall height. Most fonts are offered in two weights, regular and bold.
Slope: the slope indicates the orientation of the main strokes of a letter. A font can be roman (upright) or italic.
Serif: a short cross stroke at the ends of the upward and downward strokes of letters in some fonts.
Spacing: spacing refers to the horizontal space required by each character of a font. Because different characters have different widths, variable spacing adjusts the gaps between characters. Monospacing gives every character the same space regardless of its width.
(embodiment 1)
Embodiment 1 accurately calculates the X height value of English words and reliably discriminates their Interval Type (ascender, descender, full-height or X height).
The present embodiment improves both the projection method and the connected component (CC) method, and combines them to improve the accuracy of the X height value calculation.
It is well known that the vertical projection profile disclosed in "Optical font recognition from projection profiles" (Electronic Publishing, Vol. 6(3), 249-260, September 1993) can only handle words or text lines that have an ideal vertical projection profile, and that its performance degrades on short words, differing X heights and upper-case characters. When the text line is skewed, the method may fail. In addition, it is difficult to discriminate the Interval Type of English words from the vertical projection profile alone, because the projection profiles of words with different Interval Types are sometimes similar in shape and are therefore hard to classify.
However, because the projection method is much faster than the connected component method, the projection method is used to calculate the word upper line quickly, and the connected component method is used only when the projection method cannot obtain reliable line information for the word. Strict checks and verifications are performed repeatedly so that the calculated word upper line becomes more and more accurate. Finally, reliable word X height values and Interval Types are obtained.
Fig. 1 is an overall flowchart of X height value calculation and Interval Type discrimination according to the first embodiment.
In step 100, the projection method is used to calculate the line baseline flexibly so as to avoid the effect of skew. Fig. 2 shows details of the vertical projection method used for X height value calculation. The height of the top line (Htop), the height of the bottom line (Hbottom), the height of the baseline (Hbase) and the height of the upper line (Hupper) of the text line are calculated as follows:
Htop = max{ i : Pv[i] > 0 };
Hbottom = min{ i : Pv[i] > 0 };
Hbase = the i for which Pv[i] - Pv[i-1] is maximal;
Hupper = the i for which Pv[i] - Pv[i-1] is minimal.
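The four definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function name `line_heights` is hypothetical, and it assumes that `pv[i]` is the number of black pixels at height `i`, with `i` counted upward from the bottom of the text-line image (the reading under which Htop is a maximum and Hbottom a minimum).

```python
def line_heights(pv):
    """Return (Htop, Hbottom, Hbase, Hupper) for a projection profile.

    pv[i] is the projected pixel count at height i. Heights are assumed
    to increase upward (an assumption of this sketch).
    """
    inked = [i for i, value in enumerate(pv) if value > 0]
    if not inked or len(pv) < 2:
        raise ValueError("projection profile is empty or too short")

    h_top = max(inked)      # highest height that contains ink
    h_bottom = min(inked)   # lowest height that contains ink

    # First differences Pv[i] - Pv[i-1] for every i >= 1.
    diffs = {i: pv[i] - pv[i - 1] for i in range(1, len(pv))}
    h_base = max(diffs, key=diffs.get)    # largest jump: baseline
    h_upper = min(diffs, key=diffs.get)   # largest drop: upper line
    return h_top, h_bottom, h_base, h_upper


# Toy profile: descenders (2), main body (10), ascenders (3).
profile = [0, 2, 2, 10, 10, 10, 3, 3, 0]
print(line_heights(profile))
```

In the toy profile the largest jump occurs where the main body of the letters begins and the largest drop where it ends, which is how the baseline and upper line are located.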
First, vertical projection is performed over the whole text region to obtain a rough baseline and upper line of the text line.
Then, the validity of the line baseline is checked. If the height of the lower interval between the line baseline and the line bottom line is less than the height of the middle interval between the line upper line and the line baseline, the line information is probably correct. Otherwise, the vertical projection method is performed again in the interval below the rough baseline to obtain a more reliable line baseline, so as to avoid the influence of line skew.
In step 200, the vertical projection histogram is used to calculate the word upper line in the first word interval.
In step 300, this embodiment uses different methods to calculate the word baseline depending on the line baseline result. Details are shown in Table I.
Table I
No. | Line baseline result | Corresponding method
1 | 0.57*Hw - 1 < Y(baseline) < Hw - 2 (the line baseline is in the second word interval) | Vertical projection histogram in the word interval around the line baseline (±20%)
2 | abs(Y(baseline) - (Hw - 1)) ≤ 1 (the line baseline is near the word bottom) | The word baseline equals the bottom of the word
3 | Other cases | Vertical projection histogram in the word interval spanning 55% to 100% of the word height
Here,
Hw: the height of the word
Y(baseline): the y coordinate of the line baseline
If 0.57*Hw - 1 < Y(baseline) < Hw - 2, meaning that the line baseline lies in the second word interval, the word baseline is calculated using the vertical projection histogram in the word interval around the line baseline (within ±20% of the baseline);
If |Y(baseline) - (Hw - 1)| ≤ 1, meaning that the line baseline is close to the word bottom, the word baseline is taken to be the bottom of the word;
Otherwise, the word baseline is calculated using the vertical projection histogram in the word interval spanning 55% to 100% of the word height.
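The case selection of Table I can be sketched as a small function. This is a hedged illustration: the name `table_i_case` is hypothetical, and the function only returns which of the three cases applies; the projection step that each case triggers is not reproduced here.

```python
def table_i_case(hw, y_baseline):
    """Return the Table I case number (1, 2 or 3).

    hw         -- height of the word in pixels
    y_baseline -- y coordinate of the line baseline, in the same units
    """
    # Case 1: the line baseline falls inside the second word interval.
    if 0.57 * hw - 1 < y_baseline < hw - 2:
        return 1
    # Case 2: the line baseline is within one pixel of the word bottom.
    if abs(y_baseline - (hw - 1)) <= 1:
        return 2
    # Case 3: every other situation.
    return 3


for y in (30, 39, 10):
    print(y, table_i_case(40, y))
```

The order of the checks follows the order of the table, so a baseline that satisfies the first condition is never treated as case 2.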
In step 350, it is judged whether the line information is definitely wrong, using the following criteria:
The line information is definitely wrong when any one of the following 3 criteria is not satisfied:
The word upper line is not very close to the top of the line;
The height of the line interval below the baseline does not exceed the height of the line interval between the line upper line and the baseline;
The baseline lies in the interval between 57% and 100% of the word height.
If step 350 determines that the line information is wrong, processing goes to step 600, where the connected component (CC) method is used to calculate the baseline and the upper line (explained in more detail below). If step 350 determines that the line information is not wrong, step 380 labels the detailed Interval Type according to the line and word information described above. Details are shown in Table II.
The following 6 Interval Types are used in Table II and elsewhere in this specification: Interval Type No. 1 refers to an X height word; Interval Type No. 2 refers to an ascender or full word; Interval Type No. 3 refers to a descender or full word; Interval Type No. 4 refers to a word higher than the X height whose upper line is close to the line upper line; Interval Type No. 5 refers to a word higher than the X height; and Interval Type No. 6 refers to an unknown word.
Table II:
Interval Type No. | Classification condition | Detailed Interval Type
1 | z1/(z1+z2) < 0.065 and z3/(z3+z2) < 0.065 | X height word
2 | 0.22 < z1/(z1+z2) < 0.42 and (nzw < 0.6 or th < 0.6) | Ascender or full word
3 | 0.21 < z3/(z2+z3) < 0.45 and z1/(z1+z2) < 0.42 | Descender or full word
4 | (… < z1/(z1+z2) ≤ 0.22) and (abs(ul_w - ul_l)/(z1+z2) < 0.065) and (nzw < 0.6 or th < 0.6) | Word higher than the X height whose upper line is close to the line upper line
5 | (… < z1/(z1+z2) ≤ 0.22) and (nzw < 0.6 or th < 0.6) | Word higher than the X height
6 | Other cases | Unknown word
Here,
z1: the height of the word upper interval, between the line top line and the line upper line;
z2: the height of the word middle interval;
z3: the height of the word lower interval;
ul_w: the y coordinate of the word upper line (in the word coordinate system);
ul_l: the y coordinate of the line upper line (in the word coordinate system);
nzw: the ratio of w1 to w2;
w1: the width of the non-zero region of the horizontal projection histogram in the word upper interval;
w2: the width of the word;
th: the ratio of max1 to max2;
max1: the maximum of Pv(i) over the i near the word top line;
max2: the maximum of Pv(i) over the i near the word upper line.
Here, the present embodiment uses two features, nzw and th, in labeling the detailed word Interval Type. These two features are designed to check the line information result of the projection method; they are not valid for checking the result of the connected component method. When nzw and th are each used as features in the classification, the detailed Interval Type can be obtained satisfactorily.
Fig. 3 shows the line information result of the projection method. The upper thin line on each of the words "were" and "labor" is the calculated word top line, and the lower thin line crossing each of the words "were" and "labor" is the word upper line. Clearly, the result for the word "were" is wrong, while the result for the word "labor" is correct. How the two features nzw and th work is explained below.
In Fig. 4, the horizontal projection histogram is applied in the word upper interval to obtain "w1", whether the word is "were" or "labor". "nzw" is then easily obtained. The "nzw" value of the word "were" is 0.73, and that of the other word is 0.16. These results differ markedly. The present embodiment therefore sets a threshold to judge whether the line information result of the projection method is correct.
On the right of Fig. 3, the vertical projection histogram is used word by word to obtain the histogram maxima near the word top line and near the word upper line. The "th" value of the word "were" is then determined to be 0.90, and the other value is 0.29. These results also differ markedly, so a threshold is easily set to judge whether the result is correct.
In short, because a judgment based on a single feature is too strict and prone to error, the two features are combined to make the judgment. If both features "nzw" and "th" are greater than the threshold, the line information result of the projection method must be wrong; otherwise it is reliable.
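The combined nzw/th check can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and the threshold of 0.6 is an assumption borrowed from the nzw and th bounds that appear in Table II, since the text does not state the threshold used at this point.

```python
def projection_result_is_wrong(w1, w2, max1, max2, threshold=0.6):
    """Combine the two features nzw and th into one judgment.

    w1, w2     -- non-zero width in the upper interval and word width
    max1, max2 -- projection maxima near the word top line and near
                  the word upper line
    threshold  -- assumed value (0.6), not confirmed by the text
    """
    nzw = w1 / w2
    th = max1 / max2
    # The result is rejected only when BOTH features exceed the threshold.
    return nzw > threshold and th > threshold


# Feature values reported in the text for "were" and for "labor".
print(projection_result_is_wrong(73, 100, 90, 100))   # "were"
print(projection_result_is_wrong(16, 100, 29, 100))   # "labor"
```

Requiring both features to exceed the threshold is what makes the combined test less strict than either feature alone.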
In step 400 of Fig. 1, the word upper line is verified; the criteria used here are the additional conditions shown in Table III.
Table III
Interval Type | Additional condition | Processing
1 | N/A | Pending
2 | R > 3.9 or (R ≥ 1.3 and (z1+z2)/(zz1+zz2) > 0.63) | Verify the upper line
3 | N/A | Verify the upper line
4 | R > 3.9 | Verify the upper line
6 | Other cases | Do not verify
Here,
R: aspect ratio = word width / (z1+z2)
zz1: the height of the line upper interval
zz2: the height of the line middle interval
If step 400 determines that the word upper line can be verified, processing goes to step 800 to determine whether the final information is correct. If step 400 determines that the word upper line cannot be verified, processing goes to step 450, which uses the following criteria to judge whether the word is a "short word".
Table IV
[Table IV appears as an image in the original publication: Figure C20051002288100281]
Here,
max1: the maximum of Pv(i) over the i in the top third of the middle and upper intervals;
max2: the maximum of Pv(i) over the i in the middle third of the middle and upper intervals;
ul_l: the y coordinate of the line upper line (in the word coordinate system);
R: aspect ratio = word width / (z1+z2)
zz1: the height of the line upper interval
zz2: the height of the line middle interval
nzw and th: the same as in the criteria used in step 380.
If the above additional conditions are satisfied, the word is a "short word".
If the word is determined not to be a "short word", processing goes to step 600, where the connected component (CC) method is used to calculate the baseline and the upper line (described in more detail below). If the word is determined to be a "short word", step 500 uses the projection method to calculate the word upper line of the short word in a specified region (such as the region on the right of Fig. 5), and step 700 then labels the detailed Interval Type again according to the word upper line.
Table V
Short-word type | Processing method
1 | Vertical projection histogram in the word interval spanning 22% to 42% of the upper and middle intervals
2 | Vertical projection histogram in the word interval near the line upper line
Fig. 5 shows the upper line of the word "The". The upper thin line on the word is the result of the normal projection method, and the lower thin line crossing the word is the result of "projection in the specified region for short words". It is easy to see that the former is wrong and the latter is correct.
As shown in Fig. 5, max1 is less than max2, so the short word "The" belongs to short-word type 1. The corresponding processing method from Table V is then applied: a vertical projection histogram in the white specified word interval around the lower thin line crossing the word. The correct word upper line can then be obtained.
As shown in Fig. 7, the top line of the short word "He" is close to the line top line, and the line information of the whole line satisfies the specified conditions, so this word belongs to short-word type 2.
The corresponding processing method is then applied: a vertical projection histogram in the word interval near the line upper line, shown as the white specified projection interval around the upper part of Fig. 7. Finally, the correct word upper line is obtained, shown as the thin line directly on top of the letter "e" in Fig. 7, whereas the normal projection method would yield a wrong word upper line.
Step 600, in which the connected component (CC) method is used to calculate the baseline and the upper line, is explained below. This step is executed if step 350 determines that the line information is wrong, or if step 450 determines that the word is not a "short word", that is, when the projection method cannot obtain reliable line information for the word. The connected component (CC) method is widely used in the art. Here, the present embodiment provides a new connected component method, in which connected components with an unsuitable area or position are regarded as noise or punctuation and are eliminated in order to obtain a more accurate baseline and upper line.
Fig. 8 is an expanded flowchart of step 600 in Fig. 1.
In step 610 of Fig. 8, after the connected component (CC) information of the word has been labeled, connected components whose area is less than SmallArea are deleted. Here, SmallArea is chosen to be 0.015 of the word area, because a connected component whose area is less than SmallArea is regarded as noise and should be deleted. The ratio of SmallArea to the word area is chosen empirically or according to actual requirements.
In step 620 of Fig. 8, the connected components below the word center line are deleted. This step is designed to delete "lower punctuation", such as commas and full stops.
Then, MTop (the mean top) is calculated from the remaining connected components, and in step 630 of Fig. 8 the connected components above the word center line are deleted. This step is designed to delete "upper punctuation", such as quotation marks.
MBot (the mean bottom) is then calculated from the remaining connected components, and the mean bottom value of the remaining connected components whose bottom is not greater than MBot is calculated to obtain the baseline.
In addition, based on the MTop obtained above, the mean top value of the remaining connected components whose top is not less than MTop is calculated to obtain the upper line.
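The steps 610 to 630 and the two averaging rules above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the `Component` type and the function name are hypothetical; y is assumed to grow downward; the word center line is assumed to be at half the word height; "below" and "above" the center line are read as lying entirely on that side; and the fallback used when no top reaches MTop is an addition of this sketch.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Component:
    top: int      # y of the highest pixel (y grows downward, assumed)
    bottom: int   # y of the lowest pixel
    area: int     # number of pixels


def cc_baseline_and_upper_line(components, word_height, word_area,
                               small_ratio=0.015):
    """Return (baseline, upper_line) following steps 610 to 630."""
    center = word_height / 2  # assumed position of the word center line

    # Step 610: drop components smaller than SmallArea (noise).
    kept = [c for c in components if c.area >= small_ratio * word_area]
    # Step 620: drop components lying entirely below the center line.
    kept = [c for c in kept if c.top <= center]
    if not kept:
        raise ValueError("no usable connected components")
    m_top = mean([c.top for c in kept])

    # Step 630: drop components lying entirely above the center line.
    kept = [c for c in kept if c.bottom >= center]
    if not kept:
        raise ValueError("no usable connected components")
    m_bot = mean([c.bottom for c in kept])

    # Baseline: mean bottom of components whose bottom is not greater than MBot.
    baseline = mean([c.bottom for c in kept if c.bottom <= m_bot])
    # Upper line: mean top of components whose top is not less than MTop.
    tops = [c.top for c in kept if c.top >= m_top]
    upper_line = mean(tops or [c.top for c in kept])  # fallback is an assumption
    return baseline, upper_line


word = [
    Component(0, 29, 300),   # letter with an ascender
    Component(10, 29, 100),  # X height letter
    Component(10, 39, 300),  # letter with a descender
    Component(27, 33, 40),   # comma, removed in step 620
    Component(0, 6, 40),     # quotation mark, removed in step 630
    Component(5, 6, 5),      # noise, removed in step 610
]
print(cc_baseline_and_upper_line(word, word_height=40, word_area=2400))
```

Taking only the bottoms at or above MBot keeps descenders out of the baseline estimate, and taking only the tops at or below MTop keeps ascenders out of the upper line estimate.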
After the word upper line has been calculated in step 500 or step 600 of Fig. 1, step 700 of Fig. 1 labels the detailed word Interval Type according to the calculated word upper line, for use in the final check of the line information and in the final word Interval Type labeling.
In step 800 of Fig. 1, whether the final line information is correct is judged according to the detailed Interval Type:
When the Interval Type is 3 (and not at the same time 2 or 5), the word upper line must satisfy the following criterion:
((z1-1)/(z2+1)<0.15) and (z3/(z2+1)<0.83).
Otherwise, the line information of the word is wrong.
When the Interval Type is 2 or 5, the word upper line must satisfy the following criterion:
((z1-1)/(z2+1)<0.83) and (z3/(z2+1)<0.83).
Otherwise, the line information of the word is wrong, and the Interval Type and the X height are unknown.
Here,
z1: the height of the word upper interval
z2: the height of the word middle interval
z3: the height of the word lower interval
Interval Type (zone type): the type as listed in Table II.
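The two criteria of step 800 can be sketched as a single check. This is an illustrative sketch: the function name is hypothetical, the Interval Types labeled for a word are passed as a set because the text allows a word to carry more than one type at once, and returning True for types that step 800 does not mention is an assumption of this sketch.

```python
def final_line_info_ok(interval_types, z1, z2, z3):
    """Apply the step 800 criteria to one word.

    interval_types -- set of Table II type numbers labeled for the word
    z1, z2, z3     -- heights of the word upper, middle and lower intervals
    """
    upper_ratio = (z1 - 1) / (z2 + 1)
    lower_ratio = z3 / (z2 + 1)
    has_2_or_5 = bool({2, 5} & set(interval_types))

    if 3 in interval_types and not has_2_or_5:
        return upper_ratio < 0.15 and lower_ratio < 0.83
    if has_2_or_5:
        return upper_ratio < 0.83 and lower_ratio < 0.83
    return True  # types not covered by step 800 (assumption)


print(final_line_info_ok({3}, z1=1, z2=20, z3=8))
print(final_line_info_ok({3}, z1=10, z2=20, z3=0))
print(final_line_info_ok({2}, z1=10, z2=20, z3=0))
```

The tighter bound of 0.15 for type 3 reflects that a descender word should have almost no upper interval, while types 2 and 5 are allowed a substantial one.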
In step 900 of Fig. 1, the final word Interval Type is labeled according to Table VI.
Table VI
No. | Rule | Final Interval Type
1 | The word X height value can be calculated, and the Interval Type is No. 2 or No. 5 and at the same time No. 3 | Full word
2 | The word X height value can be calculated, and the Interval Type is No. 2 or No. 5 and not at the same time No. 3 | Ascender word
3 | The word X height value can be calculated, and the Interval Type is No. 3 and not at the same time No. 2 or No. 5 | Descender word
4 | The word top line is close to the top of the line, or the aspect ratio of the word is less than 0.65 | Capital word or digits
5 | Other cases | May be an X height word
Here,
Interval Type: Interval Type No. 1 refers to an X height word; Interval Type No. 2 refers to an ascender or full word; Interval Type No. 3 refers to a descender or full word; Interval Type No. 4 refers to a word higher than the X height (X-height-plus word) whose upper line is close to the line upper line; Interval Type No. 5 refers to a word higher than the X height; and Interval Type No. 6 refers to an unknown word.
The application of embodiment 1
The calculated X height value and the discriminated Interval Type (ascender, descender, full-height or X height) of English words are both very useful information in the OCR and OFR processes.
For example, the discriminated Interval Type can be used for dictionary classification (by Interval Type) in English OCR. By reducing the pattern matching time it can increase the speed of the ICR engine, because the dictionary then contains fewer candidate characters. It also helps accuracy, because a character can never be misrecognized as a candidate character that has a different Interval Type. Fig. 9 shows an application of the discriminated Interval Type obtained according to the present embodiment: it judges in turn whether the English character image is of the X height, ascender, descender or full-height Interval Type, and otherwise recognizes it with the dictionary of all Interval Types, thereby obtaining the resulting character code and confidence value.
Interval Type information can also be used in OCR post-processing, where it can correct confusion between upper-case characters and the corresponding lower-case characters.
In English OCR character segmentation, Interval Type information can be used to judge whether a "separation path" (one of the possible character segmentation results) is wrong.
The calculated X height value of English words is very useful for normalizing English words in both the OCR and the OFR processes. English words of different sizes can be normalized to a specified height according to the X height value. Features at the same scale (for OCR and OFR) can therefore be extracted from the normalized images, avoiding the influence of font size.
The X height value also is important in the font size identification of English word, and word or line height can not directly use (in font size identification) simultaneously, because the character in a speech or row can not take whole 3 intervals.Accompanying drawing 10 is depicted as a kind of application of the English word X height value that obtains according to present embodiment.The X that speech X height, image resolution ratio and the type of being discerned can be used for obtaining the different font sizes under the image resolution ratio of appointment and font highly tabulates, and with the X height list match of speech X height with different font sizes, has obtained font size thus then.
The present embodiment also evaluated the accuracy of the Interval Type discrimination method on printed text rows. Over a total of 3302 words, the accuracy of Interval Type discrimination was 99.67%; the evaluation samples were printed text rows with 3 different fonts.
In brief, the present embodiment provides a method of discriminating the Interval Type of a word in an input text row and a method of calculating the X height of a word in the input text row, both comprising: a row information calculation step of calculating the row information of the input text row by a projection method; a word information calculation step of calculating the word information of a selected word in the text row by the projection method; a reliability judgment step of judging whether the word information is reliable; and an Interval Type marking step of marking the Interval Type of the word according to the row information and the word information; wherein, if the calculated word information is judged unreliable in the reliability judgment step, a connected component method is used to calculate the baseline and upper line of the selected word.
In the row information extraction process, the present embodiment combines the projection method with the connected component (CC) method. Both methods are specially designed to obtain better results, and the influence of skew, noise and the like is fully considered and limited. In the present embodiment, careful checks and several verifications are performed so that the calculated word upper line becomes increasingly accurate. Specifically: in the baseline calculation, the row baseline and the projection method are used together to obtain an accurate baseline (see step 300); the projection histogram in a specified region is used to calculate the upper line of short words flexibly (see step 450, step 500 and the criteria in accompanying drawings 5, 6 and 7); in the new connected component (CC) method, connected components with an unsuitable area or position are regarded as noise or punctuation and are removed so as to obtain a more accurate baseline and upper line; and when reliable row information of a word cannot be obtained by the projection method, the connected component method is used (see step 600 and accompanying drawing 8). In addition, the detailed word Interval Type is marked according to a final check of the row information and the calculated word upper line (see steps 380, 700, 900 and accompanying drawings 3 and 4).
Therefore, the present embodiment can accurately calculate the X height value of an English word and reliably discriminate its Interval Type (ascender, descender, full-height or X height).
(embodiment 2)
The second embodiment belongs to a priori font recognition at the word level. It uses Interval Type information to divide English words into four types, each with its own dictionary. It can discriminate the font information (font, serif, weight, slope and spacing) of an English word with high accuracy and speed.
It supports the recognition of at least 10 fonts, far more than currently popular OCR software with document layout restoration functions, such as Omnipage and FineReader.
Accompanying drawing 11 is depicted as the main flow chart of embodiment 2.
In step 100 of accompanying drawing 11, the height of each word image is normalized to WordHeight (here WordHeight=35) by bilinear interpolation, so that words of any size can be handled. This works better than normalization according to the X height, because the X height value sometimes cannot be obtained very accurately, which would affect the subsequent feature extraction and classification.
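The normalization in step 100 can be sketched as follows. This is a minimal illustration, assuming a grayscale word image held in a NumPy array and assuming that the width is scaled by the same factor as the height (the text only fixes the height). The function name `normalize_height` and the pixel-centre sampling convention are illustrative choices, not taken from the patent.

```python
import numpy as np

WORD_HEIGHT = 35  # WordHeight value given in the text


def normalize_height(img, target_h=WORD_HEIGHT):
    """Resize a 2-D grayscale array to target_h rows by bilinear interpolation.

    Assumption: the width is scaled by the same factor to keep the aspect ratio.
    """
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    target_w = max(1, int(round(w * target_h / h)))
    # Source coordinates of every target pixel centre, clipped to the image.
    ys = np.clip((np.arange(target_h) + 0.5) * h / target_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(target_w) + 0.5) * w / target_w - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy
```

Any library resize routine with bilinear interpolation would serve the same purpose; the explicit version only makes the interpolation visible.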
Step 200 of accompanying drawing 11 extracts the font features. In principle, any global texture feature can be used in this step. Here, complex 2D isotropic Gabor filters or other global features are used to extract the font features from the normalized word image. 2D isotropic Gabor filters are well known in the art; CN1271140A describes extracting the features of texture images with such Gabor filters, and details are described below.
Specifically, 12 isotropic Gabor filters with 6 angles (0, 30, 60, 90, 120 and 150 degrees) and two frequency values (0.14 and 0.11) are used. For the sake of speed and accuracy, the present embodiment splits each complex Gabor-filtered image into 2 images, the "real part" and the "imaginary part". The mean and deviation are then calculated from each part. Thus 48 features are extracted to form a 48-dimensional feature vector (12 × 2 × 2).
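The feature count above (12 filters × 2 parts × 2 statistics = 48) can be illustrated with the sketch below. Only the angles and the two frequency values come from the text. The Gaussian width `sigma`, the kernel size, the reading of the frequencies as cycles per pixel and the reading of "deviation" as standard deviation are assumptions made for illustration; the filter form actually used in CN1271140A may differ.

```python
import numpy as np

ANGLES_DEG = (0, 30, 60, 90, 120, 150)  # from the text
FREQUENCIES = (0.14, 0.11)              # from the text; unit assumed: cycles/pixel


def gabor_kernel(freq, theta, sigma=4.0, size=21):
    """Complex Gabor kernel: isotropic Gaussian envelope times a complex carrier.

    sigma and size are hypothetical values chosen for this sketch.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(2j * np.pi * freq * (x * np.cos(theta) + y * np.sin(theta)))
    return envelope * carrier


def font_features(word_img):
    """Return 48 values: 12 filters x (real, imaginary) x (mean, std)."""
    img = np.asarray(word_img, dtype=float)
    feats = []
    for freq in FREQUENCIES:
        for deg in ANGLES_DEG:
            k = gabor_kernel(freq, np.deg2rad(deg))
            # Full linear convolution computed with zero-padded FFTs.
            shape = (img.shape[0] + k.shape[0] - 1,
                     img.shape[1] + k.shape[1] - 1)
            resp = np.fft.ifft2(np.fft.fft2(img, s=shape) * np.fft.fft2(k, s=shape))
            for part in (resp.real, resp.imag):
                feats.extend((part.mean(), part.std()))
    return np.array(feats)
```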
Step 300 of accompanying drawing 11 judges the Interval Type of the word.
According to the intervals occupied by the letters in a word, words can be classified into four Interval Types (X height words, ascender words, descender words and full-height words).
English words of different Interval Types usually have different texture features, because of the different combinations of characters (strokes) in the different intervals. It is therefore proposed to use different dictionaries containing words of the different Interval Types. The inventors have found by experiment that descender words can be merged with ascender words into one type, because the heights of the upper interval and the lower interval are roughly the same. The inventors have also found that all-capital words have texture features different from those of other ascender words, so the present embodiment separates them from the ascender words.
Finally, there are four dictionaries: the dictionary for X height words, the dictionary for ascender and descender words, the dictionary for full-height words and the dictionary for all-capital words.
All-capital words do not usually appear in English text, and it is very difficult to judge whether a word is all-capital without knowing the details of each letter in it. Therefore, under normal circumstances, this processing can only obtain the Interval Type of ascender/descender, full-height or X height words. The Interval Type of an English word cannot always be obtained, particularly when the word is very short.
Accompanying drawing 12 is depicted as the process flow diagram of expansion of the step 400 of accompanying drawing 11.
If the Interval Type of the word is known, only one recognition with the dictionary of that Interval Type is required. Each dictionary of a given Interval Type holds at least 40 detailed candidate fonts (10 fonts × 2 weights × 2 slopes). The feature vector extracted from the normalized word image is matched against the feature vector of each detailed candidate font in the dictionary of the given word Interval Type. In addition, a distance value is calculated, which represents the mathematical distance between the feature vector extracted from the normalized word image and the feature vector of a detailed candidate font in the dictionary of the given word Interval Type.
In this classification step a Bayes classifier is adopted, although other classifiers, such as a minimum distance classifier, are also applicable. These classifiers, including the Bayes classifier, are well known in the art and are not described in detail here.
In step 500 of accompanying drawing 11, the result of step 400 is verified. D0 is the distance value from the Bayes classifier, and TH0 (=-480) is an empirical value obtained by experiment. If the resulting distance value D0 is less than TH0, the result is reliable. Otherwise, the word Interval Type is determined to be possibly incorrect, and step 600 is executed to perform recognition with multiple dictionaries in priority order.
Accompanying drawing 13 is depicted as the expansion process flow diagram of the step 600 of accompanying drawing 11.
The processing of this step, shown in accompanying drawing 13, handles situations in which the word Interval Type could not be obtained or may be incorrect, namely when the Interval Type is determined to be unknown in step 300 or when the result of step 400 fails verification in step 500. If the tops of all letters in a word are at the same height, it is difficult to distinguish the top line from the upper line. Similarly, if the bottoms of all letters in a word are at the same height, it is difficult to distinguish the baseline from the bottom line. Therefore, in general, the Interval Type of X height words and of words consisting entirely of ascender letters is harder to obtain than that of ascender and descender words, and the Interval Type of full-height words is the easiest to obtain. The priority order of dictionary selection in this step is accordingly as shown in accompanying drawing 13.
The distance values D1, D2, D3 and D4 come from the Bayes classifier; each represents the mathematical distance between the feature vector extracted from the normalized word image and the detailed candidate fonts in one of the four word Interval Type dictionaries. TH1 (=-500) is an empirical value obtained by experiment. If Di (i=1, 2, 3, 4) is less than TH1, the result recognized with the corresponding dictionary is very reliable; recognition with the remaining dictionaries is skipped and that result is output as the final result. Otherwise, the distance values D1, D2, D3 and D4 are compared with one another to find the minimum distance, and the corresponding detailed font (selected from the 40) is the final font.
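The control flow of steps 400 to 600 can be sketched as below. The thresholds TH0 and TH1 are the values given in the text. Everything else is an assumption for illustration: `classify` stands in for the Bayes classifier and is supplied by the caller, `dictionaries` maps hypothetical Interval Type keys to dictionaries, and `priority` is the trial order, which the text leaves to accompanying drawing 13.

```python
TH0 = -480  # empirical threshold for the single-dictionary result (from the text)
TH1 = -500  # empirical threshold for the multi-dictionary pass (from the text)


def recognize_font(features, interval_type, dictionaries, priority, classify):
    """Return (detailed_font, distance); a smaller distance means a better match.

    classify(features, dictionary) -> (detailed_font, distance) is a
    caller-supplied stand-in for the Bayes classifier.
    """
    # Steps 400/500: one recognition with the dictionary of the known type.
    if interval_type in dictionaries:
        font, dist = classify(features, dictionaries[interval_type])
        if dist < TH0:
            return font, dist
    # Step 600: try the dictionaries in priority order, stopping early
    # as soon as one result is reliable enough.
    best = (None, float("inf"))
    for key in priority:
        font, dist = classify(features, dictionaries[key])
        if dist < TH1:
            return font, dist
        if dist < best[1]:
            best = (font, dist)
    return best  # otherwise the minimum-distance result wins
```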
The application of embodiment 2
The method introduced above can be used for page layout analysis and restoration. It can discriminate the font information of each word in an English document image.
It can also be used in omnifont OCR systems. The font information of an English word can be predicted and used to select an OCR dictionary of the specific font, improving OCR accuracy.
The above flow is illustrated in simplified form by accompanying drawing 14: block selection, row segmentation and word segmentation are performed on the English document image to obtain word images, word Interval Type discrimination is carried out, and the processing defined by the present embodiment is then performed. The font information obtained for each word can be used for layout restoration, single-font OCR and the like.
The present embodiment 2 was compared with popular OCR software on the market that offers font recognition (layout restoration). The accuracy of embodiment 2 is far higher.
The accuracies of font recognition are listed in Table VII.
Benchmark | Omnipage 11.0 | FineReader 5.0 | TypeReader Professional 6.0 | TextBridge Pro Millennium | Embodiment 2
Benchmark test program 1 | 46.24% | 30.09% | 30.39% | 27.75% | 95.57%
Benchmark test program 2 | 33.86% | 20.36% | 28.96% | 22.98% | 88.73%
Benchmark test program 3 | 30.49% | 10.09% | 19.49% | 14.52% | 95.24%
Benchmark test program (Benchmark) 1: printed text rows in which all words have a uniform font (8020 rows, 93344 words in total).
Benchmark test program 2: printed text rows with 2 different fonts (1560 rows, 18009 words in total).
Benchmark test program 3: printed text rows with 3 different fonts (288 rows, 3637 words in total).
In brief, the present embodiment provides a character recognition method comprising: a normalization step of normalizing a word of the input image to a predetermined height; a feature extraction step of extracting features from the normalized word; a judgment step of judging the Interval Type of the word; a classification step; and a recognition step of recognizing the font information of the word based on the result of the classification step.
This character recognition method normalizes the height of the word image to a given height, so that features of the same scale are extracted from word images of the same Interval Type (see step 100 in accompanying drawing 11), because in font recognition the dictionaries are classified according to the Interval Type of the word. In font classification, the word Interval Type is considered first in order to select the corresponding "Interval Type" dictionary (see accompanying drawing 12). When the resulting distance is greater than the given threshold or the word Interval Type is unknown, multi-dictionary recognition arranged by a priori priority is adopted (see accompanying drawing 13). This embodiment therefore makes full use of the Interval Type information, and the whole OCR process becomes more efficient.
(embodiment 3)
Embodiment 3 adopts a word pair mechanism in the OFR processing of row images. This mechanism is based on the font classification method for English words and also takes into account the characteristics of font distribution in real English text. It uses a two-stage result adjustment technique based on contextual font information, which achieves higher accuracy and more consistent output within a text row. An accurate font size recognition method is also adopted in the present embodiment.
It supports the recognition of 10 fonts, far more than common OCR software with document layout restoration functions (such as Omnipage and FineReader). It also achieves much higher accuracy (font and font size) than other software.
Accompanying drawing 15 is depicted as the main flow chart of present embodiment 3.
In step 100 of accompanying drawing 15, the X height values and Interval Types of all words in the row are calculated first. Preferably, the X height values and Interval Types are obtained by the method described in embodiment 1 above; alternatively, prior-art methods may be used, such as the character height distribution histogram method disclosed in US5883974 or the vertical projection profile disclosed in, for example, "Optical font recognition from projection profiles" (Electronic Publishing, Vol. 6(3), 249-260 (September 1993)).
In step 200 of accompanying drawing 15, the present embodiment divides all words in the row into word pairs, starting from the first word of the row.
It is known from experiment that the recognition result of a longer word (one with a larger word width) is more reliable than that of a shorter word.
Each word pair comprises two adjacent words (if the number of words in the row is odd, the last word pair comprises only one word). The fonts of the longer word and the shorter word are discriminated by different methods, as described below.
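The word pair division of step 200 is simple enough to sketch directly. The function names and the `width` callback (returning the pixel width of a word image) are hypothetical, and treating the single word of an odd trailing pair as the "longer" word is an assumption.

```python
def split_into_pairs(words):
    """Group the words of one row into pairs, starting from the first word.

    An odd number of words leaves a final pair containing a single word.
    """
    return [words[i:i + 2] for i in range(0, len(words), 2)]


def longer_and_shorter(pair, width):
    """Return (longer, shorter) for a pair; shorter is None for a single word.

    width(word) is a caller-supplied function giving the word width.
    """
    if len(pair) == 1:
        return pair[0], None  # assumption: a lone word is handled as the longer one
    a, b = pair
    return (a, b) if width(a) >= width(b) else (b, a)
```

For example, `split_into_pairs(["The", "quick", "brown", "fox", "jumps"])` yields two full pairs followed by the single-word pair `["jumps"]`.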
In step 300 of accompanying drawing 15, the longer word of each word pair can be recognized by any of various methods for classifying the font of an English word. These include the method disclosed in embodiment 2 above, the word-based "English word identification" method disclosed in "Script identification in printed bilingual documents" by D. Dhanya, A. G. Ramakrishnan and Peeta Basa Pati (Sadhana, Vol. 27, Part 1, February 2002, pp. 73-82, printed in India), and the a posteriori font recognition methods disclosed in US6,337,924 and US6,496,600.
In step 400 of accompanying drawing 15, the font of the shorter word of each word pair is discriminated.
In this step, two facts must be considered first.
1) In real text rows, if the two longer words surrounding a shorter word (to its left and right) have the same detailed font, the detailed font of the shorter word is very likely the same as well.
2) As mentioned for step 200, in English word font classification the font recognition result of a longer word (one with a larger word width) is more reliable than that of a shorter word.
In view of these facts, the flow chart shown in accompanying drawing 16 is used to carry out step 400 of accompanying drawing 15.
In step 410 of accompanying drawing 16, it is first judged whether the word to be recognized is at the edge of the text row. If not, the word to be recognized must have two longer words around it.
In steps 420 to 460 of accompanying drawing 16, the detailed fonts of the two longer words surrounding the word to be recognized are obtained and compared with each other. (Font 1 is set to the font of the longer word of the same word pair, and font 2 is set to the font of the longer word of the previous or next word pair.)
Step 420 determines whether the shorter word of the current word pair is the first word of the pair. If the result of step 420 is yes, the detailed font of the longer word of the previous word pair is set as "font 2"; otherwise, in step 440, the detailed font of the longer word of the next word pair is set as "font 2".
In step 450, the detailed font of the longer word of the current word pair is set as "font 1".
Step 460 determines whether "font 1" equals "font 2". If the detailed fonts of the two longer words are the same, the shorter word of the current word pair need not be recognized; its detailed font is also "font 1" (step 480). Otherwise, the shorter word of the current word pair must be recognized by English word font classification (step 470). These methods were introduced in the description of step 300 and are not described again here.
With this flow chart, the font of the shorter word of each word pair is discriminated with higher accuracy, higher speed and more consistent output.
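The decision logic of steps 410 to 480 can be sketched as follows. The names are hypothetical: `long_fonts[i]` holds the detailed font already recognized for the longer word of pair `i`, and `classify_short` is a caller-supplied function that runs word font classification on the shorter word itself. Falling back to direct classification when the shorter word sits at the edge of the row is an assumption; the text does not spell out that branch.

```python
def short_word_font(pair_index, short_is_first, long_fonts, classify_short):
    """Decide the detailed font of the shorter word of pair `pair_index`."""
    # Steps 420-440: the second longer word comes from the previous pair if
    # the shorter word is first in its pair, otherwise from the next pair.
    neighbour = pair_index - 1 if short_is_first else pair_index + 1
    if neighbour < 0 or neighbour >= len(long_fonts):
        return classify_short()  # step 410: word at the row edge (assumed branch)
    font1 = long_fonts[pair_index]  # step 450
    font2 = long_fonts[neighbour]
    if font1 == font2:              # step 460
        return font1                # step 480: inherit without recognition
    return classify_short()         # step 470: recognize the shorter word itself
```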
Now that the font result of each word has been obtained, step 500 of accompanying drawing 15 adjusts the font results based on the font information of adjacent words.
Accompanying drawing 17 is depicted as the expansion process flow diagram of the step 500 of accompanying drawing 15.
By comparing the font result of each word with a dictionary containing the standard patterns of a plurality of fonts, three candidate fonts are extracted for each word in order of similarity, the first candidate font having the greatest matching similarity to the font result of the word; the corresponding distance values are obtained at the same time. Three candidate fonts are preferred here on the basis of practice. However, two candidate fonts, or more than three, may also be used in the present embodiment according to actual requirements.
In step 510 of accompanying drawing 17, the distribution of detailed fonts in the row is counted according to the first candidate fonts of the words.
The remaining steps are then repeated in a loop for each word in the row.
When the current word is the first or last word in the row, the processing of the present embodiment judges whether the detailed font of the row is roughly consistent (step 520 of accompanying drawing 17). The criterion is that more than 80% of the words in the row have the same detailed font or, when the number of words in the row does not exceed 6, more than 60% of the words in the row have the same detailed font. If the detailed font of the row is not roughly consistent, the current word is not adjusted and the loop continues with the other words in the row.
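The consistency criterion of step 520 can be written out as a short sketch. The function names are illustrative; the 80%, 60% and 6-word figures are the ones stated above.

```python
from collections import Counter


def main_font(first_candidates):
    """Return (most frequent detailed font, its share of the words in the row)."""
    font, count = Counter(first_candidates).most_common(1)[0]
    return font, count / len(first_candidates)


def row_is_roughly_consistent(first_candidates):
    """Step 520 criterion applied to the first candidate fonts of a row."""
    _, share = main_font(first_candidates)
    if share > 0.8:
        return True
    # Relaxed criterion for short rows of at most 6 words.
    return len(first_candidates) <= 6 and share > 0.6
```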
If step 520 finds the detailed font of the row roughly consistent, processing goes to step 521 of accompanying drawing 17 to judge whether the first candidate font of the current word is consistent with the main font of the row. If so, the loop continues with the other words in the row.
If step 521 finds that the first candidate font of the current word is inconsistent with the main font of the row, processing goes to step 522 of accompanying drawing 17 to judge whether the second candidate font of the current word is consistent with the main font of the row and whether the corresponding distance satisfies a specified condition. If both hold, the second candidate font is swapped with the first candidate font; otherwise processing goes to step 523 of accompanying drawing 17 to make the same judgment for the third candidate font.
If step 523 finds that the third candidate font of the current word is consistent with the main font of the row and that the corresponding distance satisfies the specified condition, the third candidate font is swapped with the first candidate font; otherwise processing goes to step 530 of accompanying drawing 17 to judge whether the candidate fonts are reliable. If the candidate fonts are unreliable, that is, the distances of all candidate fonts are large, the recognition is considered not to have been completed well, and the first candidate font is set to the font of the adjacent word (step 540 of accompanying drawing 17).
When the current word is not the first or last word in the row, step 550 of accompanying drawing 17 judges whether the detailed font of the row is roughly consistent and whether the first candidate font of the current word is consistent with the main font of the row. The criterion is that more than 80% of the words in the row have the same detailed font.
Step 551 of accompanying drawing 17 further judges whether the detailed fonts of the 2 adjacent words are the same and whether the first candidate font differs from them.
Steps 552 and 553 of accompanying drawing 17 further judge whether the second and third candidate fonts are consistent with the 2 adjacent words and whether the distances of those candidate fonts satisfy the specified condition.
If step 552 finds that the second candidate font of the current word is consistent with the 2 adjacent words and that the corresponding distance satisfies the specified condition, the second candidate font is swapped with the first candidate font; otherwise processing goes to step 553 to judge the third candidate font. If step 553 finds that the third candidate font of the current word is consistent with the 2 adjacent words and that the corresponding distance satisfies the specified condition, the third candidate font is swapped with the first candidate font; otherwise processing goes to step 560 of accompanying drawing 17 to judge whether the candidate fonts are reliable. If the candidate fonts are unreliable, that is, the distances of all candidate fonts are large, the recognition is considered not to have been completed well, and the first candidate font is set to the font of the adjacent word (step 570 of accompanying drawing 17).
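The candidate swap used in steps 521 to 540 and 551 to 570 follows one pattern, sketched below. The target font is the main font of the row for edge words and the font shared by the 2 adjacent words for interior words. The text does not state the "specified condition" on the distance or what counts as a "large" distance, so `max_distance` and `unreliable_distance` are hypothetical parameters, and keeping the original distance when the neighbour's font is substituted is also an assumption.

```python
def adjust_first_candidate(candidates, target_font, max_distance,
                           unreliable_distance, neighbour_font):
    """Return the adjusted candidate list for one word.

    candidates: up to three (font, distance) tuples, best match first;
    a smaller distance means a better match.
    """
    if candidates[0][0] == target_font:
        return list(candidates)  # already consistent, nothing to do
    # Try the second, then the third candidate.
    for i in range(1, len(candidates)):
        font, dist = candidates[i]
        if font == target_font and dist <= max_distance:
            adjusted = list(candidates)
            adjusted[0], adjusted[i] = adjusted[i], adjusted[0]
            return adjusted
    # All candidates poor: fall back to the font of the adjacent word.
    if all(dist > unreliable_distance for _, dist in candidates):
        return [(neighbour_font, candidates[0][1])] + list(candidates[1:])
    return list(candidates)
```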
Step 600 of accompanying drawing 15 further adjusts the font results based on the font distribution information of the row.
Accompanying drawing 18 is depicted as the expansion process flow diagram of the step 600 of accompanying drawing 15.
In step 610 of accompanying drawing 18, the distributions of detailed font, serif and spacing in the row are counted according to the first candidate fonts of the words after adjustment.
Step 620 of accompanying drawing 18 judges whether serif and spacing are absolutely uniform in the row and whether the detailed font is roughly consistent.
Employed condition is as follows:
The serif of the first candidate fonts of all words in the row is identical;
The spacing of the first candidate fonts of all words in the row is identical;
More than 75% but not 100% of the words in the row have the same detailed font.
When all three conditions are satisfied, the first candidate font of every word in the row is set to the main detailed font of the row, and the first candidate font of each word is used as the recognized font result. When any one of the three conditions is not satisfied, the result of step 620 is no; the processing of this embodiment performs no adjustment, and the current first candidate font of each word is used as the recognized font result.
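The three conditions of step 620 and the resulting adjustment can be sketched as below. The representation of each word as a dict with `font`, `serif` and `spacing` keys (describing its first candidate font) is an assumption made for this example.

```python
from collections import Counter


def coarse_adjust(words):
    """Return the final detailed font of every word in one row.

    words: list of dicts with hypothetical keys 'font', 'serif', 'spacing'.
    """
    same_serif = len({w["serif"] for w in words}) == 1
    same_spacing = len({w["spacing"] for w in words}) == 1
    font, count = Counter(w["font"] for w in words).most_common(1)[0]
    share = count / len(words)
    # More than 75% but not 100% of the words share the main detailed font.
    if same_serif and same_spacing and 0.75 < share < 1.0:
        return [font] * len(words)
    return [w["font"] for w in words]  # no adjustment
```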
In step 700 of accompanying drawing 15, the font size is calculated from the image resolution, the recognized font and the word X height (or word Interval Type).
Accompanying drawing 19 is depicted as the expansion process flow diagram of the step 700 of accompanying drawing 15.
If the X height of the word is available (from step 100 of accompanying drawing 15), the left branch is executed.
In step 720 of accompanying drawing 19, the "image resolution/font/font size/X height" table of the font size dictionary is queried with the known input image resolution and the recognized font, yielding a list of the X heights of different font sizes.
In step 730 of accompanying drawing 19, the list is searched for the X height nearest to the X height of the input word. The nearest X height is the entry with the minimum $|x - x_i| / x_i$. The corresponding font size is the recognized size. Here $x$ is the X height of the input word and $x_i$ is the i-th X height in the list.
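The nearest-entry search of step 730 amounts to minimizing the relative difference over the list. In the sketch below, `x_height_table` is a hypothetical mapping from font size to the expected X height in pixels for the recognized font at the input resolution; the numbers in the usage example are made up for illustration.

```python
def identify_font_size(x_height, x_height_table):
    """Return the font size whose tabulated X height x_i minimizes |x - x_i| / x_i."""
    return min(
        x_height_table,
        key=lambda size: abs(x_height - x_height_table[size]) / x_height_table[size],
    )


# Hypothetical table for one font at one resolution: {font size: X height in pixels}
example_table = {10: 14.0, 12: 17.0, 14: 20.0}
recognized_size = identify_font_size(16.0, example_table)
```

The same function serves the right branch (steps 750 and 760) if the table holds word heights instead of X heights.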
Otherwise, if the X height of the word is unavailable or inaccurate, the right branch is executed.
Data 740 is the word Interval Type (the selected dictionary) from font classification.
Steps 750 and 760 of accompanying drawing 19 are similar to steps 720 and 730; the only difference is that they are based on the word height rather than the X height. Step 750 of accompanying drawing 19 queries the a priori font size dictionary, which contains an "image resolution/font/font size/word height" table, with the known input image resolution and the recognized font, and obtains a list of the word heights of different font sizes; step 760 of accompanying drawing 19 searches this list for the word height nearest to that of the input word; and the corresponding font size is taken as the recognized size.
The application of embodiment 3
The method introduced in embodiment 3 can be used for page layout analysis and restoration. It can discriminate the font information of English text rows with higher accuracy and speed.
It can also be used in omnifont OCR systems. The font information can be predicted and used to select an OCR dictionary of the specific font, improving OCR accuracy.
The flow can be simplified as in accompanying drawing 20: block selection and row segmentation are performed on the English document image to obtain row images, and the processing defined by the present embodiment is then carried out. The font information obtained can be used for layout restoration, single-font OCR and the like.
The applicant compared the present embodiment with popular OCR software on the market that offers font recognition (layout restoration). The accuracy of the present embodiment is far higher in terms of both font and font size. The applicant also compared the present embodiment with a variant in which the single English word font classification of embodiment 2 above replaces steps 200-600. The present embodiment improves both accuracy and speed.
Table VIII shows the comparison of font recognition accuracy:
Benchmark | Omnipage 11.0 | FineReader 5.0 | TypeReader Professional 6.0 | TextBridge Pro Millennium | Embodiment 2 | Embodiment 3
Benchmark test program 1 | 46.24% | 30.09% | 30.39% | 27.75% | 95.57% | 99.16%
Benchmark test program 2 | 33.86% | 20.36% | 28.96% | 22.98% | 88.73% | 90.33%
Benchmark test program 3 | 30.49% | 10.09% | 19.49% | 14.52% | 95.24% | 97.09%
Table IX shows the comparison of font size recognition accuracy:
Benchmark | Omnipage 11.0 | FineReader 5.0 | TypeReader Professional 6.0 | TextBridge Pro Millennium | Embodiment 2 | Embodiment 3
Benchmark test program 1 | 45.83% | 4.65% | 38.96% | 15.22% | 61.25% | 61.81%
Benchmark test program 2 | 40.42% | 10.54% | 57.79% | 18.56% | 81.97% | 82.93%
Benchmark test program 3 | 21.14% | 20.76% | 23.62% | 10.61% | 76.02% | 76.82%
In terms of the "benchmark test program speed", the present embodiment is 28.9% faster than embodiment 2 above.
Benchmark test program 1: printed text rows in which all words have a uniform font (8020 rows, 93344 words in total)
Benchmark test program 2: printed text rows with 2 different fonts (1560 rows, 18009 words in total)
Benchmark test program 3: printed text rows with 3 different fonts (288 rows, 3637 words in total)
Benchmark test program speed: printed text rows selected from benchmark test programs 1-3 (201 rows, 2659 words in total)
In brief, the present embodiment introduces an optical font recognition method comprising: a division step of dividing the words of the input text image into word pairs row by row; a font recognition step of recognizing the font information of the longer word of each word pair; a font discrimination step of discriminating the font information of the shorter word of each word pair, based on the font information of the longer word of the word pair containing the shorter word and the font information of the longer word of the word pair adjacent to the shorter word; a fine adjustment step of adjusting the font information of a word according to the font information of adjacent words; a coarse adjustment step of adjusting the font information of a row according to the font information distribution within the row; and a recognition step of recognizing the font size of the words in the row.
The present embodiment introduces a word pair mechanism into the row font recognition process. This mechanism takes into account the characteristics of word-level font recognition and the characteristics of font distribution in real English text (see steps 200 and 300 and accompanying drawing 16). The embodiment also introduces post-processing that adopts a two-stage result adjustment technique (based respectively on the font information of adjacent words and on the font distribution information of the row), which achieves higher accuracy and more consistent output within a text row (see accompanying drawings 17 and 18). The present embodiment further adopts an accurate size recognition technique that combines "Interval Type information" and the "X height value" (see accompanying drawing 19). As a result, the method introduced in the present embodiment achieves higher accuracy and more consistent output within a text row, supports the recognition of 10 fonts, far more than popular OCR software with document layout restoration functions such as Omnipage and FineReader, and achieves much higher accuracy (font and font size) than other software.
(Embodiment 4)
As described above, the first, second and third embodiments can be used separately or in combination. Fig. 21 is a flowchart of optical character recognition according to a fourth embodiment of the present invention, in which the above three embodiments can be used.
Embodiment 1 can be used for interval type determination and X-height calculation on word images: in step 100, Embodiment 1 is used to calculate the X-height and interval type of a word. Both are very useful information for font classification and font recognition.
Steps 200, 300 and 400 relate to the word image normalization, font feature extraction and font classification described in Embodiment 2. After font classification, the detailed font (taking weight and slope into account together) is obtained, and it can also be used, together with the updated word interval type, to recognize the font size, for example in step 600. The font information of an English word (font, font size, weight, slope, spacing and serif) can thus be obtained accurately at the same time, for example in step 500.
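The font size recognition mentioned for step 600 can be illustrated by a table lookup of the kind recited in the claims, where an a priori dictionary maps image resolution, font and font size to an X-height. The sketch below is hypothetical: the table layout, the function name `recognize_font_size` and the nearest-value matching rule are assumptions, and any numbers placed in such a table would have to be measured for the fonts in question.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical a priori table:
# (resolution in dpi, font name) -> [(font size in points, X-height in pixels), ...]
SizeTable = Dict[Tuple[int, str], List[Tuple[float, float]]]


def recognize_font_size(table: SizeTable, dpi: int, font: str,
                        x_height: float) -> Optional[float]:
    """Return the font size whose tabulated X-height is closest to `x_height`.

    Nearest-value matching is an assumption; the source only says the
    measured X-height is matched against the X-height list.
    """
    rows = table.get((dpi, font))
    if not rows:
        return None  # no prior knowledge for this resolution and font
    size, _ = min(rows, key=lambda row: abs(row[1] - x_height))
    return size
```

When the X-height of a word is unavailable, the same lookup could be run against a table keyed on word height instead, as the claims describe.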
In addition, Embodiments 1 to 4 are not limited to English words; based on their processing principles, these embodiments can be applied directly to other languages written in the Roman alphabet.
In the above description, the preferred embodiments of the present invention have been described as methods or software programs. It is readily understood that the present invention is preferably used on any well-known computer system, such as a personal computer; such computer systems are therefore not discussed in detail. Note that an image may be input to the computer system directly (for example, from a digital camera) or digitized before being input to the computer system (for example, by scanning).
Furthermore, as used herein, a computer-readable storage medium storing a computer program for carrying out the above methods may include, for example: magnetic storage media such as magnetic disks (for example, floppy disks) or magnetic tape; optical storage media such as optical discs, optical tape or machine-readable bar codes; solid-state electronic storage devices such as random-access memory (RAM); or any other physical device or medium used to store a computer program.
In addition, those of ordinary skill in the art will readily recognize that an equivalent of the above software can also be designed in hardware.
With reference to Embodiment 3, Fig. 22 shows an apparatus configured to carry out the optical font recognition process described above. In brief, the apparatus comprises:
a dividing device 100 that divides the words of the input text image into word pairs, line by line;
a font recognition device 101 that recognizes the font information of the longer word in each word pair;
a font determination device 102 that determines the font information of the shorter word in each word pair, based on the font information of the longer word in the same pair and the font information of the longer word in the adjacent word pair;
a fine adjustment device 103 that adjusts the font information of a word according to the font information of its adjacent words;
a coarse adjustment device 104 that adjusts the font information of a line according to the distribution of font information within that line; and
a recognition device 105 that recognizes the font size of the words in the line.
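As an illustration of what the coarse adjustment device 104 does, the sketch below counts the distribution of first candidate fonts in a line and, if one font dominates, assigns it to every word. It is a simplified sketch: the consistency criterion (a minimum share `min_share` of the words) is an assumption, and the serif and spacing distributions that the claims also consider are left out.

```python
from collections import Counter
from typing import List


def coarse_adjust(candidates: List[List[str]],
                  min_share: float = 0.8) -> List[str]:
    """Return one font per word after line-level adjustment.

    candidates[i] is the ranked candidate font list of word i, best first.
    `min_share` is a hypothetical threshold for calling the line consistent.
    """
    firsts = [ranked[0] for ranked in candidates]
    if not firsts:
        return []
    dominant, count = Counter(firsts).most_common(1)[0]
    if count / len(firsts) >= min_share:
        # The line is consistent enough: use the dominant font everywhere.
        return [dominant] * len(firsts)
    # Otherwise keep each word's own first candidate.
    return firsts
```

A line with mixed fonts is left untouched by this rule, so deliberate font changes within a line survive the adjustment.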
With reference to Embodiment 2, Fig. 23 shows an apparatus configured to carry out the font recognition process described above. In brief, the apparatus comprises:
a normalization device 200 that normalizes a word of the input image to a predetermined height;
a feature extraction device 201 that extracts features from the normalized word;
a judgment device 202 that judges the interval type of the word;
a classification device 203; and
a recognition device 204 that recognizes the font information of the word based on the result obtained by the classification device.
The classification device 203 further comprises a comparison device, which compares the extracted features of the normalized word with the features of the candidate fonts in the dictionary for the judged word interval type, and an obtaining device, which obtains a distance value from the comparison. The font recognition apparatus may therefore further comprise a verification device that checks the obtained distance value against a predetermined threshold. If the verification device determines that the distance value is not less than the predetermined threshold, a fine classification device that uses a plurality of dictionaries is applied. This fine classification device is configured to:
compare the extracted features of the normalized word with the features of the candidate fonts in the dictionary of at least one word interval type other than the judged word interval type;
obtain a distance value from each such comparison; and
check that distance value against another predetermined threshold, so that the font information of the word is recognized based on the result of the check.
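The two-stage classification described above can be sketched as follows. This is an illustrative sketch under stated assumptions: feature vectors are plain lists of floats, the distance is Euclidean, the dictionary names are placeholders, and returning `None` when no dictionary passes the check is a choice made here, not something the source specifies.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical dictionary layout: font name -> reference feature vector.
FontDictionary = Dict[str, Sequence[float]]


def nearest_font(features: Sequence[float],
                 dictionary: FontDictionary) -> Tuple[str, float]:
    """Return the closest candidate font and its distance value."""
    return min(((font, math.dist(features, ref))
                for font, ref in dictionary.items()),
               key=lambda pair: pair[1])


def recognize_font(features: Sequence[float],
                   interval_type: Optional[str],
                   dictionaries: Dict[str, FontDictionary],
                   threshold: float,
                   fine_threshold: float) -> Optional[str]:
    # Classification: use the dictionary of the judged interval type, if any.
    if interval_type in dictionaries:
        font, dist = nearest_font(features, dictionaries[interval_type])
        if dist < threshold:
            return font
    # Fine classification: try the dictionaries of the other interval types.
    for name, dictionary in dictionaries.items():
        if name == interval_type:
            continue
        font, dist = nearest_font(features, dictionary)
        if dist < fine_threshold:
            return font
    return None
```

Because Python dictionaries preserve insertion order, the order in which `dictionaries` is built determines the order in which the fallback dictionaries are tried.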
In addition, the fine classification device is also used if the judgment device 202 fails to judge the interval type of the word.
In view of this, Fig. 24 shows a modified apparatus for carrying out the font recognition method, comprising: a normalization device 300 that normalizes the input image to a predetermined height; a feature extraction device 301 that extracts features from the normalized word; a fine classification device 302 that directly uses a plurality of dictionaries; and a recognition device 303 that recognizes the font information of the word based on the result obtained by the fine classification device.
With reference to Embodiment 1, Fig. 25 shows an apparatus configured to carry out all of the above processes for determining the interval types in an input text line. In brief, the apparatus comprises:
a line information calculation device 400 that calculates the line information of the input text line using the projection method;
a first word information calculation device 401 that calculates the word information of a selected word in the text line using the projection method;
a reliability judgment device 402 that judges whether the word information is reliable;
a second word information calculation device 403 that uses the connected component method to calculate the baseline and top line of a selected word that the reliability judgment device has not judged to be reliable; and
an interval type labeling device 404 that labels the interval type of the word according to the line information and the word information.
The apparatus for determining the interval types in an input text line may further comprise a judgment device that judges whether a selected word in the text line is a short word; the first word information calculation device 401 then calculates the word information of a word judged to be short separately, by applying the projection method within a specified region of the word interval.
In addition, the apparatus for determining the interval types in an input text line can obviously be modified into an apparatus that calculates the X-height of the words in the input text line by using the interval types obtained by that apparatus.
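To make the projection-based part of this apparatus more concrete, the sketch below finds the top and bottom ink rows of a word from its horizontal projection profile and then labels an interval type by comparing them with the x-line and baseline of the text line. It is only a sketch: the binary image format, the label names, the tolerance `tol` and the labeling rule are assumptions, and neither all-capital words nor the connected component fallback for unreliable words is handled.

```python
from typing import List, Optional, Tuple


def vertical_extent(image: List[List[int]]) -> Optional[Tuple[int, int]]:
    """Return (top_row, bottom_row) of the ink in a binary word image.

    The image is a list of rows with 1 for ink and 0 for background.
    The horizontal projection profile is the ink count of each row.
    """
    profile = [sum(row) for row in image]
    inked = [y for y, total in enumerate(profile) if total > 0]
    return (inked[0], inked[-1]) if inked else None


def label_interval_type(word_top: int, word_bottom: int,
                        x_line: int, baseline: int, tol: int = 1) -> str:
    """Label a word from its extent relative to the line's x-line and baseline.

    Row indices grow downward. `tol` is a hypothetical pixel tolerance.
    """
    has_ascender = word_top < x_line - tol
    has_descender = word_bottom > baseline + tol
    if has_ascender and has_descender:
        return "full-height"
    if has_ascender:
        return "ascender"
    if has_descender:
        return "descender"
    return "x-height"
```

With the x-line and baseline of the text line known, the X-height of the line follows as the vertical distance between them, which is consistent with the modification described above.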
The present invention has been described with reference to specific embodiments. It should be understood that the present invention is not limited to the above description, and that those of ordinary skill in the art can make various changes and modifications to it without departing from the spirit and scope of the invention.

Claims (25)

1. An optical font recognition method, comprising the following steps:
a dividing step of dividing the words of an input text line image into a plurality of word pairs, wherein each of the plurality of word pairs comprises two adjacent words;
a font recognition step of recognizing, by using an English word font classification method, the font information of the long word, which is the longer of the two adjacent words in a word pair;
a font determination step of determining the font information of a short word based on the font information of the long word in the word pair containing the short word and on the font information of the long word in another word pair adjacent to the short word, wherein the short word is the shorter of the two adjacent words in a word pair,
wherein the font determination step comprises:
comparing the font information of the long word in the other word pair adjacent to the short word to be recognized with the font information of the long word in the word pair containing the short word to be recognized;
if the font information of the long word in the other word pair adjacent to the short word to be recognized is the same as the font information of the long word in the word pair containing the short word to be recognized, determining that the short word to be recognized has the same font information as the long word in its own word pair and the long word in the adjacent word pair;
if the font information of the long word in the other word pair adjacent to the short word to be recognized differs from the font information of the long word in the word pair containing the short word to be recognized, recognizing the font information of the short word by using the English word font classification method; and
if the short word to be recognized is the first or last word in the input text line image, recognizing the font information of the short word by using the English word font classification method.
2. The optical font recognition method according to claim 1, further comprising a fine adjustment step of adjusting the font information of a word according to the font information of its adjacent words, wherein the fine adjustment step is performed after the font determination step.
3. The optical font recognition method according to claim 2, wherein the fine adjustment step of adjusting the font information comprises:
determining whether the first candidate font of each word in the input text line image is available;
if the first candidate font of the word is determined to be unavailable, determining whether the second candidate font of the word is available; and
if the second candidate font of the word is determined to be available, exchanging the second candidate font of the word with the first candidate font of the word.
4. The optical font recognition method according to claim 3, wherein the fine adjustment step of adjusting the font information further comprises: if the second candidate font of the word is determined to be unavailable, determining whether the third candidate font of the word is available; and
if the third candidate font of the word is determined to be available, exchanging the third candidate font of the word with the first candidate font of the word.
5. The optical font recognition method according to claim 4, wherein, if the third candidate font of the word is determined to be unavailable, it is judged whether all three candidate fonts of the word are reliable; and
if all three candidate fonts of the word are determined to be unreliable, the first candidate font of the word is set to the font of an adjacent word.
6. The optical font recognition method according to claim 5, wherein the fine adjustment step comprises the following preprocessing: comparing the font information of each word in the input text line image with a dictionary containing the font information of a plurality of fonts; and obtaining, in order of similarity, three candidate fonts of the word and three corresponding distance values,
wherein the condition for judging that all three candidate fonts of the word are unreliable comprises all three distance values being greater than a predetermined threshold.
7. The optical font recognition method according to claim 3, wherein the fine adjustment step comprises the following preprocessing: comparing the font information of each word in the input text line image with a dictionary containing the font information of a plurality of fonts;
obtaining, in order of similarity, at least first and second candidate fonts of the word and at least two corresponding distance values; and
counting the distribution of detailed fonts in the input text line image according to the first candidate font of each word.
8. The optical font recognition method according to claim 7, wherein, for the first or last word in the input text line image, the condition for determining that the first candidate font of the word is available comprises:
the detailed fonts in the input text line image are not consistent, or
the first candidate font of the word is consistent with the dominant font of the input text line image.
9. The optical font recognition method according to claim 7, wherein, for the first or last word in the input text line image, the condition for determining that the second candidate font of the word is available comprises:
the second candidate font of the word is consistent with the dominant font of the input text line image and the corresponding distance value is less than a predetermined threshold.
10. The optical font recognition method according to claim 7, wherein the candidate fonts comprise a first candidate font, a second candidate font and a third candidate font, and
for the first or last word in the input text line image, the condition for determining that the third candidate font of the word is available comprises:
the third candidate font of the word is consistent with the dominant font of the input text line image and the corresponding distance value is less than a predetermined threshold.
11. The optical font recognition method according to claim 7, wherein, for each word other than the first and last words in the input text line image, the condition for determining that the first candidate font of the word is available comprises:
the detailed fonts in the input text line image are consistent and the first candidate font of the current word is consistent with the dominant font of the input text line image, or
the first candidate font of the current word differs from the first candidate fonts of the two adjacent words in the input text line image, or the detailed fonts of the two adjacent words are the same.
12. The optical font recognition method according to claim 7, wherein, for each word other than the first and last words in the input text line image, the condition for determining that the second candidate font of the word is available comprises:
the second candidate font of the word is consistent with the two adjacent words and the corresponding distance value is less than a predetermined threshold.
13. The optical font recognition method according to claim 7, wherein the candidate fonts comprise a first candidate font, a second candidate font and a third candidate font, and
for each word other than the first and last words in the input text line image, the condition for determining that the third candidate font of the word is available comprises:
the third candidate font of the word is consistent with the dominant font of the two adjacent words and the corresponding distance value is less than a predetermined threshold.
14. optical font recognition methods according to claim 1 and 2, wherein also comprise coarse adjustment step according to the font information of the font information profile adjustment speech in the capable image of input text, wherein, after differentiating step, described font carries out described coarse adjustment step
The coarse adjustment step of the font information of the capable image of this adjusting input text comprises:
To the serif in the capable image of input text and in the spacing at least one and in detail the distribution of font count;
Determining described in serif and the spacing, whether at least one consistent and whether font is consistent in the capable image of input text in detail, and
Satisfy condition if determine above-mentioned distribution, the first candidate's font that then with main detailed font first candidate's font of all speech in the capable image of this input text is set and uses each speech is as the font result who is discerned.
15. optical font recognition methods according to claim 1 further comprises the identification step of discerning the font size of speech in the capable image of this input text, wherein, carries out described identification step after described font is differentiated step.
16. optical font recognition methods according to claim 15, the identification step of wherein discerning the font size of speech in the capable image of this input text comprises:
Highly whether the input speech X that judges this speech available;
If speech X is highly available in input, then institute's recognition font and the inquiry of input picture resolution with known words comprises the priori font size dictionary of " image resolution ratio/font/font size/X height " table, and the X that obtains different font sizes highly tabulates;
The X of this speech height is highly mated with X in X highly tabulates; And
With the font size of correspondence as the font size of being discerned.
17. optical font recognition methods according to claim 15, the identification step of wherein discerning the font size of speech in the capable image of this input text comprises:
If X is highly unavailable for the input speech, then institute's recognition font and the inquiry of input picture resolution with known words comprises the priori font size dictionary of " image resolution ratio/font/font size/speech height " table, and obtains the speech height tabulation of different font sizes;
The speech height and the speech height in the tabulation of speech height of this speech are mated; And
With the font size of correspondence as the font size of being discerned.
18. A font recognition method, comprising:
a normalization step of normalizing a word of an input image to a predetermined height;
a feature extraction step of extracting features from the normalized word;
a judgment step of judging the interval type of the word;
a classification step of, if the interval type of the word can be obtained in the judgment step, comparing the extracted features of the normalized word with the features of the candidate fonts in the dictionary for the judged word interval type, and obtaining a distance value from the comparison;
a verification step of checking the obtained distance value against a predetermined threshold;
a fine classification step of, if the interval type of the word cannot be obtained in the judgment step or if the distance value is not less than the predetermined threshold in the verification step, comparing the extracted features of the normalized word with the features of the candidate fonts in a plurality of dictionaries; and
a recognition step of recognizing the font information of the word based on the result of the classification step if the distance value is less than the predetermined threshold in the verification step, and recognizing the font information of the word based on the result of the fine classification step if the interval type of the word cannot be obtained in the judgment step or if the distance value is not less than the predetermined threshold in the verification step.
19. The font recognition method according to claim 18, wherein the judgment step comprises a step of judging whether the interval type of the word is known from an external source.
20. The font recognition method according to claim 18, wherein, in the classification step, the dictionary is selected, based on the judged interval type, from at least a dictionary of X-height words, a dictionary of all-capital words, a dictionary of ascender words and descender words, and a dictionary of full-height words.
21. The font recognition method according to claim 18, wherein the dictionary for each interval type has at least 40 types of candidate fonts.
22. The font recognition method according to claim 18, wherein the fine classification step comprises:
comparing the extracted features of the normalized word with the features of the candidate fonts in the dictionary of at least one word interval type other than the judged word interval type;
obtaining a distance value from each of the at least one comparison; and
checking the distance value against another predetermined threshold, so that the font information of the word is recognized based on the result of the check.
23. The font recognition method according to claim 22, wherein the dictionaries of the at least one word interval type to be compared comprise dictionaries in the following order: the dictionary of X-height words, the dictionary of all-capital words, the dictionary of ascender words and descender words, and the dictionary of full-height words.
24. An optical font recognition apparatus, comprising:
a dividing device that divides the words of an input text line image into a plurality of word pairs, wherein each of the plurality of word pairs comprises two adjacent words;
a font recognition device that recognizes, by using an English word font classification method, the font information of the long word, which is the longer of the two adjacent words in a word pair;
a font determination device that determines the font information of a short word based on the font information of the long word in the word pair containing the short word and on the font information of the long word in another word pair adjacent to the short word, wherein the short word is the shorter of the two adjacent words in a word pair,
wherein the font determination device is configured to perform:
comparing the font information of the long word in the other word pair adjacent to the short word to be recognized with the font information of the long word in the word pair containing the short word to be recognized;
if the font information of the long word in the other word pair adjacent to the short word to be recognized is the same as the font information of the long word in the word pair containing the short word to be recognized, determining that the short word to be recognized has the same font information as the long word in its own word pair and the long word in the adjacent word pair;
if the font information of the long word in the other word pair adjacent to the short word to be recognized differs from the font information of the long word in the word pair containing the short word to be recognized, recognizing the font information of the short word by using the English word font classification method; and
if the short word to be recognized is the first or last word in the input text line image, recognizing the font information of the short word by using the English word font classification method.
25. A font recognition apparatus, comprising:
a normalization device that normalizes a word of an input image to a predetermined height;
a feature extraction device that extracts features from the normalized word;
a judgment device that judges the interval type of the word;
a classification device that, if the judgment device can obtain the interval type of the word, compares the extracted features of the normalized word with the features of the candidate fonts in the dictionary for the judged word interval type, and obtains a distance value from the comparison;
a verification device that checks the obtained distance value against a predetermined threshold;
a fine classification device that, if the judgment device cannot obtain the interval type of the word or if the verification device finds that the distance value is not less than the predetermined threshold, compares the extracted features of the normalized word with the features of the candidate fonts in a plurality of dictionaries; and
a recognition device that recognizes the font information of the word based on the result of the classification device if the verification device finds that the distance value is less than the predetermined threshold, and recognizes the font information of the word based on the result of the fine classification device if the judgment device cannot obtain the interval type of the word or if the verification device finds that the distance value is not less than the predetermined threshold.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100228818A CN100550040C (en) 2005-12-09 2005-12-09 Optical character recognition method and equipment and character recognition method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2005100228818A CN100550040C (en) 2005-12-09 2005-12-09 Optical character recognition method and equipment and character recognition method and equipment

Publications (2)

Publication Number Publication Date
CN1979529A CN1979529A (en) 2007-06-13
CN100550040C true CN100550040C (en) 2009-10-14

Family

ID=38130684

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100228818A Expired - Fee Related CN100550040C (en) 2005-12-09 2005-12-09 Optical character recognition method and equipment and character recognition method and equipment

Country Status (1)

Country Link
CN (1) CN100550040C (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100535930C (en) * 2007-10-23 2009-09-02 北京大学 Complex structure file image inclination quick detection method
US8401293B2 (en) * 2010-05-03 2013-03-19 Microsoft Corporation Word recognition of text undergoing an OCR process
CN107305446B (en) * 2016-04-25 2020-08-14 北京字节跳动网络技术有限公司 Method and device for acquiring keywords in pressure sensing area
CN109447055B (en) * 2018-10-17 2022-05-03 中电万维信息技术有限责任公司 OCR (optical character recognition) -based character similarity recognition method
US10984279B2 (en) 2019-06-13 2021-04-20 Wipro Limited System and method for machine translation of text

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE19953610A1 (en) * 1999-02-26 2000-09-07 Hewlett Packard Co Font detection device for optical character recognition system selects system font from table whose width best corresponds to width of font in image
US6272238B1 (en) * 1992-12-28 2001-08-07 Canon Kabushiki Kaisha Character recognizing method and apparatus
CN1460244A (en) * 2001-02-01 2003-12-03 松下电器产业株式会社 Sentense recognition device, sentense recognition method, program and medium

Also Published As

Publication number Publication date
CN1979529A (en) 2007-06-13

Similar Documents

Publication Publication Date Title
US6252988B1 (en) Method and apparatus for character recognition using stop words
US8045798B2 (en) Features generation and spotting methods and systems using same
Aradhye A generic method for determining up/down orientation of text in roman and non-roman scripts
US8494273B2 (en) Adaptive optical character recognition on a document with distorted characters
EP1016033B1 (en) Automatic language identification system for multilingual optical character recognition
JP4477468B2 (en) Device part image retrieval device for assembly drawings
Zahedi et al. Farsi/Arabic optical font recognition using SIFT features
CN104966051A (en) Method of recognizing layout of document image
CN100550040C (en) Optical character recognition method and equipment and character recognition method and equipment
Belaïd et al. Handwritten and printed text separation in real document
Mammeri et al. Road-sign text recognition architecture for intelligent transportation systems
US8340428B2 (en) Unsupervised writer style adaptation for handwritten word spotting
Kumar et al. Line based robust script identification for indianlanguages
Ho Fast identification of stop words for font learning and keyword spotting
CN108596182B (en) Manchu component cutting method
Deng et al. A method for detecting document orientation by using Naïve Bayes classifier
CN108564078B (en) Method for extracting axle wire of Manchu word image
Liu et al. An improved algorithm for Identifying Mathematical formulas in the images of PDF documents
Chen et al. Detection and location of multicharacter sequences in lines of imaged text
CN108564089B (en) Manchu component set construction method
CN108596183B (en) Over-segmentation region merging method for Manchu component segmentation
CN108549896B (en) Method for deleting redundant candidate segmentation lines in Manchu component segmentation
Pourasad et al. Farsi font recognition based on spatial matching
JP2576350B2 (en) String extraction device
Egozi et al. An EM based algorithm for skew detection

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20091014

Termination date: 20161209
