CN109409363A

CN109409363A - Content-based method for judging and correcting inversion of text images

Info

Publication number: CN109409363A
Application number: CN201811192521.6A
Authority: CN
Inventors: 林嘉宇; 刘荧
Original assignee: Changsha Xinxi Electronic Technology Co Ltd
Current assignee: Changsha Xinxi Electronic Technology Co Ltd
Priority date: 2018-10-13
Filing date: 2018-10-13
Publication date: 2019-03-01
Anticipated expiration: 2038-10-13
Also published as: CN109409363B

Abstract

The present invention provides a content-based method for judging and correcting the inversion of text images. The technical solution comprises the following steps: S1, perform line segmentation on the text image; S2, reject non-text rows and retain text lines; S3, take any text line L such that the length of L falls within a set range; S4, perform optical character recognition on text line L and calculate the cost function value of L; S5, invert text line L to obtain text line L¹; S6, perform optical character recognition on text line L¹ and calculate the cost function value of L¹; S7, judge from the cost function values of L and L¹ whether text line L is inverted, and if it is, perform inversion correction. One main purpose of the invention is to handle the inversion that may remain in a text image after skew correction; the method is highly accurate and easy to put into practice.

Description

Content-based method for judging and correcting inversion of text images

Technical field

The present invention relates to the technical field of text extraction, and more particularly to a method for automatically judging and correcting the inversion of a text image.

Background technique

Images are among the most important information carriers of the information age. Within image processing, the preprocessing and analysis of text images is applied in industries such as educational electronics and secretarial office work, where it plays an important role.

In text image preprocessing, a captured text image is generally skewed. Subsequent analysis, however, such as OCR (Optical Character Recognition) of the text image, usually requires the characters to be regular and upright. Skew correction is therefore an important part of text image preprocessing before OCR, and it underpins layout analysis steps such as line segmentation and character location as well as the accuracy of the final OCR.

At present, when the true skew angle of a captured text image is calculated and that angle is large, the calculated angle may be the supplementary angle of the true one. Skew correction based on this result yields an image that is the inverted version of the true text image.

Therefore, before layout analysis and OCR are performed on a skew-corrected text image, the inversion of the text image must first be judged and corrected. The present invention is proposed in view of this.

Summary of the invention

The object of the present invention is to propose a content-based method for judging and correcting text image inversion, for text images in which the text may be upside down. The method makes full use of linguistic features of the text, such as characters, words, and phrases, to judge whether the text image is inverted, and corrects inverted text images.

To achieve this object, the technical solution of the invention is a content-based method for judging and correcting text image inversion which, given a text image that has already undergone skew correction, comprises the following steps:

S1: perform line segmentation on the text image;

S2: reject non-text rows and retain text lines;

S3: take any text line L such that the length of L falls within the set range;

S4: perform OCR on text line L and calculate the cost function value of text line L;

S5: invert text line L to obtain text line L¹;

S6: perform OCR on text line L¹ and calculate the cost function value of text line L¹;

S7: judge from the cost function values of text line L and text line L¹ whether text line L is inverted; if it is, perform inversion correction.

Compared with the prior art, the beneficial effects of the present invention are as follows:

(1) A search of the existing literature found no material on judging and correcting the inversion of text images, although the problem is frequently encountered in practice. The invention therefore addresses the inversion that may remain after skew correction of a text image, and is both novel and practical.

(2) The invention makes full use of the linguistic features of different languages that become available after OCR. If the text image is upright, the processed text line is upright, and the words or characters obtained by OCR are correct, that is, they occur with high probability in everyday use. Correspondingly, a text line taken from the inverted image is upside down, and its OCR result is mostly garbled words or incorrect characters. Figs. 2 to 13 give several examples: upright and inverted text lines, together with their OCR results, for all-English text, all-Chinese text, and mixed English-Chinese text. The cost function value of an upright text line in the upright image therefore differs markedly from that of the inverted text line in the inverted image, and the correct upright image can be obtained by a simple comparison.

(3) The invention compares the upright image with the inverted image instead of judging a single image in isolation. This avoids an absolute threshold and makes the method robust. For example, the Chinese characters or English words in the processed text line of some upright image may happen to have low probabilities of occurrence in ordinary documents; a judgment based on an absolute threshold could then go wrong. When the upright and inverted images are compared, however, the cost function value of the upright image is still very likely to exceed that of the inverted image, so the correct result is still obtained.

(4) The invention determines the length of the processed text line from the average height of the text lines in the image (the parameter mean_text_lines_height introduced later), balancing judgment accuracy against computation time. On the one hand, a longer text line contains more text, so the result tends toward the typical case, which benefits a probability-based decision and reduces misjudgment. On the other hand, an overly long text line takes longer to process, which is unfavorable in engineering applications. Accuracy and processing time must therefore be balanced. The invention reconciles the two with a preset number of characters: the longest text line is taken for which the processing time remains acceptable. Whether an extracted text line is long enough is measured against the average line height, which avoids an absolute threshold and suits different text images.

Detailed description of the invention

Fig. 1 is a flow chart of the method for judging and correcting text image inversion according to the invention;

Fig. 2 is a sample all-English text line to be processed, extracted from an upright image;

Fig. 3 is the text obtained by OCR of the text line of Fig. 2;

Fig. 4 is the sample text line to be processed, extracted from the inverted image of Fig. 2;

Fig. 5 is the text obtained by OCR of the text line of Fig. 4;

Fig. 6 is a sample all-Chinese text line to be processed, extracted from an upright image;

Fig. 7 is the text obtained by OCR of the text line of Fig. 6;

Fig. 8 is the sample text line to be processed, extracted from the inverted image of Fig. 6;

Fig. 9 is the text obtained by OCR of the text line of Fig. 8;

Fig. 10 is a sample mixed English-Chinese text line to be processed, extracted from an upright image;

Fig. 11 is the text obtained by OCR of the text line of Fig. 10;

Fig. 12 is the sample text line to be processed, extracted from the inverted image of Fig. 10;

Fig. 13 is the text obtained by OCR of the text line of Fig. 12;

Fig. 14 is an image after skew-correction preprocessing;

Fig. 15 is the image obtained from Fig. 14 after binarization and small-scale connection of foreground runs in the left-right or up-down direction;

Fig. 16 is the foreground projection obtained by counting the foreground pixels in each horizontal row of Fig. 15;

Fig. 17 shows the statistical probabilities of the 128 most frequent Chinese characters in the student homework question bank compiled for the invention;

Fig. 18 lists the 128 most frequent Chinese characters in the student homework question bank compiled for the invention.

Specific embodiment

The technical solution in the embodiments of the invention is described clearly and completely below with reference to Fig. 1. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention. The invention can be implemented in many different ways as defined and covered by the claims.

It should be noted that many specific details are given in the following description to aid understanding. The invention can evidently be practiced without these details.

It should be noted that, in the absence of explicit limitation or conflict, the embodiments of the invention and the technical features in them can be combined with one another to form technical solutions.

It should be noted that the invention is discussed using English and Chinese as examples, but it can also be applied to text in other languages, whether written in alphabetic or ideographic scripts.

First, it should be explained that the text image processed by the invention has already undergone skew correction; alternatively, the text image is upright to begin with and needs no skew correction, in which case the invention is applied to it directly.

S1: perform line segmentation on the text image;

Line segmentation of a text image is prior art and can be implemented with existing methods. The invention provides one preferred technical solution:

The text image is first binarized; for convenience, background pixels take the value 0 and foreground pixels the value 1. Run-length connection of foreground pixels is then applied to the binarized image: background pixels lying between foreground pixels that are adjacent, or closer than a certain threshold, are changed to foreground pixels, producing a binarized run-length connection image. As illustrated, Fig. 14 is binarized and run-length connected to produce the binarized run-length connection image shown in Fig. 15.
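The run-length connection described above can be sketched as follows. This is only a minimal illustration in Python, assuming the binarized image is a list of rows of 0/1 values and that only horizontal runs are connected; the names connect_runs, connect_runs_image, and max_gap are hypothetical and do not come from the patent.

```python
def connect_runs(row, max_gap):
    """Fill background gaps of at most max_gap pixels lying between two foreground pixels."""
    out = list(row)
    last_fg = None  # index of the most recent foreground pixel seen so far
    for x, value in enumerate(row):
        if value == 1:
            gap = x - last_fg - 1 if last_fg is not None else 0
            if 0 < gap <= max_gap:
                # Turn the short background run into foreground.
                for k in range(last_fg + 1, x):
                    out[k] = 1
            last_fg = x
    return out


def connect_runs_image(img, max_gap):
    """Apply horizontal run-length connection to every row of a binary image."""
    return [connect_runs(row, max_gap) for row in img]
```

With max_gap set to roughly the spacing between characters, the characters of one text line merge into a solid band, which makes the later row projection easier to threshold.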

For the binarized run-length connection image, the number of foreground pixels in each horizontal row is counted. In Fig. 16, the abscissa is the foreground pixel count and the ordinate is the row index of the binarized run-length connection image. A row whose foreground pixel count is below the set projection threshold is regarded as a blank pixel row; otherwise it is a non-blank pixel row. The projection threshold is set according to actual conditions. In this way the text image corresponding to the binarized run-length connection image is divided into blank pixel row regions and non-blank pixel row regions. Counting the number of consecutive non-blank pixel rows gives the height of each non-blank row. At this stage a non-blank row may correspond to a line of characters, a formula, or a figure.
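The projection and splitting step can be sketched as follows. This is a hedged Python illustration under the same assumption of a list-of-rows binary image; row_projection, non_blank_bands, and proj_threshold are hypothetical names chosen for the example.

```python
def row_projection(img):
    """Count the foreground pixels (value 1) in each horizontal pixel row."""
    return [sum(row) for row in img]


def non_blank_bands(img, proj_threshold):
    """Return (top, height) for each maximal group of consecutive non-blank pixel rows."""
    bands = []
    start = None  # top pixel row of the band currently being scanned
    for y, count in enumerate(row_projection(img)):
        if count >= proj_threshold:      # non-blank pixel row
            if start is None:
                start = y
        elif start is not None:          # blank pixel row closes the band
            bands.append((start, y - start))
            start = None
    if start is not None:                # band that reaches the bottom edge
        bands.append((start, len(img) - start))
    return bands
```

Each returned height corresponds to one value text_lines_height(i) used in the next step.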

S2: reject non-text rows and retain text lines;

The heights text_lines_height(i) of all non-blank rows are averaged to obtain the average text line height mean_text_lines_height, where i is the index of the non-blank row.

With mean_text_lines_height as the threshold, the height text_lines_height(i) of each non-blank row is checked. There are three cases:

Case 1: if text_lines_height(i) >> mean_text_lines_height, the row is regarded as a non-text row; it may be a figure, table, formula, or the like;

Case 2: if text_lines_height(i) << mean_text_lines_height, the row is regarded as a non-text row; it may be a horizontal rule or the like;

Case 3: if neither case 1 nor case 2 applies, the row is regarded as a text line.

Only case 3, that is, only the text lines, is carried forward to subsequent processing; non-text parts such as figures, tables, and horizontal rules are not used. How to evaluate the "much greater than" operation of case 1 and the "much less than" operation of case 2 is common knowledge to those skilled in the art and can be customized to the actual situation.

For example, when text_lines_height(i) >= 3*mean_text_lines_height, it can be considered that text_lines_height(i) >> mean_text_lines_height.
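The three cases can be sketched as a simple height filter. In this Python illustration the factor 3 for "much greater than" follows the example in the text, while the factor 1/3 for "much less than" is purely an assumption made for the sketch; the function name keep_text_rows is hypothetical.

```python
def keep_text_rows(heights, big_factor=3.0, small_factor=1.0 / 3.0):
    """Return (mean height, indices of rows regarded as text lines)."""
    mean_h = sum(heights) / len(heights)  # mean_text_lines_height
    kept = []
    for i, h in enumerate(heights):       # h is text_lines_height(i)
        if h >= big_factor * mean_h:      # case 1: figure, table, formula
            continue
        if h <= small_factor * mean_h:    # case 2: horizontal rule (assumed factor)
            continue
        kept.append(i)                    # case 3: text line
    return mean_h, kept
```

For heights [20, 22, 18, 100, 2, 21] the mean is 30.5, so the row of height 100 is rejected as too tall and the row of height 2 as too short.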

S3: take any text line L such that the length of L falls within the set range;

A text line is chosen. If its length is greater than N1 times mean_text_lines_height and less than N2 times mean_text_lines_height, it is taken as text line L for subsequent processing. The values of the natural numbers N1 and N2 are obtained empirically, for example N1 ∈ [4, 8] and N2 ∈ [9, 16]. If the chosen text line is shorter than N1 times mean_text_lines_height, it is supplemented until it meets the requirement; if it is longer than N2 times mean_text_lines_height, it is truncated until it meets the requirement. For example, in Fig. 14 of the invention, the text line "10. The experiment of 'the figure below explores the characteristics of plane mirror imaging'" is too long and is truncated to "the figure below explores plane mirror imaging charac", cut off mid-phrase, as shown in Fig. 4.

The purpose of this is to ensure that the text line contains neither too little text, which would make the subsequent probability comparison unreliable, nor too much, which would make processing too slow. In our experience, the best length for the text line is about 10 times the text line height (mean_text_lines_height), that is, the width of about ten Chinese characters or twenty English letters. At this length the cost function values of the upright and inverted text lines already differ strongly, and the processing time is generally acceptable. In addition, an English word is conventionally taken to contain 4 letters on average; allowing for the spaces between words, 20 English letters generally form 3 or 4 English words, which gives good reliability and validity in the comparison.
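The length check in S3 can be sketched as a small planning function. This Python illustration assumes widths are measured in pixels; the defaults n1=8 and n2=12 are merely values inside the ranges quoted in the text, and the function name and the returned action labels are hypothetical. How a short line is supplemented is not specified in the text, so the sketch only reports the required width.

```python
def plan_line_width(line_width, mean_h, n1=8, n2=12):
    """Decide whether a candidate text line is kept, extended, or truncated.

    Returns (action, target_width) where action is "keep", "extend", or "truncate".
    """
    lower = n1 * mean_h   # shortest acceptable width
    upper = n2 * mean_h   # longest acceptable width
    if line_width < lower:
        return ("extend", lower)
    if line_width > upper:
        return ("truncate", upper)
    return ("keep", line_width)
```

With mean_h = 20, a line 400 pixels wide would be truncated to 240 pixels, which is close to the "about 10 times the line height" suggested by the text.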

S4: perform OCR on text line L and calculate the cost function value of text line L;

OCR is performed on text line L. For example, OCR of Fig. 6 of the invention gives the result shown in Fig. 7, which is entirely correct. The invention does not use punctuation marks in subsequent processing, so the recognition results for punctuation are ignored.

There are many possible schemes for calculating the cost function value of text line L. One embodiment provided by the invention works from the OCR result: the cost function value of each character or character combination (word) is calculated, and the values for all characters or character combinations are summed to give the cost function value of text line L. The cost function value of a character or character combination is positively related to its probability of use: the higher the probability, the larger the value. The cost function value of a character combination can additionally be positively related to its length: the longer the combination, the larger the value. For example, in Chinese, based on frequency statistics for common characters such as those of the People's Daily, the cost function value of a common character is set to 5, that of an uncommon character to 2, and that of a character absent from the character set to 0. In English, the cost function value is set according to how likely a word is to occur in daily life: a word present in the dictionary has a high probability of occurrence and is given 5, while a word that is absent is given 0. Also in English, the cost function value is increased according to word length: the word "absolutely" has length 10, so its value is increased by 10, whereas the word "he" has length 2, so its value is increased by only 2. From the above it follows that, in general, the cost function value of an upright text line is greater than that of the inverted text line.

Specifically, the invention provides a preferred method for calculating the cost function value of text line L, as described below.

Each character recognized by OCR is judged as to whether it is best processed at word granularity or at character granularity. For example, English characters are suited to processing at the granularity of words (rather than letters), and Chinese characters to processing at the granularity of single characters (rather than phrases). Based on this judgment, text line L is divided into parts to be processed at character granularity and parts to be processed at word granularity. How many parts text line L is divided into depends on its content.

The granularity judgment is preferably made from the code regions into which the character codes output by OCR fall, which distinguish whether a recognized character is, for example, English or Chinese and so determine the granularity at which it is processed. The reason is that English words are very strongly constrained. An upright text image yields correct English words, whereas the wrongly inverted image usually yields baffling combinations of English letters and digits. For example, Fig. 3 is the OCR result for Fig. 2 and contains correct English words; Fig. 4 is the text image obtained by inverting Fig. 2, and Fig. 5, the OCR result for Fig. 4, consists entirely of a jumble of digits and English letters. Finding correct English words in the OCR result therefore strongly supports the distinction between the upright and the inverted image. English text can be processed at word granularity.
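The granularity judgment by code region can be sketched as follows. This Python illustration assumes the OCR output is a Unicode string, treats ASCII letters as English (word granularity) and the CJK Unified Ideographs block U+4E00 to U+9FFF as Chinese (character granularity), and drops digits, spaces, and punctuation; these choices and the name tokenize are assumptions of the sketch, not details from the patent.

```python
def tokenize(text):
    """Split OCR output into ("word", run_of_letters) and ("char", chinese_char) tokens."""
    tokens = []
    run = []  # ASCII letters of the English word currently being collected
    for ch in text:
        if ch.isascii() and ch.isalpha():
            run.append(ch)
            continue
        if run:                               # any other character ends the word
            tokens.append(("word", "".join(run)))
            run = []
        if "\u4e00" <= ch <= "\u9fff":        # CJK Unified Ideographs block
            tokens.append(("char", ch))
        # digits, spaces, and punctuation are ignored
    if run:
        tokens.append(("word", "".join(run)))
    return tokens
```

For a mixed line such as "the 平面 mirror, 1", the sketch yields one word token per English word and one character token per Chinese character.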

Chinese, by contrast, is processed essentially at the level of individual characters, and processing at phrase level is not required. This is because the characters that OCR produces from a wrongly inverted image very probably include uncommon Chinese characters. For example, Fig. 7 is the OCR result for Fig. 6, and the recognized characters are all fairly common; Fig. 8 is the text image obtained by inverting Fig. 6, and Fig. 9, the OCR result for Fig. 8, contains several uncommon Chinese characters. Character-level processing is therefore already effective for Chinese. Of course, processing Chinese at word level as well can further improve the accuracy of the judgment.

Based on the above judgment, the cost function value of each part of text line L is calculated, and the values of all parts are summed to give the cost function value of text line L.

For any part of text line L processed at word granularity (rather than letter granularity), it is checked whether groups of characters combine into legitimate words or word fragments. The cost function value of each legitimate word or word fragment is calculated, and all these values are summed to give the cost function value of the part.

Legitimate words or word fragments can be found by string matching. Common words can be exported from a dictionary to form a common-word lexicon. Taking English as an example, in the inventors' practical engineering experience a lexicon of 10,000 English words gives a very high success rate, and common words are all found. If OCR recognizes letters with high accuracy, exact string matching by hash value can be used to reduce the matching cost. Specifically, a legitimate word is a correctly spelled word, such as "other" in English; a word fragment is a part of a legitimate word, such as "er", which is part of legitimate words like "other" and "teacher".
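The matching of legitimate words and word fragments can be sketched as follows. This Python illustration uses a built-in set, which is hash-based, as a stand-in for the exact hash-value matching mentioned in the text, and a plain substring scan for fragments; the tiny lexicon and the function names are hypothetical.

```python
LEXICON = {"other", "teacher", "the", "mirror"}  # stand-in for a 10,000-word lexicon


def is_legit_word(token, lexicon=LEXICON):
    """Exact match against the lexicon; set membership is a hash lookup."""
    return token.lower() in lexicon


def is_word_fragment(token, lexicon=LEXICON):
    """True if the token occurs as a contiguous part of some lexicon word."""
    t = token.lower()
    return bool(t) and any(t in word for word in lexicon)
```

The linear scan in is_word_fragment is only acceptable for a sketch; a real lexicon would probably call for an index such as a suffix structure.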

The cost function value of a legitimate word or word fragment can be calculated as the sum of a sub-cost for the word's probability of occurrence and a sub-cost for its length. The higher the probability of occurrence, the larger the probability sub-cost; the longer the word, the larger the length sub-cost. The invention divides the probability sub-cost into 5 levels, assigned 1 to 5 points, with higher probabilities of occurrence receiving larger values. For the length contribution, the length of the word is used directly as the length sub-cost.
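The two sub-costs can be combined as in the following sketch. This Python illustration assumes a lookup table that maps each lexicon word to its probability grade from 1 to 5; the table contents are invented for the example, and giving 0 to words outside the table is an assumption consistent with the simpler scheme described earlier.

```python
GRADES = {"the": 5, "mirror": 2}  # hypothetical probability grades, 1 (rare) to 5 (frequent)


def word_cost(word, grades=GRADES):
    """Probability sub-cost (grade 1..5) plus length sub-cost (number of letters)."""
    grade = grades.get(word.lower())
    if grade is None:          # not a legitimate word: contributes nothing
        return 0
    return grade + len(word)


def word_part_cost(words, grades=GRADES):
    """Cost of a word-granularity part: the sum over its words."""
    return sum(word_cost(w, grades) for w in words)
```

Here "the" scores 5 + 3 and "mirror" scores 2 + 6, while a garbled token such as "xqz" scores 0.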

For any part of text line L processed at character granularity, each character recognized by OCR is examined, the cost function value of each character is calculated, and all the values are summed to give the cost function value of the part. The more probable a character is in everyday use, the larger its cost function value.

Taking Chinese as an example, the inventors compiled statistics on the Chinese characters used in a homework question bank for primary and secondary school students, 8149 characters in total. The statistical probabilities of the 128 most frequent characters are shown in Fig. 17, where the abscissa is the character index (each index corresponds to one of the 128 characters) and the ordinate is the probability of occurrence. The 128 characters are listed in Fig. 18. Following the distribution of the probabilities of the 8149 characters, the characters can be divided into five grades, the top 10, top 50, top 100, top 300, and top 1000, with cost function values 5, 4, 3, 2, and 1 respectively. Characters ranked outside the top 1000 are assigned 0. This gives the cost function value of each Chinese character in the OCR result, with more frequent characters receiving larger values.
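The five-grade assignment can be sketched as a rank lookup. This Python illustration assumes each character's 1-based frequency rank is available from a table (rank_of, hypothetical); the tier boundaries and values are those stated in the text.

```python
# (highest rank in tier, cost function value), as stated in the text
TIERS = [(10, 5), (50, 4), (100, 3), (300, 2), (1000, 1)]


def char_cost(rank):
    """Cost of one character given its 1-based frequency rank, or None if unknown."""
    if rank is None:
        return 0
    for limit, cost in TIERS:
        if rank <= limit:
            return cost
    return 0  # ranked outside the top 1000


def char_part_cost(text, rank_of):
    """Cost of a character-granularity part: the sum over its characters."""
    return sum(char_cost(rank_of.get(ch)) for ch in text)
```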

Evidently, the larger the cost function value of text line L, the more likely the text image is upright.

S5: invert text line L to obtain text line L¹;

Inverting a text line means rotating the corresponding image by 180 degrees. For example, inverting the text line shown in Fig. 10 gives the text line shown in Fig. 12.
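A 180-degree rotation amounts to reversing the order of the rows and the order of the pixels within each row. A minimal Python sketch, again assuming a list-of-rows image (the name rotate_180 is hypothetical):

```python
def rotate_180(img):
    """Rotate an image given as a list of rows by 180 degrees."""
    return [row[::-1] for row in img[::-1]]
```

Applying the function twice returns the original image, which is why a single rotation suffices to correct an inverted image.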

S6: perform OCR on text line L¹ and calculate the cost function value of text line L¹;

This step follows S4 and is not described again. From S4 it follows that, when text line L is upright, the cost function value of text line L¹ is usually very small.

S7: judge from the cost function values of text line L and text line L¹ whether text line L is inverted; if it is, perform inversion correction;

If the cost function value of text line L is greater than that of text line L¹, text line L is upright and no inversion is needed; otherwise text line L is regarded as inverted, and inversion correction is performed.
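The comparison in S7 can be sketched end to end as follows. In this Python illustration, ocr and cost are caller-supplied functions standing in for a real OCR engine and for the cost function of S4; they, and the names is_inverted and correct_image, are placeholders rather than part of the patent. Applying the correction to the whole image rather than only to the line is an assumption of the sketch.

```python
def rotate_180(img):
    """Rotate an image given as a list of rows by 180 degrees."""
    return [row[::-1] for row in img[::-1]]


def is_inverted(cost_l, cost_l_flipped):
    """Text line L is upright only if its cost is strictly greater than that of L¹."""
    return not cost_l > cost_l_flipped


def correct_image(image, line, ocr, cost):
    """Return the image, rotated by 180 degrees if text line L is judged inverted."""
    flipped_line = rotate_180(line)                  # S5: text line L¹
    cost_l = cost(ocr(line))                         # S4
    cost_l_flipped = cost(ocr(flipped_line))         # S6
    if is_inverted(cost_l, cost_l_flipped):          # S7
        return rotate_180(image)
    return image
```

Because the decision compares two costs computed the same way, no absolute threshold on either cost is needed.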

Claims

1. A content-based method for judging and correcting text image inversion, wherein the text image has already undergone skew correction, characterized by comprising the following steps:

S1: perform line segmentation on the text image;

S2: reject non-text rows and retain text lines;

S4: perform OCR on text line L and calculate the cost function value of text line L, where OCR refers to optical character recognition;

S5: invert text line L to obtain text line L¹;

S7: judge from the cost function values of text line L and text line L¹ whether text line L is inverted; if text line L is inverted, perform inversion correction;

The cost function value of a text line described above has the following property: for the same text line, the cost function value of the upright text line differs markedly from that of the inverted text line.

2. The content-based method for judging and correcting text image inversion according to claim 1, characterized in that line segmentation of the text image uses the following steps:

The text image is first binarized; run-length connection of foreground pixels is then applied to the binarized image to form a binarized run-length connection image;

For the binarized run-length connection image, the number of foreground pixels in each horizontal row is counted; a row whose foreground pixel count is below the set projection threshold is regarded as a blank pixel row, and otherwise as a non-blank pixel row; the projection threshold is set according to actual conditions;

The number of consecutive non-blank pixel rows is counted to obtain the height of each non-blank row.

3. The content-based method for judging and correcting text image inversion according to claim 2, characterized in that the process of rejecting non-text rows and retaining text lines is as follows:

The heights text_lines_height(i) of all non-blank rows are averaged to obtain the average text line height mean_text_lines_height, where i is the index of the non-blank row;

With mean_text_lines_height as the threshold, the height text_lines_height(i) of each non-blank row is checked and handled according to the following three cases:

Case 1: if text_lines_height(i) >> mean_text_lines_height, the row is regarded as a non-text row;

Case 2: if text_lines_height(i) << mean_text_lines_height, the row is regarded as a non-text row;

4. The content-based method for judging and correcting text image inversion according to claim 3, characterized in that, when any text line L is chosen, if the length of the chosen text line is greater than N1 times mean_text_lines_height and less than N2 times mean_text_lines_height, the line is taken as text line L, the values of the natural numbers N1 and N2 being obtained empirically.

5. The content-based method for judging and correcting text image inversion according to claim 4, characterized in that the cost function value of text line L is calculated using the following steps:

Each character recognized by OCR is judged as to whether it is suited to processing at word granularity or at character granularity; based on this judgment, text line L is divided into parts to be processed at character granularity and parts to be processed at word granularity;

Based on the above judgment, the cost function value of each part of text line L is calculated, and the cost function values of all parts are summed to give the cost function value of text line L;

Wherein, for any part of text line L processed at word granularity, it is checked whether groups of characters combine into legitimate words or word fragments; the cost function value of each legitimate word or word fragment is calculated, and the cost function values of all legitimate words or word fragments are summed to give the cost function value of the part; the cost function value of a legitimate word or word fragment is the sum of a sub-cost for the word's probability of occurrence and a sub-cost for the word's length; the higher the probability of occurrence, the larger the probability sub-cost; the longer the word, the larger the length sub-cost;

Further, for any part of text line L processed at character granularity, each character recognized by OCR is examined, the cost function value of each character is calculated, and the cost function values of all characters are summed to give the cost function value of the part; the more probable a character is in everyday use, the larger its cost function value.

6. The content-based method for judging and correcting text image inversion according to claim 5, characterized in that legitimate words or word fragments are found by string matching.

7. The content-based method for judging and correcting text image inversion according to claim 5, characterized in that the granularity judgment is preferably made from the code regions into which the character codes output by OCR fall, which distinguish the recognized characters and then determine the granularity at which the characters are processed.