CN103679165A

CN103679165A - OCR (optical character recognition) character recognition method and system

Info

Publication number: CN103679165A
Application number: CN201310752624.4A
Authority: CN
Inventors: 王海峰; 和为
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2013-12-31
Filing date: 2013-12-31
Publication date: 2014-03-26
Anticipated expiration: 2033-12-31
Also published as: CN103679165B

Abstract

The invention provides an OCR (optical character recognition) character recognition method. The method comprises the following steps: performing OCR character recognition on an image in a target area selected by a user to obtain a recognized word string; counting the sub-word strings in the recognized word string; when the number of sub-word strings is greater than 2, judging whether the number of characters in the first sub-word string W_1 and the number of characters in the Kth sub-word string W_K are smaller than a preset value; if the number of characters in W_1 and/or W_K is smaller than the preset value, judging whether the noise probability score of W_1 and/or W_K is greater than a preset noise threshold; and if so, determining that W_1 and/or W_K is noise and deleting it from the word string to obtain a new word string. According to the embodiments, the accuracy of translating OCR recognition results can be improved. The invention also provides an OCR character recognition system.

Description

OCR character recognition method and system

Technical field

The present invention relates to the field of character recognition technology, and in particular to an OCR character recognition method and system.

Background art

At present, many translation apps support photo translation. A typical workflow is as follows: the user points a mobile terminal (such as a smartphone) at the foreign-language text to be translated and takes a photo, and the photo is covered with a gray overlay; the user slides a finger across the overlaid photo to "wipe" out the words to be translated; OCR is performed on the wiped region to obtain the foreign-language text; a machine translation module is then called to translate the OCR result, which is finally presented to the user.

The whole process is shown in Fig. 1. This process has a problem, however: because the finger blocks the screen while the user is wiping a word, words to the left or right of it, or on adjacent lines, are often wiped into the OCR region as well. In Fig. 1, the user intends to translate the word "Obama", but in practice a few extra letters are selected on each side, so the OCR result is "it Obama I" and the final machine translation is "Obama, I". Such a result confuses the user and degrades the user experience.

Summary of the invention

The present invention aims to solve at least one of the technical deficiencies described above.

To this end, one object of the present invention is to propose an OCR character recognition method that improves the accuracy of translating OCR recognition results.

Another object of the present invention is to propose an OCR character recognition system.

To achieve the above objects, an embodiment of the first aspect of the present invention discloses an OCR character recognition method comprising the following steps: performing OCR character recognition on an image in a target area selected by a user to obtain a recognized word string, wherein the word string comprises K sub-word strings, each sub-word string comprises at least 1 character, and K is a positive integer; counting the sub-word strings in the recognized word string; if the number of sub-word strings in the word string is greater than 2, judging whether the number of characters in the 1st sub-word string W_1 and the number of characters in the Kth sub-word string W_K are less than a preset value; if the number of characters in W_1 and/or W_K is less than the preset value, judging whether the noise probability score of W_1 and/or W_K is greater than a preset noise threshold; and if so, determining that W_1 and/or W_K is noise and deleting W_1 and/or W_K from the word string to obtain a new word string.

According to the OCR character recognition method of the embodiment of the present invention, noise reduction is applied to the OCR recognition result before translation, so that OCR noise, which is typically introduced by user misoperation, can be identified and deleted. Denoising thus cleans up the translation input, makes the translation result more accurate, and improves the user experience.

In addition, the OCR character recognition method according to the above embodiment of the present invention may have the following additional technical features:

In some examples, the method further comprises: if the number of sub-word strings in the word string equals 2, judging whether the number of characters in W_1 is less than the number of characters in W_K; if the number of characters in W_1 is less than the number of characters in W_K, further judging whether the number of characters in W_1 is less than the preset value; if the number of characters in W_1 is less than the preset value, further judging whether the noise probability score of W_1 is greater than the preset noise threshold; and if so, determining that W_1 is noise and deleting W_1 from the word string to obtain a new word string.

In some examples, the method further comprises: if the number of characters in W_1 is greater than the number of characters in W_K, further judging whether the number of characters in W_K is less than the preset value; if the number of characters in W_K is less than the preset value, further judging whether the noise probability score of W_K is greater than the preset noise threshold; and if so, determining that W_K is noise and deleting W_K from the word string to obtain a new word string.

In some examples, the noise probability score is obtained by the following formulas:

P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1),

P_{right} = \alpha \log p(W_K) + \beta \log p(W_K \mid W_{K-1}).

In some examples, the method further comprises: performing OCR translation on the new word string.

An embodiment of the second aspect of the present invention provides an OCR character recognition system, comprising: an identification module, configured to perform OCR character recognition on an image in a target area selected by a user to obtain a recognized word string, wherein the word string comprises K sub-word strings, each sub-word string comprises at least 1 character, and K is a positive integer; a computing module, configured to count the sub-word strings in the recognized word string; and a denoising module, configured to, when the number of sub-word strings in the word string is greater than 2, judge whether the number of characters in the 1st sub-word string W_1 and the number of characters in the Kth sub-word string W_K are less than a preset value, and if so, judge whether the noise probability score of W_1 and/or W_K is greater than a preset noise threshold, and if it is, determine that W_1 and/or W_K is noise and delete W_1 and/or W_K from the word string to obtain a new word string.

According to the OCR character recognition system of the embodiment of the present invention, noise reduction is applied to the OCR recognition result before translation, so that OCR noise, which is typically introduced by user misoperation, can be identified and deleted. Denoising thus cleans up the translation input, makes the translation result more accurate, and improves the user experience.

In some examples, the denoising module is further configured to: if the number of sub-word strings in the word string equals 2, judge whether the number of characters in W_1 is less than the number of characters in W_K; if the number of characters in W_1 is less than the number of characters in W_K, further judge whether the number of characters in W_1 is less than the preset value; if the number of characters in W_1 is less than the preset value, further judge whether the noise probability score of W_1 is greater than the preset noise threshold; and if so, determine that W_1 is noise and delete W_1 from the word string to obtain a new word string.

In some examples, the denoising module is further configured to: if the number of characters in W_1 is greater than the number of characters in W_K, further judge whether the number of characters in W_K is less than the preset value; if the number of characters in W_K is less than the preset value, further judge whether the noise probability score of W_K is greater than the preset noise threshold; and if so, determine that W_K is noise and delete W_K from the word string to obtain a new word string.

In some examples, the noise probability score is obtained by the following formulas:

P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1),

P_{right} = \alpha \log p(W_K) + \beta \log p(W_K \mid W_{K-1}).

In some examples, the system further comprises: a translation module, configured to perform OCR translation on the new word string.

Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a schematic diagram of an OCR recognition and translation interface;

Fig. 2 is a flowchart of an OCR character recognition method according to an embodiment of the present invention;

Fig. 3 is a flowchart of an OCR character recognition method according to another embodiment of the present invention; and

Fig. 4 is a structural diagram of an OCR character recognition system according to an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.

In the description of the present invention, it should be understood that orientation or position terms such as "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer" are based on the orientations or positions shown in the drawings. They are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the present invention.

In the description of the present invention, it should be noted that, unless otherwise specified and limited, the terms "mounted", "connected" and "coupled" are to be interpreted broadly. For example, a connection may be mechanical or electrical, may be internal between two elements, and may be direct or indirect through an intermediary. A person of ordinary skill in the art can understand the specific meaning of these terms according to the circumstances.

The OCR character recognition method and system according to embodiments of the present invention are described below with reference to the drawings.

Fig. 2 is a flowchart of an OCR character recognition method according to an embodiment of the present invention.

As shown in Fig. 2, the OCR character recognition method according to an embodiment of the present invention comprises the following steps:

Step S201: Perform OCR character recognition on the image in the target area selected by the user to obtain a recognized word string, wherein the word string comprises K sub-word strings, each sub-word string comprises at least 1 character, and K is a positive integer.

Step S202: Count the sub-word strings in the recognized word string.

Step S203: If the number of sub-word strings in the word string is greater than 2, judge whether the number of characters in the 1st sub-word string W_1 and the number of characters in the Kth sub-word string W_K are less than a preset value.

Step S204: If the number of characters in W_1 and/or W_K is less than the preset value, judge whether the noise probability score of W_1 and/or W_K is greater than a preset noise threshold.

Step S205: If so, determine that W_1 and/or W_K is noise and delete W_1 and/or W_K from the word string to obtain a new word string.
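Steps S203 to S205 can be sketched in Python as follows. This is only an illustrative sketch: the function name `denoise_ends`, the `Scorer` callable type and the parameter names are assumptions that do not appear in the patent, and the scoring function is passed in rather than implemented. The default values 3 and 10.5 are the example preset value and example threshold given later in the description.

```python
from typing import Callable, List

# Hypothetical scorer type: given the word list and an index, return the
# noise probability score of the word at that index.
Scorer = Callable[[List[str], int], float]


def denoise_ends(words: List[str], noise_score: Scorer,
                 preset_len: int = 3, threshold: float = 10.5) -> List[str]:
    """Apply steps S203 to S205 to a word string with more than 2 sub-word strings."""
    if len(words) <= 2:
        return list(words)  # this sketch covers only the K > 2 case
    to_delete = set()
    for idx in (0, len(words) - 1):              # W_1 and W_K
        if len(words[idx]) < preset_len:         # S203: short enough to be suspect
            if noise_score(words, idx) > threshold:  # S204: score above threshold
                to_delete.add(idx)               # S205: judged to be noise
    return [w for i, w in enumerate(words) if i not in to_delete]
```

With a stub scorer that rates "it" and "I" above the threshold, the word string "it Obama I" from Fig. 1 is reduced to "Obama".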

In one embodiment of the present invention, the OCR character recognition method further comprises the following steps:

1. If the number of sub-word strings in the word string equals 2, judge whether the number of characters in W_1 is less than the number of characters in W_K.

2. If the number of characters in W_1 is less than the number of characters in W_K, further judge whether the number of characters in W_1 is less than the preset value.

3. If the number of characters in W_1 is less than the preset value, further judge whether the noise probability score of W_1 is greater than the preset noise threshold.

4. If so, determine that W_1 is noise and delete W_1 from the word string to obtain a new word string.

Further, the method also comprises:

1. If the number of characters in W_1 is greater than the number of characters in W_K, further judge whether the number of characters in W_K is less than the preset value.

2. If the number of characters in W_K is less than the preset value, further judge whether the noise probability score of W_K is greater than the preset noise threshold.

3. If so, determine that W_K is noise and delete W_K from the word string to obtain a new word string.
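The two-word case described in the two lists above can be sketched as follows. The names `pick_candidate_for_pair` and `denoise_pair` and the scorer callable are hypothetical. The lists only cover the cases where W_1 is shorter or longer than W_K, so this sketch leaves the word string unchanged when both words have the same length; that tie handling is an assumption of the sketch.

```python
from typing import Callable, List, Optional


def pick_candidate_for_pair(words: List[str]) -> Optional[int]:
    """Return the index of the shorter of two words, or None on a tie."""
    if len(words) != 2:
        raise ValueError("this sketch handles exactly two sub-word strings")
    first, last = words
    if len(first) < len(last):
        return 0      # W_1 is the noise candidate
    if len(first) > len(last):
        return 1      # W_K is the noise candidate
    return None       # equal lengths: not covered by the two lists


def denoise_pair(words: List[str],
                 noise_score: Callable[[List[str], int], float],
                 preset_len: int = 3, threshold: float = 10.5) -> List[str]:
    """Delete the shorter word if it is short and scores above the threshold."""
    idx = pick_candidate_for_pair(words)
    if idx is None:
        return list(words)
    if len(words[idx]) < preset_len and noise_score(words, idx) > threshold:
        return [w for i, w in enumerate(words) if i != idx]
    return list(words)
```

Only one of the two words is ever examined, which keeps the method from deleting both words of a two-word selection.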

In one embodiment of the present invention, the noise probability score is obtained by the following formulas:

P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1),

P_{right} = \alpha \log p(W_K) + \beta \log p(W_K \mid W_{K-1}).

After the new word string is obtained, the OCR character recognition method of the embodiment of the present invention further comprises: performing OCR translation on the new word string.

As a concrete example, suppose that in OCR translation the OCR recognition result (i.e. the recognized word string) is a word string W^K containing K words: W_1 W_2 W_3 W_4 ... W_{K-2} W_{K-1} W_K. In W^K, W_1 and W_K may be noise introduced by user misoperation. In general, the noise at each end is no longer than one word. Denoising the OCR recognition result therefore amounts to computing the noise probability scores of W_1 and W_K separately; if a score is greater than a certain threshold (i.e. the preset noise threshold in the example above), W_1 and/or W_K is judged to be noise.

As shown in Fig. 3, the concrete steps for determining whether a word is noise comprise:

Step S301: Start; input W^K = W_1 ... W_K.

Step S302: Judge whether K equals 1; if so, perform step S303, otherwise perform step S304.

Step S303: Return W_1.

Step S304: Judge whether K equals 2; if so, perform step S305, otherwise perform step S308.

Step S305: Judge whether the number of characters in W_1 is less than the number of characters in W_2 (i.e. W_K, since K equals 2), that is, whether len(W_1) < len(W_2); if so, perform step S306, otherwise perform step S307.

Step S306: Let T = {W_1}, where T denotes a set containing the single sub-word string W_1.

Step S307: Let T = {W_K}, where T denotes a set containing the single sub-word string W_K.

Step S308: Let T = {W_1, W_K}, where T denotes a set containing the sub-word strings W_1 and W_K. In the example of Fig. 1, T = {it, I}.

Step S309: Delete from the set T every word whose character length (i.e. number of characters) is greater than the preset value. Because the English words that actually need translating generally contain more than 3 letters, the preset value may be set to, but is not limited to, 3.

Step S310: For each word in the set T, compute the noise probability score NoisyScore(); if the score is greater than the threshold θ (i.e. the preset noise threshold), the sub-word string is considered noise.

Step S311: End.
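The flow of steps S301 to S311 can be sketched as follows. The function names `select_candidates` and `fig3_denoise`, and the `noisy_score` callable passed in as a parameter, are hypothetical names chosen for illustration. Following step S309 as written, candidates are dropped only when they are longer than the preset value, so a word of exactly 3 characters remains a candidate here.

```python
from typing import Callable, List


def select_candidates(words: List[str]) -> List[int]:
    """Steps S302 to S308: choose the indices that form the set T."""
    k = len(words)
    if k <= 1:
        return []                      # S303: a single word is returned as is
    if k == 2:
        # S305 to S307: the shorter word is the candidate; otherwise W_K
        return [0] if len(words[0]) < len(words[1]) else [1]
    return [0, k - 1]                  # S308: T = {W_1, W_K}


def fig3_denoise(words: List[str],
                 noisy_score: Callable[[List[str], int], float],
                 preset_len: int = 3, theta: float = 10.5) -> List[str]:
    """Return the word string with noise words at the ends removed."""
    candidates = select_candidates(words)
    # S309: remove from T every word longer than the preset value
    candidates = [i for i in candidates if len(words[i]) <= preset_len]
    # S310: a remaining candidate is noise if its score exceeds theta
    noise = {i for i in candidates if noisy_score(words, i) > theta}
    return [w for i, w in enumerate(words) if i not in noise]
```

For the Fig. 1 example, `select_candidates(["it", "Obama", "I"])` yields the indices of "it" and "I", matching T = {it, I}.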

In the above example, the noise probability score NoisyScore() may be computed with a method similar to a statistical language model: for the leftmost word (i.e. W_1), compute P_left; for the rightmost word (i.e. W_K), compute P_right. The concrete formulas are:

P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1);

P_{right} = \alpha \log p(W_K) + \beta \log p(W_K \mid W_{K-1}).

Here p(w_i | w_{i-1}) denotes the probability of the bigram w_{i-1} w_i, estimated as:

p(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1} w_i)}{\sum_{w_i'} \mathrm{count}(w_{i-1} w_i')}

and p(w_i) denotes the probability of the unigram w_i, estimated as:

p(w_i) = \frac{\mathrm{count}(w_i)}{\sum_{w_i'} \mathrm{count}(w_i')}

Here, α and β are the weights of the unigram and bigram terms; they may take, but are not limited to, the values -1 and -0.5, respectively.

Based on experimental statistics, the threshold θ (i.e. the preset noise threshold) for the noise probability score NoisyScore() may be set to 10.5.
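The score computation can be sketched with simple count-based estimates. Several details below are assumptions, since the description does not specify them: the class name `NgramModel`, the use of the natural logarithm, and the probability floor of 1e-9 for unseen words and bigrams (the description does not say how zero counts are handled). The weights -1 and -0.5 are the example values given above.

```python
import math
from collections import Counter
from typing import List, Sequence


class NgramModel:
    """Unigram and bigram counts collected from a tokenized corpus."""

    def __init__(self, sentences: Sequence[Sequence[str]]):
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.context = Counter()   # how often a word starts a bigram
        for s in sentences:
            self.unigrams.update(s)
            for prev, cur in zip(s, s[1:]):
                self.bigrams[(prev, cur)] += 1
                self.context[prev] += 1
        self.total = sum(self.unigrams.values())

    def p_unigram(self, w: str, floor: float = 1e-9) -> float:
        if self.total == 0:
            return floor
        return max(self.unigrams[w] / self.total, floor)

    def p_bigram(self, prev: str, w: str, floor: float = 1e-9) -> float:
        denom = self.context[prev]
        if denom == 0:
            return floor           # assumed fallback for an unseen context
        return max(self.bigrams[(prev, w)] / denom, floor)


def noisy_score(model: NgramModel, words: List[str], idx: int,
                alpha: float = -1.0, beta: float = -0.5) -> float:
    """P_left for idx 0, P_right for the last index (natural log assumed)."""
    if len(words) < 2:
        raise ValueError("scoring needs at least two words")
    if idx == 0:
        return (alpha * math.log(model.p_unigram(words[0]))
                + beta * math.log(model.p_bigram(words[0], words[1])))
    if idx == len(words) - 1:
        return (alpha * math.log(model.p_unigram(words[-1]))
                + beta * math.log(model.p_bigram(words[-2], words[-1])))
    raise ValueError("only the first or last word can be scored")
```

Because α and β are negative, a rare end word, or one that rarely occurs next to its neighbor, receives a large positive score, which is why noise corresponds to scores above the threshold θ.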

Fig. 4 is a structural diagram of an OCR character recognition system according to an embodiment of the present invention. As shown in Fig. 4, the OCR character recognition system 400 according to an embodiment of the present invention comprises an identification module 410, a computing module 420 and a denoising module 430.

The identification module 410 is configured to perform OCR character recognition on the image in the target area selected by the user to obtain a recognized word string, wherein the word string comprises K sub-word strings, each sub-word string comprises at least 1 character, and K is a positive integer. The computing module 420 is configured to count the sub-word strings in the recognized word string. The denoising module 430 is configured to, when the number of sub-word strings in the word string is greater than 2, judge whether the number of characters in the 1st sub-word string W_1 and the number of characters in the Kth sub-word string W_K are less than a preset value; if so, judge whether the noise probability score of W_1 and/or W_K is greater than a preset noise threshold; and if it is, determine that W_1 and/or W_K is noise and delete W_1 and/or W_K from the word string to obtain a new word string.

In one embodiment of the present invention, the denoising module 430 is further configured to: if the number of sub-word strings in the word string equals 2, judge whether the number of characters in W_1 is less than the number of characters in W_K; if the number of characters in W_1 is less than the number of characters in W_K, further judge whether the number of characters in W_1 is less than the preset value; if the number of characters in W_1 is less than the preset value, further judge whether the noise probability score of W_1 is greater than the preset noise threshold; and if so, determine that W_1 is noise and delete W_1 from the word string to obtain a new word string.

Further, the denoising module 430 is also configured to: if the number of characters in W_1 is greater than the number of characters in W_K, further judge whether the number of characters in W_K is less than the preset value; if the number of characters in W_K is less than the preset value, further judge whether the noise probability score of W_K is greater than the preset noise threshold; and if so, determine that W_K is noise and delete W_K from the word string to obtain a new word string.

The noise probability score may be obtained by the following formulas:

P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1),

P_{right} = \alpha \log p(W_K) + \beta \log p(W_K \mid W_{K-1}).

The OCR character recognition system 400 of the embodiment of the present invention may also comprise a translation module (not shown), which is configured to perform OCR translation on the new word string.
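The module structure of Fig. 4 can be sketched as a small class that wires the modules together. Everything named here is an assumption made for illustration: the class `OcrCharacterRecognitionSystem`, the callables standing in for an OCR engine, a scoring function and a translation engine, and the whitespace split used to obtain sub-word strings. The denoising logic follows the Fig. 3 flow, which keeps candidates of up to the preset length.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class OcrCharacterRecognitionSystem:
    recognize: Callable[[bytes], str]                  # identification module 410 (stand-in)
    noisy_score: Callable[[List[str], int], float]     # scoring used by module 430
    translate: Optional[Callable[[str], str]] = None   # optional translation module
    preset_len: int = 3
    theta: float = 10.5

    def split(self, text: str) -> List[str]:
        """Computing module 420: split into sub-word strings so they can be counted."""
        return text.split()

    def denoise(self, words: List[str]) -> List[str]:
        """Denoising module 430: drop noisy words at the ends of the string."""
        k = len(words)
        if k <= 1:
            return list(words)
        if k == 2:
            candidates = [0] if len(words[0]) < len(words[1]) else [1]
        else:
            candidates = [0, k - 1]
        noise = {i for i in candidates
                 if len(words[i]) <= self.preset_len
                 and self.noisy_score(words, i) > self.theta}
        return [w for i, w in enumerate(words) if i not in noise]

    def run(self, image: bytes) -> str:
        words = self.split(self.recognize(image))
        cleaned = " ".join(self.denoise(words))
        return self.translate(cleaned) if self.translate else cleaned
```

Passing the engines in as callables keeps the sketch independent of any particular OCR or translation library.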

Specifically, with reference to Fig. 3, the processing procedure of the OCR character recognition system 400 of the embodiment of the present invention is as follows:

Suppose that in OCR translation the OCR recognition result (i.e. the recognized word string) is a word string W^K containing K words: W_1 W_2 W_3 W_4 ... W_{K-2} W_{K-1} W_K. In W^K, W_1 and W_K may be noise introduced by user misoperation. In general, the noise at each end is no longer than one word. Denoising the OCR recognition result therefore amounts to computing the noise probability scores of W_1 and W_K separately; if a score is greater than a certain threshold (i.e. the preset noise threshold in the example above), W_1 and/or W_K is judged to be noise.

The concrete processing procedure consists of steps S301 to S311 of Fig. 3, as described above for the method, and uses the same noise probability scores P_left and P_right together with the same unigram and bigram probability estimates.

In this specification, a description referring to the terms "an embodiment", "some embodiments", "an example", "a concrete example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Schematic uses of these terms in this specification do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principles and spirit of the present invention, the scope of which is defined by the claims and their equivalents.

Claims

1. An OCR character recognition method, characterized by comprising the following steps:

performing OCR character recognition on an image in a target area selected by a user to obtain a recognized word string, wherein the word string comprises K sub-word strings, each sub-word string comprises at least 1 character, and K is a positive integer;

counting the sub-word strings in the recognized word string;

if the number of sub-word strings in the word string is greater than 2, judging whether the number of characters in the 1st sub-word string W_1 and the number of characters in the Kth sub-word string W_K are less than a preset value;

if the number of characters in W_1 and/or W_K is less than the preset value, judging whether the noise probability score of W_1 and/or W_K is greater than a preset noise threshold;

if so, determining that W_1 and/or W_K is noise and deleting W_1 and/or W_K from the word string to obtain a new word string.

2. The OCR character recognition method according to claim 1, characterized by further comprising:

if the number of sub-word strings in the word string equals 2, judging whether the number of characters in W_1 is less than the number of characters in W_K;

if the number of characters in W_1 is less than the number of characters in W_K, further judging whether the number of characters in W_1 is less than the preset value;

if the number of characters in W_1 is less than the preset value, further judging whether the noise probability score of W_1 is greater than the preset noise threshold;

if so, determining that W_1 is noise and deleting W_1 from the word string to obtain a new word string.

3. The OCR character recognition method according to claim 2, characterized by further comprising:

if the number of characters in W_1 is greater than the number of characters in W_K, further judging whether the number of characters in W_K is less than the preset value;

if the number of characters in W_K is less than the preset value, further judging whether the noise probability score of W_K is greater than the preset noise threshold;

if so, determining that W_K is noise and deleting W_K from the word string to obtain a new word string.

4. The OCR character recognition method according to claim 1, characterized in that the noise probability score is obtained by the following formulas: P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1), P_{right} = \alpha \log p(W_K) + \beta \log p(W_K \mid W_{K-1}).

5. The OCR character recognition method according to any one of claims 1 to 4, characterized by further comprising: performing OCR translation on the new word string.

6. An OCR character recognition system, characterized by comprising:

an identification module, configured to perform OCR character recognition on an image in a target area selected by a user to obtain a recognized word string, wherein the word string comprises K sub-word strings, each sub-word string comprises at least 1 character, and K is a positive integer;

a computing module, configured to count the sub-word strings in the recognized word string;

a denoising module, configured to, when the number of sub-word strings in the word string is greater than 2, judge whether the number of characters in the 1st sub-word string W_1 and the number of characters in the Kth sub-word string W_K are less than a preset value; if so, judge whether the noise probability score of W_1 and/or W_K is greater than a preset noise threshold; and if it is, determine that W_1 and/or W_K is noise and delete W_1 and/or W_K from the word string to obtain a new word string.

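7. The OCR character recognition system according to claim 6, characterized in that the denoising module is further configured to: if the number of sub-word strings in the word string equals 2, judge whether the number of characters in W_1 is less than the number of characters in W_K; if the number of characters in W_1 is less than the number of characters in W_K, further judge whether the number of characters in W_1 is less than the preset value; if the number of characters in W_1 is less than the preset value, further judge whether the noise probability score of W_1 is greater than the preset noise threshold; and if so, determine that W_1 is noise and delete W_1 from the word string to obtain a new word string.

8. The OCR character recognition system according to claim 7, characterized in that the denoising module is further configured to: if the number of characters in W_1 is greater than the number of characters in W_K, further judge whether the number of characters in W_K is less than the preset value; if the number of characters in W_K is less than the preset value, further judge whether the noise probability score of W_K is greater than the preset noise threshold; and if so, determine that W_K is noise and delete W_K from the word string to obtain a new word string.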

9. The OCR character recognition system according to claim 6, characterized in that the noise probability score is obtained by the following formulas: P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1), P_{right} = \alpha \log p(W_K) + \beta \log p(W_K \mid W_{K-1}).

10. The OCR character recognition system according to any one of claims 6 to 9, characterized by further comprising:

a translation module, configured to perform OCR translation on the new word string.