CN103679165B

CN103679165B - OCR (optical character recognition) character recognition method and system

Info

Publication number: CN103679165B
Application number: CN201310752624.4A
Authority: CN
Inventors: 王海峰; 和为
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2013-12-31
Filing date: 2013-12-31
Publication date: 2017-02-08
Anticipated expiration: 2033-12-31
Also published as: CN103679165A

Abstract

The invention provides an OCR (optical character recognition) character recognition method. The method comprises the following steps: performing OCR character recognition on an image in a target area selected by a user to obtain a recognized word string; calculating the number of sub-word strings in the recognized word string; when the number of sub-word strings in the word string is greater than 2, judging whether the number of characters in the first sub-word string W1 and the number of characters in the Kth sub-word string WK are smaller than a preset value; if the number of characters in W1 and/or WK is smaller than the preset value, judging whether the noise probability score of W1 and/or WK is greater than a preset noise threshold; and, if so, determining that W1 and/or WK is noise and deleting W1 and/or WK from the word string to obtain a new word string. According to the embodiments, the accuracy of translating OCR recognition results can be improved. The invention also provides an OCR character recognition system.

Description

OCR character recognition method and system

Technical field

The present invention relates to the field of character recognition technology, and in particular to an OCR character recognition method and system.

Background

Many translation apps currently support a photo translation function. A typical procedure is as follows: the user holds a mobile terminal (such as a smartphone) and photographs the foreign-language text to be translated, and the captured photo is covered with a gray layer; the user slides a finger over the gray-covered photo to "wipe out" the word to be translated; OCR is performed on the region the user wiped to obtain the foreign-language text; a machine translation module is called to translate the OCR result, and the translation is finally presented to the user.

The whole procedure is shown in Fig. 1. However, there is a problem in this procedure: when "wiping" a word, because the finger blocks the screen, the user often also "wipes" neighboring words to the left and right or above and below into the OCR range. As shown in the figure, the user intended to translate only the word "Obama", but in practice several letters on each side were also marked, so the OCR result is "it Obama I" and the final machine translation result is "Obama, I". Such a translation result confuses the user and degrades the user experience.

Summary of the invention

The present invention aims to solve at least one of the above technical defects.

To this end, one object of the present invention is to propose an OCR character recognition method that can improve the accuracy of OCR translation of OCR recognition results.

Another object of the present invention is to propose an OCR character recognition system.

To achieve the above objects, an embodiment of the first aspect of the present invention discloses an OCR character recognition method, comprising the following steps: performing OCR character recognition on an image in a target area selected by a user to obtain a recognized word string, wherein the word string includes K sub-word strings, each sub-word string includes at least one character, and K is a positive integer; calculating the number of sub-word strings in the recognized word string; if the number of sub-word strings in the word string is greater than 2, judging whether the number of characters in the first sub-word string W₁ and the number of characters in the K-th sub-word string W_K are less than a preset value; if the number of characters in W₁ and/or the number of characters in W_K is less than the preset value, judging whether the noise probability score of W₁ and/or the noise probability score of W_K is greater than a preset noise threshold; and if so, determining that W₁ and/or W_K is noise and deleting W₁ and/or W_K from the word string to obtain a new word string.

According to the OCR character recognition method of the embodiments of the present invention, noise reduction is performed on the OCR recognition result during OCR translation, so that OCR noise, which is typically introduced by user misoperation, can be identified and deleted. After denoising, the translation result is cleaner and more accurate, which improves the user experience.

In addition, the OCR character recognition method according to the above embodiment of the present invention may also have the following additional technical features:

In some examples, the method further includes: if the number of sub-word strings in the word string is equal to 2, judging whether the number of characters in W₁ is less than the number of characters in W_K; if the number of characters in W₁ is less than the number of characters in W_K, further judging whether the number of characters in W₁ is less than the preset value; if the number of characters in W₁ is less than the preset value, further judging whether the noise probability score of W₁ is greater than the preset noise threshold; and if so, determining that W₁ is noise and deleting W₁ from the word string to obtain a new word string.

In some examples, the method further includes: if the number of characters in W₁ is greater than the number of characters in W_K, further judging whether the number of characters in W_K is less than the preset value; if the number of characters in W_K is less than the preset value, further judging whether the noise probability score of W_K is greater than the preset noise threshold; and if so, determining that W_K is noise and deleting W_K from the word string to obtain a new word string.

In some examples, the noise probability score is obtained by the following formulas:

P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1),

P_{right} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1}).

In some examples, the method further includes: performing OCR translation on the new word string.

An embodiment of the second aspect of the present invention provides an OCR character recognition system, including: an identification module, configured to perform OCR character recognition on an image in a target area selected by a user to obtain a recognized word string, wherein the word string includes K sub-word strings, each sub-word string includes at least one character, and K is a positive integer; a computing module, configured to calculate the number of sub-word strings in the recognized word string; and a denoising module, configured to, when the number of sub-word strings in the word string is greater than 2, judge whether the number of characters in the first sub-word string W₁ and the number of characters in the K-th sub-word string W_K are less than a preset value; if so, judge whether the noise probability score of W₁ and/or the noise probability score of W_K is greater than a preset noise threshold; and if so, determine that W₁ and/or W_K is noise and delete W₁ and/or W_K from the word string to obtain a new word string.

According to the OCR character recognition system of the embodiments of the present invention, noise reduction is performed on the OCR recognition result during OCR translation, so that OCR noise, which is typically introduced by user misoperation, can be identified and deleted. After denoising, the translation result is cleaner and more accurate, which improves the user experience.

In some examples, the denoising module is further configured to: if the number of sub-word strings in the word string is equal to 2, judge whether the number of characters in W₁ is less than the number of characters in W_K; if the number of characters in W₁ is less than the number of characters in W_K, further judge whether the number of characters in W₁ is less than the preset value; if the number of characters in W₁ is less than the preset value, further judge whether the noise probability score of W₁ is greater than the preset noise threshold; and if so, determine that W₁ is noise and delete W₁ from the word string to obtain a new word string.

In some examples, the denoising module is further configured to: if the number of characters in W₁ is greater than the number of characters in W_K, further judge whether the number of characters in W_K is less than the preset value; if the number of characters in W_K is less than the preset value, further judge whether the noise probability score of W_K is greater than the preset noise threshold; and if so, determine that W_K is noise and delete W_K from the word string to obtain a new word string.

In some examples, the noise probability score is obtained by the following formulas:

P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1),

P_{right} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1}).

In some examples, the system further includes: a translation module, configured to perform OCR translation on the new word string.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:

Fig. 1 is a schematic diagram of an interface for OCR recognition and translation;

Fig. 2 is a flow chart of an OCR character recognition method according to an embodiment of the invention;

Fig. 3 is a flow chart of an OCR character recognition method according to another embodiment of the invention; and

Fig. 4 is a structural diagram of an OCR character recognition system according to an embodiment of the invention.

Detailed description

Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting the claims.

In the description of the present invention, it should be understood that orientation or positional relationships indicated by terms such as "longitudinal", "lateral", "up", "down", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore should not be construed as limiting the present invention.

In the description of the present invention, it should be noted that, unless otherwise specified and limited, the terms "mounted", "connected" and "coupled" should be interpreted broadly. For example, a connection may be a mechanical connection or an electrical connection, or internal communication between two elements; it may be a direct connection or an indirect connection through an intermediary. Those of ordinary skill in the art can understand the specific meaning of these terms according to the specific circumstances.

The OCR character recognition method and system according to embodiments of the present invention are described below with reference to the drawings.

Fig. 2 is a flow chart of an OCR character recognition method according to an embodiment of the invention.

As shown in Fig. 2, the OCR character recognition method according to an embodiment of the invention comprises the following steps:

Step S201: Perform OCR character recognition on the image in the target area selected by the user to obtain a recognized word string, wherein the word string includes K sub-word strings, each sub-word string includes at least one character, and K is a positive integer.

Step S202: Calculate the number of sub-word strings in the recognized word string.

Step S203: If the number of sub-word strings in the word string is greater than 2, judge whether the number of characters in the first sub-word string W₁ and the number of characters in the K-th sub-word string W_K are less than a preset value.

Step S204: If the number of characters in W₁ and/or the number of characters in W_K is less than the preset value, judge whether the noise probability score of W₁ and/or the noise probability score of W_K is greater than a preset noise threshold.

Step S205: If so, determine that W₁ and/or W_K is noise and delete W₁ and/or W_K from the word string to obtain a new word string.
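Steps S203 to S205 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `denoise_ends` and the `noise_score` callable are hypothetical, the sub-word strings are assumed to be already split into a list, and the default values (preset value 3, threshold 10.5) are the example values mentioned later in this description.

```python
from typing import Callable, List


def denoise_ends(
    words: List[str],
    noise_score: Callable[[List[str], int], float],
    preset_len: int = 3,      # example preset value from the description
    threshold: float = 10.5,  # example noise threshold from the description
) -> List[str]:
    """Drop the first and/or last sub-word string when it looks like noise.

    Only the K > 2 case (steps S203 to S205) is handled here; shorter
    inputs are returned unchanged.
    """
    if len(words) <= 2:
        return list(words)

    to_delete = set()
    for idx in (0, len(words) - 1):
        # S203: only sub-word strings shorter than the preset value qualify.
        if len(words[idx]) >= preset_len:
            continue
        # S204: compare the noise probability score with the threshold.
        if noise_score(words, idx) > threshold:
            to_delete.add(idx)

    # S205: delete the sub-word strings judged to be noise.
    return [w for i, w in enumerate(words) if i not in to_delete]
```

With a scoring function that rates both end words above the threshold, the recognition result `["it", "Obama", "I"]` from the background example would be reduced to `["Obama"]`.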

In one embodiment of the invention, the OCR character recognition method further comprises the following steps:

1. If the number of sub-word strings in the word string is equal to 2, judge whether the number of characters in W₁ is less than the number of characters in W_K.

2. If the number of characters in W₁ is less than the number of characters in W_K, further judge whether the number of characters in W₁ is less than the preset value.

3. If the number of characters in W₁ is less than the preset value, further judge whether the noise probability score of W₁ is greater than the preset noise threshold.

4. If so, determine that W₁ is noise and delete W₁ from the word string to obtain a new word string.

Further, the method also includes:

1. If the number of characters in W₁ is greater than the number of characters in W_K, further judge whether the number of characters in W_K is less than the preset value.

2. If the number of characters in W_K is less than the preset value, further judge whether the noise probability score of W_K is greater than the preset noise threshold.

3. If so, determine that W_K is noise and delete W_K from the word string to obtain a new word string.

In one embodiment of the invention, the noise probability score is obtained by the following formulas:

P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1),

P_{right} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1}).

After the new word string is obtained, the OCR character recognition method of the embodiment of the present invention further includes: performing OCR translation on the new word string.

As a specific example, assume that in OCR translation the OCR recognition result (i.e., the recognized word string) is a word string W^k containing k words: W₁W₂W₃W₄…W_k-2W_k-1W_k. In W^k, W₁ and W_k may be noise introduced by user misoperation. In general, the noise on each side does not exceed one word in length. Denoising the OCR recognition result therefore consists of calculating the noise probability scores of W₁ and W_k separately; if a noise probability score is greater than a certain threshold (i.e., the preset noise threshold in the above example), W₁ and/or W_k is judged to be noise.

With reference to Fig. 3, the specific steps for determining whether a sub-word string is noise are as follows:

Step S301: Start; input W^k=W₁…W_k.

Step S302: Judge whether K is equal to 1; if so, execute step S303, otherwise execute step S304.

Step S303: Return W₁.

Step S304: Judge whether K is equal to 2; if so, execute step S305, otherwise execute step S308.

Step S305: Judge whether the number of characters in W₁ is less than the number of characters in W₂ (i.e., W_k, since K is equal to 2), that is, whether len(W₁)<len(W₂); if so, execute step S306, otherwise execute step S307.

Step S306: Let T={W₁}, where T denotes the set containing the sub-word string W₁.

Step S307: Let T={W_k}, where T denotes the set containing the sub-word string W_k.

Step S308: Let T={W₁, W_k}, where T denotes the set containing the sub-word strings W₁ and W_k. For the example shown in Fig. 1, T={it, I}.

Step S309: Delete from the set T any word whose character length (i.e., number of characters) is greater than the preset value. Because the English words a user needs translated usually contain more than 3 letters, the preset value may be set to, but is not limited to, 3.

Step S310: For each word in the set T, calculate the noise probability score NoisyScore(); if the noise probability score is greater than the threshold θ (i.e., the preset noise threshold), the sub-word string in the set T is considered to be noise.

Step S311: End.
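The flow of steps S301 to S311 can be sketched as follows. This is an illustrative sketch, not the implementation of the invention: `select_candidates`, `denoise` and the `noise_score` callable are hypothetical names, indices into the word list stand in for the set T, and the final deletion of the noisy words follows step S205 rather than Fig. 3 itself. When K equals 2 and both words have the same length, the sketch follows the "otherwise" branch of step S305 and selects W_k.

```python
from typing import Callable, List


def select_candidates(words: List[str]) -> List[int]:
    """Return the indices forming the candidate set T (steps S302 to S308)."""
    k = len(words)
    if k <= 1:
        return []                      # S303: a single word is returned as is
    if k == 2:
        # S305 to S307: only the shorter of the two words is a candidate;
        # on a tie the "otherwise" branch selects the last word.
        return [0] if len(words[0]) < len(words[1]) else [1]
    return [0, k - 1]                  # S308: both end words are candidates


def denoise(
    words: List[str],
    noise_score: Callable[[List[str], int], float],
    preset_len: int = 3,   # example preset value from step S309
    theta: float = 10.5,   # example threshold given in the description
) -> List[str]:
    """Remove end words that are short enough and score above theta."""
    candidates = select_candidates(words)
    # S309: drop candidates whose length is greater than the preset value.
    candidates = [i for i in candidates if len(words[i]) <= preset_len]
    # S310: a candidate is noise when its score exceeds the threshold.
    noisy = {i for i in candidates if noise_score(words, i) > theta}
    return [w for i, w in enumerate(words) if i not in noisy]
```

Keeping candidate selection separate from scoring means the language-model part can be swapped without touching the flow of Fig. 3.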

In the above example, the noise probability score NoisyScore() may be calculated using a method similar to a statistical language model: for the leftmost word (i.e., W₁), P_left is calculated; for the rightmost word (i.e., W_k), P_right is calculated. The specific formulas are:

P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1);

P_{right} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1}).

Here p(w_i | w_{i-1}) denotes the probability of the bigram w_{i-1}w_i, and is estimated as:

p(w_{i} \mid w_{i-1}) = \frac{count(w_{i-1} w_{i})}{\sum_{w_{i}^{'}} count(w_{i-1} w_{i}^{'})}

and p(w_i) denotes the probability of the unigram w_i, and is estimated as:

p(w_{i}) = \frac{count(w_{i})}{\sum_{w_{i}^{'}} count(w_{i}^{'})}

Here α and β are the weights of the unigram and the bigram, respectively; their values may be, but are not limited to, -1 and -0.5.
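The count-based estimates and the two scores can be sketched together as follows. This is an illustration under assumptions the description does not state: the class name `BigramNoiseScorer` is hypothetical, the training corpus is assumed to be a list of tokenized sentences, the natural logarithm is used because the base is not specified, and unseen words or bigrams raise an error because no smoothing method is described.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


class BigramNoiseScorer:
    """Unigram/bigram relative-frequency model with P_left and P_right."""

    def __init__(self, corpus: Iterable[Sequence[str]],
                 alpha: float = -1.0, beta: float = -0.5):
        self.alpha, self.beta = alpha, beta
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.prefix_totals = Counter()  # sum over w' of count(prev w')
        for sentence in corpus:
            self.unigrams.update(sentence)
            for prev, word in zip(sentence, sentence[1:]):
                self.bigrams[(prev, word)] += 1
                self.prefix_totals[prev] += 1
        self.total = sum(self.unigrams.values())

    def p_uni(self, w: str) -> float:
        # p(w) = count(w) / sum over w' of count(w')
        return self.unigrams[w] / self.total if self.total else 0.0

    def p_bi(self, w: str, prev: str) -> float:
        # p(w | prev) = count(prev w) / sum over w' of count(prev w')
        denom = self.prefix_totals[prev]
        return self.bigrams[(prev, w)] / denom if denom else 0.0

    @staticmethod
    def _log(p: float, what: str) -> float:
        if p <= 0.0:
            # Assumption: fail loudly, since no smoothing is described.
            raise ValueError(f"no counts for {what}")
        return math.log(p)

    def score_left(self, words: Sequence[str]) -> float:
        """P_left = alpha * log p(W1) + beta * log p(W2 | W1)."""
        w1, w2 = words[0], words[1]
        return (self.alpha * self._log(self.p_uni(w1), w1)
                + self.beta * self._log(self.p_bi(w2, w1), f"{w1} {w2}"))

    def score_right(self, words: Sequence[str]) -> float:
        """P_right = alpha * log p(Wk) + beta * log p(Wk | Wk-1)."""
        wk, prev = words[-1], words[-2]
        return (self.alpha * self._log(self.p_uni(wk), wk)
                + self.beta * self._log(self.p_bi(wk, prev), f"{prev} {wk}"))
```

Because α and β are negative, rarer words and rarer word pairs produce larger scores, which is why a larger score indicates a more likely noise word.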

Based on experimental statistics, the threshold θ (i.e., the preset noise threshold) for the noise probability score NoisyScore() may be set to 10.5.
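As a worked illustration of how the score is compared with the threshold, suppose (purely hypothetically) that p(W₁) = 10⁻³ and p(W₂ | W₁) = 10⁻⁴, with α = -1, β = -0.5 and θ = 10.5. The probabilities are invented for illustration, and the natural logarithm is assumed because the description does not state the base.

```latex
\begin{aligned}
P_{left} &= \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1) \\
         &= (-1)\ln(10^{-3}) + (-0.5)\ln(10^{-4}) \\
         &\approx 6.908 + 4.605 = 11.513 > \theta = 10.5
\end{aligned}
```

Under these assumed values W₁ would be judged to be noise, provided it also passes the length check.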

Fig. 4 is a structural diagram of an OCR character recognition system according to an embodiment of the invention. As shown in Fig. 4, the OCR character recognition system 400 according to an embodiment of the invention includes an identification module 410, a computing module 420 and a denoising module 430.

The identification module 410 is configured to perform OCR character recognition on the image in the target area selected by the user to obtain a recognized word string, wherein the word string includes K sub-word strings, each sub-word string includes at least one character, and K is a positive integer. The computing module 420 is configured to calculate the number of sub-word strings in the recognized word string. The denoising module 430 is configured to, when the number of sub-word strings in the word string is greater than 2, judge whether the number of characters in the first sub-word string W₁ and the number of characters in the K-th sub-word string W_K are less than a preset value; if so, judge whether the noise probability score of W₁ and/or the noise probability score of W_K is greater than a preset noise threshold; and if so, determine that W₁ and/or W_K is noise and delete W₁ and/or W_K from the word string to obtain a new word string.
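The division into modules can be sketched as a small pipeline. This is only an illustration of the structure in Fig. 4: the class name `OcrRecognitionSystem` and its fields are hypothetical, the OCR backend and the denoising rule are injected as callables rather than tied to any real library, and splitting the recognized text on whitespace is an assumption about how sub-word strings are delimited.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OcrRecognitionSystem:
    # Identification module 410: maps an image region to recognized text.
    recognize: Callable[[bytes], str]
    # Denoising module 430: maps sub-word strings to the cleaned list.
    denoise: Callable[[List[str]], List[str]]

    def split(self, text: str) -> List[str]:
        """Computing module 420: split the text into sub-word strings.

        The number of sub-word strings K is the length of the result.
        """
        return text.split()

    def run(self, image_region: bytes) -> str:
        """Recognize, count and denoise, returning the new word string."""
        words = self.split(self.recognize(image_region))
        return " ".join(self.denoise(words))
```

For example, wiring in a stub recognizer that returns "it Obama I" and a denoiser that removes both end words yields the new word string "Obama", which could then be passed to a translation step.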

In one embodiment of the invention, the denoising module 430 is further configured to: if the number of sub-word strings in the word string is equal to 2, judge whether the number of characters in W₁ is less than the number of characters in W_K; if the number of characters in W₁ is less than the number of characters in W_K, further judge whether the number of characters in W₁ is less than the preset value; if the number of characters in W₁ is less than the preset value, further judge whether the noise probability score of W₁ is greater than the preset noise threshold; and if so, determine that W₁ is noise and delete W₁ from the word string to obtain a new word string.

Further, the denoising module 430 is also configured to: if the number of characters in W₁ is greater than the number of characters in W_K, further judge whether the number of characters in W_K is less than the preset value; if the number of characters in W_K is less than the preset value, further judge whether the noise probability score of W_K is greater than the preset noise threshold; and if so, determine that W_K is noise and delete W_K from the word string to obtain a new word string.

The noise probability score may be obtained by the following formulas:

P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1),

P_{right} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1}).

Of course, the OCR character recognition system 400 of the embodiment of the present invention further includes a translation module (not shown in the figure), which is configured to perform OCR translation on the new word string.

Specifically, with reference to Fig. 3, the processing procedure of the OCR character recognition system 400 of the embodiment of the present invention is as follows:

Assume that in OCR translation the OCR recognition result (i.e., the recognized word string) is a word string W^k containing k words: W₁W₂W₃W₄…W_k-2W_k-1W_k. In W^k, W₁ and W_k may be noise introduced by user misoperation. In general, the noise on each side does not exceed one word in length. Denoising the OCR recognition result therefore consists of calculating the noise probability scores of W₁ and W_k separately; if a noise probability score is greater than a certain threshold (i.e., the preset noise threshold in the above example), W₁ and/or W_k is judged to be noise.

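With reference to Fig. 3, the specific processing procedure follows steps S301 to S311 described above for the method: the procedure starts with the input W^k=W₁…W_k (step S301), returns W₁ directly when the word string contains only one word (step S303), and otherwise selects and scores the candidate sub-word strings before ending (step S311). The noise probability score is calculated with the same formulas:

P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1);

P_{right} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1}).

Here p(w_i | w_{i-1}) denotes the probability of the bigram w_{i-1}w_i, and is estimated as:

p(w_{i} \mid w_{i-1}) = \frac{count(w_{i-1} w_{i})}{\sum_{w_{i}^{'}} count(w_{i-1} w_{i}^{'})}

and p(w_i) denotes the probability of the unigram w_i, and is estimated as:

p(w_{i}) = \frac{count(w_{i})}{\sum_{w_{i}^{'}} count(w_{i}^{'})}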

In the description of this specification, a description referring to the terms "an embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that the specific features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the claims and their equivalents.

Claims

1. An OCR character recognition method, characterized by comprising the following steps:

performing OCR character recognition on an image in a target area selected by a user to obtain a recognized word string, wherein the word string includes K sub-word strings, each sub-word string includes at least one character, and K is a positive integer;

calculating the number of sub-word strings in the recognized word string;

if the number of sub-word strings in the word string is greater than 2, judging whether the number of characters in the first sub-word string W₁ and the number of characters in the K-th sub-word string W_K are less than a preset value;

if the number of characters in W₁ and/or the number of characters in W_K is less than the preset value, judging whether the noise probability score of W₁ and/or the noise probability score of W_K is greater than a preset noise threshold, wherein the noise probability score is used to evaluate whether a sub-word string is noise;

if so, determining that W₁ and/or W_K is noise and deleting W₁ and/or W_K from the word string to obtain a new word string.

2. The OCR character recognition method according to claim 1, characterized by further comprising:

if the number of sub-word strings in the word string is equal to 2, judging whether the number of characters in W₁ is less than the number of characters in W_K;

if the number of characters in W₁ is less than the number of characters in W_K, further judging whether the number of characters in W₁ is less than the preset value;

if the number of characters in W₁ is less than the preset value, further judging whether the noise probability score of W₁ is greater than the preset noise threshold;

if so, determining that W₁ is noise and deleting W₁ from the word string to obtain a new word string.

3. The OCR character recognition method according to claim 2, characterized by further comprising:

if the number of characters in W₁ is greater than the number of characters in W_K, further judging whether the number of characters in W_K is less than the preset value;

if the number of characters in W_K is less than the preset value, further judging whether the noise probability score of W_K is greater than the preset noise threshold;

if so, determining that W_K is noise and deleting W_K from the word string to obtain a new word string.

4. The OCR character recognition method according to claim 1, characterized in that the noise probability score is obtained by the following formulas: P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1), P_{right} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1}), wherein α and β are the weights of the unigram and the bigram, p(w_i | w_{i-1}) is the probability of the bigram w_{i-1}w_i, and p(w_i) is the probability of the unigram w_i.

5. The OCR character recognition method according to any one of claims 1 to 4, characterized by further comprising: performing OCR translation on the new word string.

6. An OCR character recognition system, characterized by comprising:

an identification module, configured to perform OCR character recognition on an image in a target area selected by a user to obtain a recognized word string, wherein the word string includes K sub-word strings, each sub-word string includes at least one character, and K is a positive integer;

a computing module, configured to calculate the number of sub-word strings in the recognized word string;

a denoising module, configured to, when the number of sub-word strings in the word string is greater than 2, judge whether the number of characters in the first sub-word string W₁ and the number of characters in the K-th sub-word string W_K are less than a preset value; if so, judge whether the noise probability score of W₁ and/or the noise probability score of W_K is greater than a preset noise threshold; and if so, determine that W₁ and/or W_K is noise and delete W₁ and/or W_K from the word string to obtain a new word string, wherein the noise probability score is used to evaluate whether a sub-word string is noise.

7. The OCR character recognition system according to claim 6, characterized in that the denoising module is further configured to:

8. The OCR character recognition system according to claim 7, characterized in that the denoising module is further configured to:

9. The OCR character recognition system according to claim 6, characterized in that the noise probability score is obtained by the following formulas: P_{left} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1), P_{right} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1}), wherein α and β are the weights of the unigram and the bigram, p(w_i | w_{i-1}) is the probability of the bigram w_{i-1}w_i, and p(w_i) is the probability of the unigram w_i.

10. The OCR character recognition system according to any one of claims 6 to 9, characterized by further comprising: a translation module, configured to perform OCR translation on the new word string.