CN103679165B - OCR (optical character recognition) character recognition method and system - Google Patents

Publication number
CN103679165B
Authority
CN
China
Prior art keywords
word string
noise
character
ocr
less
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310752624.4A
Other languages
Chinese (zh)
Other versions
CN103679165A (en)
Inventor
王海峰
和为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201310752624.4A
Publication of CN103679165A
Application granted
Publication of CN103679165B

Abstract

The invention provides an OCR (optical character recognition) character recognition method. The method comprises the following steps: performing OCR character recognition on an image in a target area selected by a user to obtain a recognized word string; counting the sub-word strings in the recognized word string; when the number of sub-word strings is greater than 2, judging whether the number of characters in the first sub-word string W1 and the number of characters in the Kth sub-word string WK are smaller than a preset value; if the number of characters in W1 and/or WK is smaller than the preset value, judging whether the noise probability score of W1 and/or WK is greater than a preset noise threshold; and if so, determining that W1 and/or WK is noise and deleting W1 and/or WK from the word string to obtain a new word string. According to the embodiments, the accuracy of translating OCR recognition results can be improved. The invention also provides an OCR character recognition system.

Description

OCR character recognition method and system
Technical field
The present invention relates to the field of character recognition technology, and in particular to an OCR character recognition method and system.
Background art
Many translation apps currently support photo translation. A typical workflow is as follows: the user holds a mobile terminal (such as a smartphone) and photographs the foreign-language text to be translated, and the photo is covered with a gray layer; the user slides a finger over the gray-covered photo to "wipe" out the word to be translated; OCR is performed on the region the user wiped to obtain the foreign-language text; and a machine translation module is called to translate the OCR result, which is finally presented to the user.
The whole process is shown in Fig. 1. There is, however, a problem with this process: because the finger blocks the screen while the user "wipes" a word, neighboring words to the left or right are often "wiped" into the OCR range as well. As shown in Fig. 1, the user wants to translate only the word "Obama", but in practice a few letters on each side are also marked, so the OCR result is "it Obama I" and the final machine translation result is "Obama, I". Such a translation result confuses the user and harms the user experience.
Summary of the invention
The present invention aims to solve at least one of the technical defects described above.
To this end, one object of the present invention is to propose an OCR character recognition method that can improve the accuracy of translating OCR recognition results.
Another object of the present invention is to propose an OCR character recognition system.
To achieve the above objects, an embodiment of the first aspect of the present invention discloses an OCR character recognition method comprising the following steps: performing OCR character recognition on an image in a target area selected by a user to obtain a recognized word string, wherein the word string includes K sub-word strings, each sub-word string includes at least one character, and K is a positive integer; counting the sub-word strings in the recognized word string; if the number of sub-word strings in the word string is greater than 2, judging whether the number of characters in the first sub-word string W1 and the number of characters in the Kth sub-word string WK are less than a preset value; if the number of characters in W1 and/or WK is less than the preset value, judging whether the noise probability score of W1 and/or WK is greater than a preset noise threshold; and if so, determining that W1 and/or WK is noise and deleting W1 and/or WK from the word string to obtain a new word string.
The OCR character recognition method according to embodiments of the present invention denoises the OCR recognition result during OCR translation, so that OCR noise, which is typically introduced by user misoperation, can be identified and deleted. After denoising, the translation result is cleaner and more accurate, which improves the user experience.
In addition, the OCR character recognition method according to the above embodiment of the present invention may have the following additional technical features:
In some examples, the method further includes: if the number of sub-word strings in the word string is equal to 2, judging whether the number of characters in W1 is less than the number of characters in WK; if the number of characters in W1 is less than the number of characters in WK, further judging whether the number of characters in W1 is less than the preset value; if the number of characters in W1 is less than the preset value, further judging whether the noise probability score of W1 is greater than the preset noise threshold; and if so, determining that W1 is noise and deleting W1 from the word string to obtain a new word string.
In some examples, the method further includes: if the number of characters in W1 is greater than the number of characters in WK, further judging whether the number of characters in WK is less than the preset value; if the number of characters in WK is less than the preset value, further judging whether the noise probability score of WK is greater than the preset noise threshold; and if so, determining that WK is noise and deleting WK from the word string to obtain a new word string.
In some examples, the noise probability score is obtained by the following formulas:
$$P_{\mathrm{left}} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1),$$
$$P_{\mathrm{right}} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1}).$$
In some examples, the method further includes: performing OCR translation on the new word string.
An embodiment of the second aspect of the present invention provides an OCR character recognition system, including: a recognition module, configured to perform OCR character recognition on an image in a target area selected by a user to obtain a recognized word string, wherein the word string includes K sub-word strings, each sub-word string includes at least one character, and K is a positive integer; a computing module, configured to count the sub-word strings in the recognized word string; and a denoising module, configured to, when the number of sub-word strings in the word string is greater than 2, judge whether the number of characters in the first sub-word string W1 and the number of characters in the Kth sub-word string WK are less than a preset value, to judge, if so, whether the noise probability score of W1 and/or WK is greater than a preset noise threshold, and, if the score is greater than the preset noise threshold, to determine that W1 and/or WK is noise and delete W1 and/or WK from the word string to obtain a new word string.
The OCR character recognition system according to embodiments of the present invention denoises the OCR recognition result during OCR translation, so that OCR noise, which is typically introduced by user misoperation, can be identified and deleted. After denoising, the translation result is cleaner and more accurate, which improves the user experience.
In addition, the OCR character recognition system according to the above embodiment of the present invention may have the following additional technical features:
In some examples, the denoising module is further configured to: if the number of sub-word strings in the word string is equal to 2, judge whether the number of characters in W1 is less than the number of characters in WK; if the number of characters in W1 is less than the number of characters in WK, further judge whether the number of characters in W1 is less than the preset value; if the number of characters in W1 is less than the preset value, further judge whether the noise probability score of W1 is greater than the preset noise threshold; and if so, determine that W1 is noise and delete W1 from the word string to obtain a new word string.
In some examples, the denoising module is further configured to: if the number of characters in W1 is greater than the number of characters in WK, further judge whether the number of characters in WK is less than the preset value; if the number of characters in WK is less than the preset value, further judge whether the noise probability score of WK is greater than the preset noise threshold; and if so, determine that WK is noise and delete WK from the word string to obtain a new word string.
In some examples, the noise probability score is obtained by the following formulas:
$$P_{\mathrm{left}} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1),$$
$$P_{\mathrm{right}} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1}).$$
In some examples, the system further includes: a translation module, configured to perform OCR translation on the new word string.
Additional aspects and advantages of the present invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic diagram of an interface for OCR recognition and translation;
Fig. 2 is a flow chart of an OCR character recognition method according to one embodiment of the present invention;
Fig. 3 is a flow chart of an OCR character recognition method according to another embodiment of the present invention; and
Fig. 4 is a structural diagram of an OCR character recognition system according to one embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and shall not be construed as limiting the claims.
In the description of the present invention, it should be understood that orientation or positional terms such as "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore shall not be construed as limiting the present invention.
In the description of the present invention, it should be noted that, unless otherwise specified and limited, the terms "mounted", "connected" and "coupled" shall be understood broadly: a connection may be mechanical or electrical, or internal between two elements, and may be direct or indirect through an intermediary. Those of ordinary skill in the art can understand the specific meaning of these terms according to the specific circumstances.
The OCR character recognition method and system according to embodiments of the present invention are described below with reference to the drawings.
Fig. 2 is a flow chart of an OCR character recognition method according to one embodiment of the present invention.
As shown in Fig. 2, the OCR character recognition method according to one embodiment of the present invention comprises the following steps:
Step S201: Perform OCR character recognition on the image in the target area selected by the user to obtain a recognized word string, wherein the word string includes K sub-word strings, each sub-word string includes at least one character, and K is a positive integer.
Step S202: Count the sub-word strings in the recognized word string.
Step S203: If the number of sub-word strings in the word string is greater than 2, judge whether the number of characters in the first sub-word string W1 and the number of characters in the Kth sub-word string WK are less than a preset value.
Step S204: If the number of characters in W1 and/or WK is less than the preset value, judge whether the noise probability score of W1 and/or WK is greater than a preset noise threshold.
Step S205: If so, determine that W1 and/or WK is noise and delete W1 and/or WK from the word string to obtain a new word string.
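The K > 2 branch of steps S203 to S205 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the function name `denoise_edges`, the pluggable `noise_score` callback, and the toy score table are all hypothetical, and the defaults 3 and 10.5 are borrowed from the example values given later in this description.

```python
from typing import Callable, List

def denoise_edges(words: List[str],
                  noise_score: Callable[[List[str], int], float],
                  preset_len: int = 3,
                  threshold: float = 10.5) -> List[str]:
    """Drop the first and/or last sub-word string when it looks like noise (K > 2)."""
    if len(words) <= 2:
        return list(words)  # the K <= 2 cases are handled separately
    last = len(words) - 1
    # S203/S204: an edge word is suspicious only if it is short AND scores high.
    drop_first = len(words[0]) < preset_len and noise_score(words, 0) > threshold
    drop_last = len(words[last]) < preset_len and noise_score(words, last) > threshold
    # S205: delete the noisy edge words to obtain the new word string.
    start = 1 if drop_first else 0
    end = last if drop_last else len(words)
    return words[start:end]

# Hypothetical stand-in for the noise probability score (toy values only).
TOY_SCORES = {"it": 12.0, "I": 11.0}

def toy_score(words: List[str], index: int) -> float:
    return TOY_SCORES.get(words[index], 0.0)

print(denoise_edges(["it", "Obama", "I"], toy_score))  # ['Obama']
```

Passing the scorer in as a callback keeps the length test and the threshold test separate from however the score itself is estimated.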
In one embodiment of the present invention, the OCR character recognition method further comprises the following steps:
1. If the number of sub-word strings in the word string is equal to 2, judge whether the number of characters in W1 is less than the number of characters in WK.
2. If the number of characters in W1 is less than the number of characters in WK, further judge whether the number of characters in W1 is less than the preset value.
3. If the number of characters in W1 is less than the preset value, further judge whether the noise probability score of W1 is greater than the preset noise threshold.
4. If so, determine that W1 is noise and delete W1 from the word string to obtain a new word string.
Further, the method also includes:
1. If the number of characters in W1 is greater than the number of characters in WK, further judge whether the number of characters in WK is less than the preset value.
2. If the number of characters in WK is less than the preset value, further judge whether the noise probability score of WK is greater than the preset noise threshold.
3. If so, determine that WK is noise and delete WK from the word string to obtain a new word string.
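The two-word case above can be sketched in the same style. The names `denoise_pair` and `toy_pair_score` are hypothetical. The text above does not say what happens when both words have the same length; this sketch treats a tie like the second case (WK becomes the candidate), which is an assumption chosen to match the "otherwise" branch of step S305 in Fig. 3.

```python
from typing import Callable, List

def denoise_pair(words: List[str],
                 noise_score: Callable[[List[str], int], float],
                 preset_len: int = 3,
                 threshold: float = 10.5) -> List[str]:
    """K == 2: only the shorter of the two sub-word strings is a noise candidate."""
    if len(words) != 2:
        raise ValueError("denoise_pair expects exactly two sub-word strings")
    # Pick the shorter word; on a tie the second word is chosen (assumption).
    idx = 0 if len(words[0]) < len(words[1]) else 1
    if len(words[idx]) < preset_len and noise_score(words, idx) > threshold:
        return [words[1 - idx]]  # delete the noisy word, keep the other one
    return list(words)

# Hypothetical toy scorer, for illustration only.
def toy_pair_score(words: List[str], index: int) -> float:
    return {"it": 12.0, "I": 11.0}.get(words[index], 0.0)

print(denoise_pair(["it", "Obama"], toy_pair_score))
print(denoise_pair(["Barack", "Obama"], toy_pair_score))
```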
In one embodiment of the present invention, the noise probability score is obtained by the following formulas:
$$P_{\mathrm{left}} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1),$$
$$P_{\mathrm{right}} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1}).$$
After the new word string is obtained, the OCR character recognition method of the embodiment of the present invention further includes: performing OCR translation on the new word string.
As a specific example, assume that in OCR translation the OCR recognition result (the recognized word string) is a word string W^k containing k words: W1 W2 W3 W4 … Wk-2 Wk-1 Wk. In W^k, W1 and Wk may be noise introduced by user misoperation. In general, the noise is no longer than one word. Denoising the OCR recognition result therefore means computing the noise probability scores of W1 and Wk separately; if a noise probability score is greater than a certain threshold (the preset noise threshold in the examples above), W1 and/or Wk is determined to be noise.
With reference to Fig. 3, the steps for determining whether a word is noise are as follows:
Step S301: Start; input the word string W^k = W1 … Wk.
Step S302: Judge whether K is equal to 1; if so, go to step S303, otherwise go to step S304.
Step S303: Return W1.
Step S304: Judge whether K is equal to 2; if so, go to step S305, otherwise go to step S308.
Step S305: Judge whether the number of characters in W1 is less than the number of characters in W2 (that is, Wk with K equal to 2), i.e. whether len(W1) < len(W2); if so, go to step S306, otherwise go to step S307.
Step S306: Let T = {W1}, where T denotes the set containing the sub-word string W1.
Step S307: Let T = {Wk}, where T denotes the set containing the sub-word string Wk.
Step S308: Let T = {W1, Wk}, where T denotes the set containing the sub-word strings W1 and Wk. For the example in Fig. 1, T = {it, I}.
Step S309: Delete from the set T every word whose character length (number of characters) is greater than the preset value. Because English words that need to be translated usually contain more than 3 letters, the preset value may be set to, but is not limited to, 3.
Step S310: For each word in the set T, compute the noise probability score NoisyScore(); if the score is greater than the threshold θ (the preset noise threshold), the sub-word string in T is considered noise.
Step S311: End.
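Steps S302 to S309, which build the candidate set T before any scoring, can be sketched as below. The function name `candidate_indices` is hypothetical, and the sketch returns positions rather than words so that repeated words stay distinguishable. It follows the wording of step S309 literally (words longer than the preset value are removed, so a word of exactly 3 letters stays a candidate), which differs slightly from the "less than the preset value" wording used in steps S203 and S204.

```python
from typing import List

def candidate_indices(words: List[str], preset_len: int = 3) -> List[int]:
    """Return the positions of the edge words that remain in the set T."""
    k = len(words)
    if k <= 1:
        return []                      # S302/S303: a single word is returned as is
    if k == 2:
        # S305 to S307: only the shorter word is a candidate (ties pick the second).
        t = [0] if len(words[0]) < len(words[1]) else [1]
    else:
        t = [0, k - 1]                 # S308: both edge words are candidates
    # S309: drop candidates whose length is greater than the preset value.
    return [i for i in t if len(words[i]) <= preset_len]

print(candidate_indices(["it", "Obama", "I"]))  # [0, 2]
```

Step S310 would then score only the words at the returned positions.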
In the above example, the noise probability score NoisyScore() can be computed with a method similar to a statistical language model: for the leftmost word (W1), P_left is computed; for the rightmost word (Wk), P_right is computed. The formulas are:
$$P_{\mathrm{left}} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1);$$
$$P_{\mathrm{right}} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1}).$$
Here p(w_i | w_{i-1}) denotes the probability of the bigram w_{i-1} w_i, estimated as:
$$p(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1} w_i)}{\sum_{w_i} \mathrm{count}(w_{i-1} w_i)}$$
and p(w_i) denotes the probability of the unigram w_i, estimated as:
$$p(w_i) = \frac{\mathrm{count}(w_i)}{\sum_{w_i'} \mathrm{count}(w_i')}$$
where α and β are the weights of the unigram and the bigram, with values of, for example but not limited to, -1 and -0.5, respectively.
Based on experimental statistics, the threshold θ (the preset noise threshold) of the noise probability score NoisyScore() may be set to 10.5.
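The count-based estimates and the two score formulas can be sketched as follows. Several points are assumptions rather than part of the source: the helper names (`train_counts`, `p_left`, `p_right`), the use of natural logarithms (the text does not state a base), and the small probability floor `eps` used to avoid log(0) for unseen words (the text does not describe any smoothing). The three-sentence corpus is made up purely for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple

def train_counts(sentences: List[str]) -> Tuple[Counter, Counter]:
    """Collect unigram and bigram counts from whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def _log_p(num: int, den: int, eps: float = 1e-6) -> float:
    # Floor at eps so unseen events do not produce log(0) (assumed smoothing).
    return math.log(max(num / den, eps)) if den > 0 else math.log(eps)

def _log_unigram(w: str, uni: Counter) -> float:
    return _log_p(uni[w], sum(uni.values()))

def _log_bigram(prev: str, w: str, bi: Counter) -> float:
    context = sum(c for (a, _), c in bi.items() if a == prev)
    return _log_p(bi[(prev, w)], context)

def p_left(words, uni, bi, alpha=-1.0, beta=-0.5) -> float:
    """alpha*log p(W1) + beta*log p(W2|W1); needs at least two words."""
    return alpha * _log_unigram(words[0], uni) + beta * _log_bigram(words[0], words[1], bi)

def p_right(words, uni, bi, alpha=-1.0, beta=-0.5) -> float:
    """alpha*log p(Wk) + beta*log p(Wk|Wk-1); needs at least two words."""
    return alpha * _log_unigram(words[-1], uni) + beta * _log_bigram(words[-2], words[-1], bi)

uni, bi = train_counts(["president Obama spoke", "president Obama said", "it is"])
print(p_left(["it", "Obama", "I"], uni, bi), p_right(["it", "Obama", "I"], uni, bi))
```

Because α and β are negative, rarer words and rarer word pairs yield larger positive scores, which is why a score above θ indicates likely noise. With a corpus this small the absolute values are not meaningful; only the ordering is.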
The OCR character recognition method according to embodiments of the present invention denoises the OCR recognition result during OCR translation, so that OCR noise, which is typically introduced by user misoperation, can be identified and deleted. After denoising, the translation result is cleaner and more accurate, which improves the user experience.
Fig. 4 is a structural diagram of an OCR character recognition system according to one embodiment of the present invention. As shown in Fig. 4, the OCR character recognition system 400 according to one embodiment of the present invention includes: a recognition module 410, a computing module 420 and a denoising module 430.
The recognition module 410 is configured to perform OCR character recognition on the image in the target area selected by the user to obtain a recognized word string, wherein the word string includes K sub-word strings, each sub-word string includes at least one character, and K is a positive integer. The computing module 420 is configured to count the sub-word strings in the recognized word string. The denoising module 430 is configured to, when the number of sub-word strings in the word string is greater than 2, judge whether the number of characters in the first sub-word string W1 and the number of characters in the Kth sub-word string WK are less than a preset value; if so, judge whether the noise probability score of W1 and/or WK is greater than a preset noise threshold; and, if the score is greater than the preset noise threshold, determine that W1 and/or WK is noise and delete W1 and/or WK from the word string to obtain a new word string.
In one embodiment of the present invention, the denoising module 430 is further configured to: if the number of sub-word strings in the word string is equal to 2, judge whether the number of characters in W1 is less than the number of characters in WK; if the number of characters in W1 is less than the number of characters in WK, further judge whether the number of characters in W1 is less than the preset value; if the number of characters in W1 is less than the preset value, further judge whether the noise probability score of W1 is greater than the preset noise threshold; and if so, determine that W1 is noise and delete W1 from the word string to obtain a new word string.
Further, the denoising module 430 is also configured to: if the number of characters in W1 is greater than the number of characters in WK, further judge whether the number of characters in WK is less than the preset value; if the number of characters in WK is less than the preset value, further judge whether the noise probability score of WK is greater than the preset noise threshold; and if so, determine that WK is noise and delete WK from the word string to obtain a new word string.
The noise probability score can be obtained by the following formulas:
$$P_{\mathrm{left}} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1),$$
$$P_{\mathrm{right}} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1}).$$
The OCR character recognition system 400 of the embodiment of the present invention may also include a translation module (not shown in the figure), which is configured to perform OCR translation on the new word string.
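One way to picture how the modules of system 400 fit together is the sketch below. It is only an illustration under stated assumptions: the class name `OcrSystem`, its field names, and the three stub callables are hypothetical, and no real OCR or translation engine is called.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class OcrSystem:
    recognize: Callable[[bytes], str]           # stands in for recognition module 410
    denoise: Callable[[List[str]], List[str]]   # stands in for denoising module 430
    translate: Callable[[str], str]             # stands in for the translation module

    def run(self, image: bytes) -> str:
        words = self.recognize(image).split()
        k = len(words)                          # computing module 420: count sub-word strings
        cleaned = self.denoise(words) if k > 1 else words
        return self.translate(" ".join(cleaned))

# Stubs for illustration only: a fixed OCR result, a naive edge trimmer,
# and upper-casing in place of machine translation.
system = OcrSystem(
    recognize=lambda image: "it Obama I",
    denoise=lambda ws: ws[1:-1] if len(ws) > 2 else ws,
    translate=str.upper,
)
print(system.run(b""))  # OBAMA
```

Keeping each module behind a plain callable makes it easy to swap the stub denoiser for one based on the noise probability score.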
Specifically, with reference to Fig. 3, the OCR character recognition system 400 of the embodiment of the present invention operates as follows:
Assume that in OCR translation the OCR recognition result (the recognized word string) is a word string W^k containing k words: W1 W2 W3 W4 … Wk-2 Wk-1 Wk. In W^k, W1 and Wk may be noise introduced by user misoperation. In general, the noise is no longer than one word. Denoising the OCR recognition result therefore means computing the noise probability scores of W1 and Wk separately; if a noise probability score is greater than a certain threshold (the preset noise threshold in the examples above), W1 and/or Wk is determined to be noise.
With reference to Fig. 3, the processing includes the following steps:
Step S301: Start; input the word string W^k = W1 … Wk.
Step S302: Judge whether K is equal to 1; if so, go to step S303, otherwise go to step S304.
Step S303: Return W1.
Step S304: Judge whether K is equal to 2; if so, go to step S305, otherwise go to step S308.
Step S305: Judge whether the number of characters in W1 is less than the number of characters in W2 (that is, Wk with K equal to 2), i.e. whether len(W1) < len(W2); if so, go to step S306, otherwise go to step S307.
Step S306: Let T = {W1}, where T denotes the set containing the sub-word string W1.
Step S307: Let T = {Wk}, where T denotes the set containing the sub-word string Wk.
Step S308: Let T = {W1, Wk}, where T denotes the set containing the sub-word strings W1 and Wk. For the example in Fig. 1, T = {it, I}.
Step S309: Delete from the set T every word whose character length (number of characters) is greater than the preset value. Because English words that need to be translated usually contain more than 3 letters, the preset value may be set to, but is not limited to, 3.
Step S310: For each word in the set T, compute the noise probability score NoisyScore(); if the score is greater than the threshold θ (the preset noise threshold), the sub-word string in T is considered noise.
Step S311: End.
In the above example, the noise probability score NoisyScore() can be computed with a method similar to a statistical language model: for the leftmost word (W1), P_left is computed; for the rightmost word (Wk), P_right is computed. The formulas are:
$$P_{\mathrm{left}} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1);$$
$$P_{\mathrm{right}} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1}).$$
Here p(w_i | w_{i-1}) denotes the probability of the bigram w_{i-1} w_i, estimated as:
$$p(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1} w_i)}{\sum_{w_i} \mathrm{count}(w_{i-1} w_i)}$$
and p(w_i) denotes the probability of the unigram w_i, estimated as:
$$p(w_i) = \frac{\mathrm{count}(w_i)}{\sum_{w_i'} \mathrm{count}(w_i')}$$
where α and β are the weights of the unigram and the bigram, with values of, for example but not limited to, -1 and -0.5, respectively.
Based on experimental statistics, the threshold θ (the preset noise threshold) of the noise probability score NoisyScore() may be set to 10.5.
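As a worked illustration of how the weights and the threshold interact, take the example values α = -1, β = -0.5 and θ = 10.5, and assume purely hypothetical probabilities p(W_1) = 10^{-4} and p(W_2 | W_1) = 10^{-6}. Natural logarithms are also an assumption, since the text does not state the base:

$$P_{\mathrm{left}} = (-1)\ln(10^{-4}) + (-0.5)\ln(10^{-6}) \approx 9.21 + 6.91 = 16.12 > 10.5$$

Under these assumed numbers W_1 would be judged to be noise. A common word in a common pair, say p(W_1) = 10^{-2} and p(W_2 | W_1) = 10^{-1}, gives roughly 4.61 + 1.15 = 5.76, which stays below the threshold.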
The OCR character recognition system according to embodiments of the present invention denoises the OCR recognition result during OCR translation, so that OCR noise, which is typically introduced by user misoperation, can be identified and deleted. After denoising, the translation result is cleaner and more accurate, which improves the user experience.
In this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" mean that specific features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. Schematic uses of these terms in this specification do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principles and spirit of the present invention, the scope of which is defined by the claims and their equivalents.

Claims (10)

1. An OCR character recognition method, characterized by comprising the following steps:
performing OCR character recognition on an image in a target area selected by a user to obtain a recognized word string, wherein the word string includes K sub-word strings, each sub-word string includes at least one character, and K is a positive integer;
counting the sub-word strings in the recognized word string;
if the number of sub-word strings in the word string is greater than 2, judging whether the number of characters in the first sub-word string W1 and the number of characters in the Kth sub-word string WK are less than a preset value;
if the number of characters in W1 and/or the number of characters in WK is less than the preset value, judging whether the noise probability score of W1 and/or WK is greater than a preset noise threshold, wherein the noise probability score is used to evaluate whether a sub-word string is noise;
if so, determining that W1 and/or WK is noise and deleting W1 and/or WK from the word string to obtain a new word string.
2. The OCR character recognition method according to claim 1, characterized by further comprising:
if the number of sub-word strings in the word string is equal to 2, judging whether the number of characters in W1 is less than the number of characters in WK;
if the number of characters in W1 is less than the number of characters in WK, further judging whether the number of characters in W1 is less than the preset value;
if the number of characters in W1 is less than the preset value, further judging whether the noise probability score of W1 is greater than the preset noise threshold;
if so, determining that W1 is noise and deleting W1 from the word string to obtain a new word string.
3. The OCR character recognition method according to claim 2, characterized by further comprising:
if the number of characters in W1 is greater than the number of characters in WK, further judging whether the number of characters in WK is less than the preset value;
if the number of characters in WK is less than the preset value, further judging whether the noise probability score of WK is greater than the preset noise threshold;
if so, determining that WK is noise and deleting WK from the word string to obtain a new word string.
4. The OCR character recognition method according to claim 1, characterized in that the noise probability score is obtained by the following formulas: $P_{\mathrm{left}} = \alpha \log p(W_1) + \beta \log p(W_2 \mid W_1)$ and $P_{\mathrm{right}} = \alpha \log p(W_k) + \beta \log p(W_k \mid W_{k-1})$, wherein α and β are the weights of the unigram and the bigram, $p(w_i \mid w_{i-1})$ is the probability of the bigram $w_{i-1} w_i$, and $p(w_i)$ is the probability of the unigram $w_i$.
5. The OCR character recognition method according to any one of claims 1-4, characterized by further comprising: performing OCR translation on the new word string.
6. a kind of OCR character recognition system is it is characterised in that include:
Identification module, the word string carrying out OCR character recognition to be identified for the image in target area that user is selected, Wherein, described word string includes K sub- word string, and every sub- word string at least includes 1 character, and described K is positive integer;
Computing module, for calculating the quantity of the word string neutron word string of described identification;
Denoising module, is more than 2 for the quantity in described word string neutron word string, judges described 1st sub- word string W1Middle character Number and sub- word string W of described k-thKWhether the number of middle character is less than preset value, during if less than described preset value, judges described W1Noise probability score and/or described WKNoise probability score whether more than default noise, default make an uproar if greater than described Sound, then judge described W1And/or described WKDelete described W for noise and from described word string1And/or described WKNew to obtain Word string, wherein, described noise probability score is used for evaluating whether sub- word string is noise.
7. The OCR character recognition system according to claim 6, characterized in that the denoising module is further configured to:
If the number of sub-word strings in the word string is equal to 2, judge whether the number of characters in W1 is less than the number of characters in WK;
If the number of characters in W1 is less than the number of characters in WK, further judge whether the number of characters in W1 is less than the preset value;
If the number of characters in W1 is less than the preset value, further judge whether the noise probability score of W1 is greater than the preset noise threshold;
If so, determine that W1 is noise and delete W1 from the word string to obtain a new word string.
8. The OCR character recognition system according to claim 7, characterized in that the denoising module is further configured to:
If the number of characters in W1 is greater than the number of characters in WK, further judge whether the number of characters in WK is less than the preset value;
If the number of characters in WK is less than the preset value, further judge whether the noise probability score of WK is greater than the preset noise threshold;
If so, determine that WK is noise and delete WK from the word string to obtain a new word string.
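The decision logic of the denoising module in claims 6 to 8 can be sketched in Python as below. This is only an illustrative reading of the claims: the function name `denoise`, the caller-supplied scoring callables `left_score` and `right_score`, and the default values of `max_chars` and `threshold` are hypothetical. The claims do not say what happens when a two-part word string has parts of equal length, so the sketch assumes nothing is deleted in that case, and it treats the "and/or" of claim 6 as two independent checks.

```python
def denoise(words, left_score, right_score, max_chars=3, threshold=-5.0):
    """Drop a short, noise-like first and/or last sub-word string.

    left_score and right_score are callables that take the full list of
    sub-word strings and return the noise probability score of the first
    and last sub-word string. max_chars stands in for the preset value and
    threshold for the preset noise threshold; both defaults are assumptions.
    """
    words = list(words)
    k = len(words)
    if k < 2:
        return words  # nothing to trim from a single sub-word string

    def is_noise(word, score):
        # Short sub-word string whose score exceeds the preset threshold.
        return len(word) < max_chars and score > threshold

    drop_first = drop_last = False
    if k == 2:
        # Only the shorter of the two sub-word strings is a candidate.
        if len(words[0]) < len(words[1]):
            drop_first = is_noise(words[0], left_score(words))
        elif len(words[0]) > len(words[1]):
            drop_last = is_noise(words[1], right_score(words))
    else:
        # More than two sub-word strings: check both ends independently.
        drop_first = is_noise(words[0], left_score(words))
        drop_last = is_noise(words[-1], right_score(words))

    start = 1 if drop_first else 0
    end = k - 1 if drop_last else k
    return words[start:end]
```

Both scores are computed on the original word string before anything is deleted, so removing the first sub-word string does not change the score used for the last one.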
9. The OCR character recognition system according to claim 6, characterized in that the noise probability score is obtained by the following formulas: P_left = α·log p(W_1) + β·log p(W_2 | W_1), P_right = α·log p(W_K) + β·log p(W_K | W_{K-1}), wherein α and β are the weights of the unigram and the bigram respectively, p(w_i | w_{i-1}) is the probability of the bigram w_{i-1} w_i, and p(w_i) is the probability of the unigram w_i.
10. The OCR character recognition system according to any one of claims 6-9, characterized by further comprising: a translation module, configured to perform OCR translation on the new word string.
CN201310752624.4A 2013-12-31 2013-12-31 OCR (optical character recognition) character recognition method and system Active CN103679165B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310752624.4A CN103679165B (en) 2013-12-31 2013-12-31 OCR (optical character recognition) character recognition method and system

Publications (2)

Publication Number Publication Date
CN103679165A CN103679165A (en) 2014-03-26
CN103679165B true CN103679165B (en) 2017-02-08

Family

ID=50316655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310752624.4A Active CN103679165B (en) 2013-12-31 2013-12-31 OCR (optical character recognition) character recognition method and system

Country Status (1)

Country Link
CN (1) CN103679165B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599857A (en) * 2016-12-20 2017-04-26 广东欧珀移动通信有限公司 Image identification method, apparatus, computer-readable storage medium and terminal device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5448474A (en) * 1993-03-03 1995-09-05 International Business Machines Corporation Method for isolation of Chinese words from connected Chinese text
CN1477559A (en) * 2002-08-23 2004-02-25 华为技术有限公司 Method for implementing long character string prefix matching
CN103186587A (en) * 2011-12-30 2013-07-03 牟颖 Method for quickly translating English word of book through mobile phone

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2144189A3 (en) * 2008-07-10 2014-03-05 Samsung Electronics Co., Ltd. Method for recognizing and translating characters in camera-based image

Also Published As

Publication number Publication date
CN103679165A (en) 2014-03-26

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant