CN102024139A

CN102024139A - Device and method for recognizing character strings

Info

Publication number: CN102024139A
Application number: CN2009101738708A
Authority: CN
Inventors: 白洪亮; 郑大念; 孙俊; 诹访美佐子; 武部浩明; 堀田悦伸; 于浩; 直井聪
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2009-09-18
Filing date: 2009-09-18
Publication date: 2011-04-20
Also published as: JP2011065646A

Abstract

The invention discloses a method for recognizing character strings. The method comprises the following steps of: partitioning a character string image into a plurality of fragments; performing optical character recognition (OCR) on the plurality of fragments so as to obtain candidate characters, wherein each fragment corresponds to at least one candidate character; acquiring the candidate characters of the fragments and/or the statistical information of the character combination of the candidate characters of the fragments; and determining candidate character strings according to the statistical information and the OCR recognition confidence of the candidate characters. Moreover, the invention also discloses a device for recognizing the character strings.

Description

Device and method for recognizing character strings

Technical field

The present invention relates to a character string recognition device and method, that is, a device and method for recognizing a character string from a character string image.

Background art

Today, recognizing all kinds of character information with OCR technology is commonplace. For example, a user writes a string of characters on paper or on a touch screen; the writing is converted into a character string image by scanning, photographing, sensing or the like; and this image is input to a recognition system, which recognizes and outputs the value of the string.

Various kinds of character strings occur in practice. For example, a character string may consist entirely of letters or of a mixture of letters and digits, and it may contain the separator "." or the separator " ".

Recognition of such information, and of its handwritten form in particular, is becoming increasingly important. With the rapid development of the Internet, such character strings are used more and more often when conveying information; for example, a user may handwrite an e-mail address on a touch screen. However, no particularly efficient method currently exists for recognizing such character strings.

Summary of the invention

An object of the present invention is to propose a method and a device for recognizing character strings. A brief summary of the invention is given below in order to provide a basic understanding of some of its aspects. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description given later.

According to one aspect of the present invention, a method for recognizing a character string is provided, comprising the following steps: dividing a character string image into a plurality of fragments; performing OCR on the plurality of fragments to obtain candidate characters, wherein each fragment corresponds to at least one candidate character; acquiring statistical information on the candidate characters of the fragments and/or on the character combinations formed by the candidate characters of the fragments; and determining a candidate character string by combining the statistical information with the OCR recognition confidence of the candidate characters.

According to another aspect of the present invention, a character string recognition device is provided, comprising: a fragment division module for dividing a character string image into a plurality of fragments; an OCR module for performing OCR on the plurality of fragments to obtain candidate characters, wherein each fragment corresponds to at least one candidate character; a statistical information acquisition module for acquiring statistical information on the candidate characters of the fragments and/or on the character combinations formed by the candidate characters of the fragments; and a first character string determination module for determining a candidate character string by combining the statistical information with the OCR recognition confidence of the candidate characters.

In addition, embodiments of the invention also provide a computer program for implementing the above character string recognition method.

In addition, embodiments of the invention also provide a computer program product, at least in the form of a computer-readable medium, on which computer program code for implementing the above character string recognition method is recorded.

Description of drawings

The invention can be better understood by referring to the description given below in conjunction with the accompanying drawings. The drawings, together with the following detailed description, are included in and form part of this specification, and serve to further illustrate preferred embodiments of the invention and to explain its principles and advantages. In the drawings:

Fig. 1 shows a schematic flowchart of the method according to the first embodiment of the present invention.

Fig. 2 shows a schematic flowchart of the method according to the second embodiment of the present invention.

Fig. 3 shows an OCR recognition result for a character string.

Fig. 4 shows, on the right, the result of screening the OCR recognition result and, on the left, the corresponding path diagram.

Fig. 5 shows common syllables composed of two characters and/or of three characters.

Fig. 6 shows training and statistical results for syllables.

Fig. 7 shows a schematic flowchart of the method according to the third embodiment of the present invention.

Fig. 8 shows statistics of triples formed from digits and letters.

Fig. 9 shows a schematic flowchart of the method according to the fourth embodiment of the present invention.

Fig. 10 shows a schematic flowchart of a method, according to the fifth embodiment of the present invention, for correcting the OCR recognition result using a known library.

Fig. 11 shows a schematic flowchart of the method according to the sixth embodiment of the present invention.

Fig. 12 shows a character string recognition device according to the seventh embodiment of the present invention.

Fig. 13 shows a character string recognition device according to the eighth embodiment of the present invention.

Fig. 14 shows a character string recognition device according to the ninth embodiment of the present invention.

Fig. 15 shows the separator recognition module of a character string recognition device according to the tenth embodiment of the present invention.

Fig. 16 shows a character string recognition device according to the eleventh embodiment of the present invention.

Fig. 17 shows a schematic block diagram of a computer that can be used to implement embodiments of the present invention.

Detailed description of the embodiments

Exemplary embodiments of the invention are described below with reference to the accompanying drawings. For the sake of clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual implementation, many implementation-specific decisions must be made to achieve the developer's specific goals, and these decisions may vary from one implementation to another. Moreover, although such development work may be complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.

It should also be noted that, to avoid obscuring the invention with unnecessary detail, the drawings show only the device structures closely related to the solution according to the invention and omit other details of little relevance to it.

The invention is described below by way of specific embodiments.

First embodiment

OCR by itself is a known technique. To facilitate the explanation and understanding of what follows, the OCR process is briefly described below.

The input character string image to be recognized may be a character string written by a user on paper or on a touch screen and converted into an image by scanning, photographing, sensing or the like. An example is the image of the character string "hanashiro" handwritten by a user on a touch screen.

Typically, after the input character string image is obtained, it is preprocessed, for example binarized so that an analog image, a color digital image, a grayscale image or the like is converted into a binary image. Connected component analysis is then performed. A connected component (also called a connected domain) is an image region formed by mutually adjacent pixels of the same kind. In a binary image, pixels of the same kind may be foreground pixels (for example black pixels) or background pixels (for example white pixels); in a grayscale image, they may be pixels whose gray level lies within a certain range. The connected components usually considered are four-neighborhood connected components, eight-neighborhood connected components, and so on. Since the concept of a connected component is well known to those of ordinary skill in the art, it is not described further here. Connected components can be found by various known methods, for example by using eight-neighborhood connected component analysis to find all connected components in the character image.

In eight-neighborhood connected component analysis, a foreground point is first found and taken as a seed. The unvisited foreground points in the eight-neighborhood of this seed are then searched and taken as new seed points, and the search is repeated recursively until no new seed point can be found. The search then ends, and all visited foreground points are output as one connected component. A new unvisited foreground point is then sought and taken as a seed, which yields another connected component, and so on until all points have been visited. For eight-neighborhood connected component analysis see, for example, Digital Image Processing (4th Edition), W. K. Pratt, John Wiley & Sons, Inc., 2007.
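The seed-growing search described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used by the applicant: it assumes the binary image is a list of rows with 1 for foreground and 0 for background, and it replaces the recursion with an explicit stack so that large regions do not exhaust the call stack. The function name `connected_components_8` and the demo image are hypothetical.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, column)


def connected_components_8(image: List[List[int]]) -> List[Set[Pixel]]:
    """Return the 8-connected foreground regions of a binary image (1 = foreground)."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    visited = [[False] * cols for _ in range(rows)]
    components: List[Set[Pixel]] = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or visited[r][c]:
                continue
            # Unvisited foreground point: use it as the seed of a new region.
            visited[r][c] = True
            stack = [(r, c)]
            region: Set[Pixel] = set()
            while stack:
                y, x = stack.pop()
                region.add((y, x))
                # Every unvisited foreground neighbour becomes a new seed.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not visited[ny][nx]):
                            visited[ny][nx] = True
                            stack.append((ny, nx))
            components.append(region)
    return components


# Hypothetical image: a diagonal pair (joined only under 8-connectivity)
# and a separate vertical pair.
demo_image = [
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
demo_components = connected_components_8(demo_image)
```

Under four-neighborhood connectivity the diagonal pair in the demo image would be split into two regions, which is why the choice of neighborhood matters for handwritten strokes.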

Features can then be extracted based on the result of the connected component analysis (character segmentation may also be performed), and a classifier is used to perform OCR.

However, relying on OCR alone does not give satisfactory results for character strings. The applicant has found that in many cases a character string follows certain combination regularities. These regularities can be obtained by statistics over a large number of character strings and exploited in character string recognition, which can markedly improve recognition accuracy.

One embodiment of the invention proposes a method that uses statistical information on character combinations together with OCR to determine the character string to be recognized.

In many cases the character string to be recognized is a common or frequently used one. Take the Japanese character string "hanashiro": statistics over a large number of Japanese words yield the probability that the character combination "na" follows once the combination "ha" has appeared. If the OCR result is combined with this statistical information during recognition, a better result should be obtained.

In step S110, the character string image is divided into a plurality of fragments. This segmentation step can be carried out using a variety of existing techniques. Taking "hanashiro" again as an example, segmentation and recognition can be based on the result of connected component analysis. The foreground connected components can first be cut into a number of adjacent fragments, where 1 to 3 adjacent fragments may form a single complete character. For this segmentation, a double-stranded elastic matching algorithm is usually used to search for all possible cut points, and dynamic programming is then used to find the best cut points. For the segmentation of connected components, see Danian Zheng, "Handwritten Email Address Recognition with Syntax and Lexicons", ICFHR 2008, and patent application CN200810080950.4 ("character information identification device and method").

In step S120, OCR is performed on the plurality of fragments to obtain candidate characters. Those skilled in the art will appreciate that recognizing a character image yields only the probability that the image is a certain character. For example, the character image "h" may be recognized by OCR as "h" or as "b", each with a different probability, also called the recognition confidence. Each fragment therefore corresponds to at least one candidate character.

In step S130, statistical information is acquired on the candidate characters of the fragments and/or on the character combinations formed by them. For the character combination "ha", for example, a large vocabulary can be analyzed in advance to obtain the probability that a word contains this combination. The probability that the combination "han" appears in the vocabulary can be obtained in the same way. The vocabulary used for the statistics can of course be restricted, for example to all company names or all personal names, to obtain the corresponding probabilities. If a character string being recognized is known to belong to a certain field or scope, the corresponding probabilities can be used to obtain a better result. Note that this statistical process can be completed in advance, so that step S130 only takes the statistics directly as input.

From this statistical information on each character combination, the following can be derived: given a certain character or character combination, what is the probability that a certain character or character combination follows it. For example, if the probability of "ha" and the probability of "han" are both known, the conditional probability formula P(B|A) = P(AB)/P(A) gives the probability that "n" follows once "ha" has appeared. Conversely, the probability that "h" appears before "an" can of course also be obtained.
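As a small worked illustration of P(B|A) = P(AB)/P(A), the Python sketch below estimates both probabilities as the fraction of vocabulary words that contain the combination, and then divides them. The five-word vocabulary and the function names are hypothetical and chosen only to keep the arithmetic easy to follow; a real system would use a large, domain-specific vocabulary.

```python
from typing import Iterable


def containing_probability(words: Iterable[str], combo: str) -> float:
    """Fraction of the words that contain the character combination `combo`."""
    word_list = list(words)
    if not word_list:
        raise ValueError("the vocabulary is empty")
    return sum(combo in w for w in word_list) / len(word_list)


def conditional_probability(words: Iterable[str], prefix: str, following: str) -> float:
    """Estimate P(following | prefix) = P(prefix + following) / P(prefix)."""
    word_list = list(words)
    p_a = containing_probability(word_list, prefix)
    if p_a == 0.0:
        raise ValueError(f"{prefix!r} never occurs in the vocabulary")
    p_ab = containing_probability(word_list, prefix + following)
    return p_ab / p_a


# Hypothetical vocabulary: "ha" occurs in 4 of 5 words, "han" in 2 of 5.
demo_vocabulary = ["hanashiro", "hanada", "hara", "hayashi", "tanaka"]
p_n_given_ha = conditional_probability(demo_vocabulary, "ha", "n")  # 0.4 / 0.8
```

With this toy vocabulary, half of the words that contain "ha" continue with "n", so the estimate is 0.5.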

In step S140, a candidate character string is determined by combining the statistical information with the OCR recognition confidence of the candidate characters. For example, among the several OCR candidate characters of one fragment, the candidate with the higher probability of occurrence in the current context can be selected according to the statistical information; or the statistical probability and the OCR recognition confidence can each be given a weight and the candidate selected on the combined value; or the statistical information can be used to select among the candidates whose OCR confidence exceeds a certain threshold; and so on.
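One of the options above, weighting the statistical probability against the OCR confidence, might look like the following Python sketch. The log-linear combination, the default weight of 0.5, the floor for unseen characters and all numeric values are assumptions made for illustration; the text does not prescribe a particular weighting scheme.

```python
import math
from typing import Dict


def pick_candidate(ocr_conf: Dict[str, float],
                   context_prob: Dict[str, float],
                   weight: float = 0.5,
                   floor: float = 1e-6) -> str:
    """Pick the candidate with the best weighted log-score.

    score = weight * log(context probability) + (1 - weight) * log(OCR confidence)
    OCR confidences must be positive; unseen characters get the `floor` probability.
    """
    def score(ch: str) -> float:
        stat = max(context_prob.get(ch, 0.0), floor)
        return weight * math.log(stat) + (1.0 - weight) * math.log(ocr_conf[ch])

    return max(ocr_conf, key=score)


# Hypothetical values: OCR slightly prefers "b", the context statistics prefer "h".
demo_ocr = {"b": 0.120, "h": 0.114}
demo_context = {"h": 0.30, "b": 0.02}
chosen = pick_candidate(demo_ocr, demo_context)                 # statistics and OCR
ocr_only = pick_candidate(demo_ocr, demo_context, weight=0.0)   # OCR confidence alone
```

In this example the combined score selects "h", whereas OCR confidence alone would select "b".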

Because the method of this embodiment uses statistical information on combinations of candidate characters from several fragments in addition to the OCR recognition confidence, it achieves higher recognition accuracy.

Note that the statistical information may include the probability that a character, character type and/or character combination appears together with at least one predetermined character, character type and/or character combination. For the character string "hanashiro123", for example, one can collect the probability that "h" appears with "a", that "ha" appears with "n", that "h" appears with "an", that "ha" appears with "na", that "ro" appears with "12", that letters and digits appear together, that several (for example three) digits appear consecutively, that several consecutive letters appear with several consecutive digits, and so on. The purpose in every case is to obtain, from the character, character type or character combination that appears earlier (or later), the probability of the character, character type or character combination that appears later (or earlier), and to use it in recognition.

When the statistical information concerns a character combination, the candidate characters of all the fragments covered by that combination can be determined together, in conjunction with the OCR confidence, in a manner similar to that described above. Alternatively, the probability concerning the character combination can be converted into probabilities concerning single characters before use.

Second embodiment

Building on the first embodiment, the inventors found that in any language, syllables are statistically associated with one another. Take Japanese user names as an example: they usually contain several syllables that are statistically associated, and this internal association can be exploited to achieve better recognition. The Japanese user name "hanashiro", for example, contains the four syllables "ha", "na", "shi" and "ro", so the character combinations in the above method can be built from these syllables. A character combination may comprise one syllable, such as "ha" or "na"; two syllables, such as "hana" or "nashi"; three syllables, such as "hanashi"; and so on. To recognize a character string, one can therefore look up the n-tuples formed by adjacent syllables in an n-gram dictionary: if the n-tuple formed by adjacent syllables is found, the syllable receives a higher score; otherwise the syllable has only its own OCR recognition probability. This syllable-based n-tuple method is described in detail below for the second embodiment, again using "hanashiro" as the example and taking the triple (trigram) method as the illustration; Fig. 2 shows the corresponding flowchart.

In step S210, as described for the first embodiment, the image of the character string "hanashiro" is first divided into a plurality of fragments; it is assumed here that the fragments correspond exactly to the images of "h", "a", "n", "a", and so on. In step S220, OCR is then performed on each fragment to determine the OCR recognition probabilities for each character of "hanashiro". Fig. 3 shows this recognition result, in which each row lists the OCR recognition probabilities of the corresponding character. For the first image fragment, for example, the OCR probability of "h" is 0.114, that of "n" is 0.101, that of "k" is 0.101, that of "m" is 0.074, and so on. For the second image fragment, the probability of "a" is 0.132, that of "u" is 0.082, and so on.

In step S230, the OCR recognition result of each character is screened. The principle is to filter out recognition results with low confidence and keep those with high confidence. For example, the condition $Cer_i / \max(Cer_i) > T$ can be used, where $i$ is the index of the recognition result; the example result in Fig. 3 lists 10 candidate characters for each character, so $i$ runs from 1 to 10 here. $Cer_i$ is the OCR probability of the $i$-th candidate character and $T$ is a threshold, which can be set to 0.75, for example. The condition means that if the confidence of a candidate character is too small relative to the maximum confidence among all candidate characters, the character to be recognized is considered unlikely to be that candidate, and the candidate is ignored in subsequent calculations, which greatly reduces the amount of computation. This gives the screening result shown on the right of Fig. 4. As the figure shows, three candidate characters "h", "n" and "k" remain for the first character, only one candidate "a" remains for the second character, three candidates "n", "h" and "m" remain for the third character, and so on. Other screening methods can of course be used, for example screening with a fixed threshold that is independent of the maximum confidence, or simply keeping a predetermined number of candidate characters with the highest confidence.
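The relative-confidence screening can be sketched in a few lines of Python. The four confidences are the values quoted above for the first image fragment, and T = 0.75 is the example threshold from the text; the function name `screen_candidates` and the dictionary layout are assumptions.

```python
from typing import Dict


def screen_candidates(confidences: Dict[str, float],
                      threshold: float = 0.75) -> Dict[str, float]:
    """Keep the candidates whose confidence exceeds `threshold` times the best one."""
    if not confidences:
        return {}
    best = max(confidences.values())
    if best <= 0.0:
        return {}
    return {ch: c for ch, c in confidences.items() if c / best > threshold}


# Confidences of the first fragment as quoted from Fig. 3 (only four of the ten shown).
first_fragment = {"h": 0.114, "n": 0.101, "k": 0.101, "m": 0.074}
kept = screen_candidates(first_fragment)
```

Here 0.101 / 0.114 is about 0.89 and passes the threshold, while 0.074 / 0.114 is about 0.65 and is dropped, which leaves "h", "n" and "k" as stated for the first character.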

In step S240, the candidate characters are combined, according to the spelling rules of Japanese, into syllables of two characters and/or syllables of three characters. For this purpose a large number of Japanese user names were analyzed in advance, giving results such as those in Fig. 5, where the left side shows common two-character syllables and the right side shows common three-character syllables. Since the candidates for the first character are "h", "n" and "k" and the candidate for the second character is "a", the first candidate syllable is "ha", "na" or "ka", and so on for the following characters. Candidate characters that cannot be assigned to any syllable remain single characters. For the fifth to seventh characters, for example, the candidate characters are "s", "h"/"k" and "i" respectively, and they can be combined as "shi", "s"-"hi" or "s"-"ki".
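A minimal Python sketch of this combination step is given below. It assumes the screened candidates are held as one list per character position and that the common syllables are available as a set; the small syllable table is a hypothetical excerpt built from the syllables named in the text, not the content of Fig. 5.

```python
from itertools import product
from typing import List, Sequence, Set


def candidate_syllables(candidates: Sequence[Sequence[str]],
                        start: int,
                        length: int,
                        syllable_table: Set[str]) -> List[str]:
    """List the known syllables of `length` characters that start at position `start`."""
    chunk = candidates[start:start + length]
    if len(chunk) < length:
        return []
    # Try every combination of one candidate per position and keep the known syllables.
    combos = ("".join(chars) for chars in product(*chunk))
    return [s for s in combos if s in syllable_table]


# Hypothetical excerpt of a syllable table.
demo_table = {"ha", "na", "ka", "hi", "ki", "shi", "ro"}

head = [["h", "n", "k"], ["a"]]        # candidates of characters 1 and 2
tail = [["s"], ["h", "k"], ["i"]]      # candidates of characters 5 to 7

first_syllables = candidate_syllables(head, 0, 2, demo_table)
three_char = candidate_syllables(tail, 0, 3, demo_table)
two_char = candidate_syllables(tail, 1, 2, demo_table)
```

The results reproduce the combinations named above: "ha", "na" and "ka" for the first two characters, and either "shi" or a single "s" followed by "hi" or "ki" for the fifth to seventh characters.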

In step S250, all possible paths (that is, all possible combinations of candidate characters) are enumerated by syllable on the basis of the above result. Because the "syllable" or "character" at each node of a path may have several candidate recognition results, each path can yield several candidate character strings of equal length. The left side of Fig. 4 shows the diagram containing all paths used for recognizing "hanashiro".

Next, in step S260, a score is calculated for each node in Fig. 4 by combining the statistical information with the OCR recognition confidence. Since the first node has no preceding node, its score is calculated directly from the probability of its occurring on its own, in the statistical sense, and its OCR probability. From the second node onward, each node is statistically associated with the preceding nodes, so its score is calculated from the score of the preceding node, the probability that the current node appears given the preceding nodes, and the OCR probability of the current node. In this way a score can be calculated for every node.

For example, the following formula is used here to calculate the score of each node in each path:

$$\mathrm{Score}(S_p) = \mathrm{Score}(S_{p-1}) + \log P_{nlp}(S_p \mid h_{p-1}) + \log P_{ocr}(S_p) \tag{1}$$

$S$ in the formula denotes a syllable, which is composed of characters, and $p$ denotes the index of the syllable; for the first syllable "ha", for example, $p = 1$. The first term, $\mathrm{Score}(S_{p-1})$, is the score of the node preceding the current node in the path; for the first node in a path there is no preceding node, so this term is zero. In the second term, $P_{nlp}(S_p \mid h_{p-1})$ is the statistically obtained probability that the current syllable occurs in natural language given the history $h_{p-1}$ (that is, the probability that the current node appears given the preceding nodes). For the first node in a path there is no preceding node and hence no history, so this term is simply the probability that the first node occurs in natural language. $P_{ocr}(S_p)$ is the OCR probability of the syllable, obtained for example by multiplying the OCR probabilities of the characters that form the syllable. The score of the final node of each path calculated in this way is the score of that path, and the path with the highest score can be chosen as the character string recognition result with the highest confidence. Note that the formula above uses the logarithms of the probabilities and combines the previous node's score, the natural language probability and the OCR probability by addition; those of ordinary skill in the art will understand, however, that the probabilities and the previous node's score may be combined in any mathematical form, as long as the score of each node and each path takes all of these factors into account.
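The recurrence of formula (1) can be written as a short Python function. This is a sketch under stated assumptions: the natural-language model is passed in as a callable that returns a base-10 log-probability for a syllable given a tuple of up to two preceding syllables, the OCR probability of a syllable is taken as the product of its character probabilities, and the lookup table, the fallback value of -5.0 and all numbers are hypothetical.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Node = Tuple[str, Sequence[float]]  # (syllable, OCR probabilities of its characters)


def score_path(path: Sequence[Node],
               nlp_logprob: Callable[[str, Tuple[str, ...]], float],
               history_size: int = 2) -> float:
    """Accumulate Score(S_p) = Score(S_{p-1}) + log P_nlp(S_p | h_{p-1}) + log P_ocr(S_p)."""
    score = 0.0  # no node precedes the first one, so the initial score is zero
    for p, (syllable, char_probs) in enumerate(path):
        # History h_{p-1}: up to `history_size` preceding syllables (empty for the first node).
        history = tuple(s for s, _ in path[max(0, p - history_size):p])
        # log of the product of the character OCR probabilities
        ocr_logprob = sum(math.log10(q) for q in char_probs)
        score += nlp_logprob(syllable, history) + ocr_logprob
    return score


# Hypothetical log10 language-model table keyed by (syllable, history).
demo_table: Dict[Tuple[str, Tuple[str, ...]], float] = {
    ("ha", ()): -1.0,
    ("na", ("ha",)): -0.5,
}


def demo_logprob(syllable: str, history: Tuple[str, ...]) -> float:
    return demo_table.get((syllable, history), -5.0)  # -5.0: assumed value for unseen n-grams


demo_path = [("ha", [0.1, 0.1]), ("na", [0.1, 0.1])]
demo_score = score_path(demo_path, demo_logprob)
```

For the demo path the two nodes contribute -1.0 - 2.0 and -0.5 - 2.0, so the path score is -5.5.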

The probability of each syllable on its own and of several syllables together, that is, $P_{nlp}$ above, can be obtained in a number of ways and used directly as an external input to the method of this embodiment. For example, the SRILM toolkit can be used to train on and collect statistics from a large number of relevant character strings, yielding the probability of each syllable on its own and of several syllables together. Fig. 6 shows the training and statistical results for syllables, for which several hundred thousand valid Japanese e-mail address user names were used. The statistics comprise the probabilities of single syllables (1-grams), syllable pairs (2-grams) and syllable triples (3-grams); the values are base-10 logarithms of the probabilities and are referred to below simply as probabilities. For brevity, only the detailed statistics for syllable triples are shown here. As can be seen, the probability of "ka-ha-ta" occurring in a user name is -1.873356, that of "ke-ha-ra" is -0.001828611, and so on. The SRILM toolkit used here as an example is a language statistics tool well known to those skilled in the art; see http://www.speech.sri.com/projects/srilm/ for details, which are not discussed further here. In this embodiment the probability values in these statistics are used directly in the calculation.

For the first node "ha" of the first path "ha-na-shi-ro" in Fig. 4, the score $\mathrm{Score}(S_1)$ of the first node is calculated. The natural language probability and the OCR probability of the syllable "ha" could be used in various ways here, for example combined with certain weights; equation (1) is used as the example. Since $p = 1$ and there is no preceding node, the equation adds the natural language probability and the OCR probability of "ha" (both as logarithms) to give $\mathrm{Score}(S_1)$. For the second node of the path, the syllable "na", the score depends on the score of the preceding node, the probability that the current node appears given the preceding node, and the OCR probability of the current node; again various combination or calculation rules could be used, and equation (1) is again the example. According to the equation, the score of this syllable depends not only on the current syllable but also on the preceding syllable "ha". The conditional probability $\log P_{nlp}(S_p \mid h_{p-1})$ can easily be calculated from the SRILM statistics. The idea is that statistics over a large number of examples capture the inherent regularities of the language, so that one can determine, for example, the probability that the current syllable is "na" if the preceding syllable is "ha". Note that $h_{p-1}$ does not mean that only the history of the previous syllable is used; the history of several preceding syllables can be used. In the present method, from the third syllable onward the history of the two preceding syllables is used, which is why the method is called the syllable-based triple method. Statistics over more syllables could of course also be collected and used in the calculation. $\mathrm{Score}(S_{p-1})$ denotes the score of the node preceding the current node on the path in question; in the present invention only the score of the single preceding node is used.

Using the above formula, the score of each syllable in the first path can be calculated, giving the score of the first path. The scores of the second path, for example "ha-ha-shi-ro", the third path, for example "na-na-shi-ro", and so on are obtained in the same way, and the path with the highest score is determined. In step S270, the recognized candidate character string with the highest probability can thus be determined.
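Selecting the highest-scoring path can be sketched as an exhaustive enumeration in Python. The scoring function below is deliberately simplified for brevity: it conditions on only one preceding syllable (a bigram history) rather than the two used by the triple method, and every probability in it is hypothetical. Exhaustive enumeration is practical here only because screening leaves few candidates per node.

```python
import math
from itertools import product
from typing import Callable, Dict, Optional, Sequence, Tuple

Node = Tuple[str, float]  # (candidate syllable, OCR probability of that syllable)


def best_path(nodes: Sequence[Sequence[Node]],
              score_fn: Callable[[Tuple[Node, ...]], float]) -> Tuple[Node, ...]:
    """Enumerate one candidate per node and return the path with the highest score."""
    return max(product(*nodes), key=score_fn)


# Hypothetical log10 bigram statistics; None marks the start of the string.
demo_bigrams: Dict[Tuple[Optional[str], str], float] = {
    (None, "ha"): -1.0, (None, "na"): -1.2,
    ("ha", "na"): -0.5, ("na", "na"): -2.0,
    ("ha", "ha"): -2.5, ("na", "ha"): -1.5,
    ("na", "shi"): -0.7, ("ha", "shi"): -1.0,
    ("shi", "ro"): -0.6,
}


def demo_score(path: Tuple[Node, ...]) -> float:
    total, previous = 0.0, None
    for syllable, ocr_prob in path:
        # -5.0 is an assumed score for syllable pairs missing from the table.
        total += demo_bigrams.get((previous, syllable), -5.0) + math.log10(ocr_prob)
        previous = syllable
    return total


demo_nodes = [
    [("ha", 0.015), ("na", 0.013)],
    [("na", 0.012), ("ha", 0.011)],
    [("shi", 0.020)],
    [("ro", 0.030)],
]
winner = [syllable for syllable, _ in best_path(demo_nodes, demo_score)]
```

With these numbers the path "ha-na-shi-ro" wins, because its language-model terms are much larger than those of "ha-ha-shi-ro", "na-na-shi-ro" and "na-ha-shi-ro" while the OCR terms differ only slightly.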

Because the above embodiment divides the character string to be recognized into syllables and uses the statistically derived probabilities with which these syllables are associated in natural language, recognition accuracy is greatly improved.

Third embodiment

In practice, character strings that mix letters and digits, such as "hanashiro123", also occur frequently. The prior art has no particularly effective recognition method for this case. The inventors found that it often arises in user-defined names, for example e-mail user names. Statistics over a large number of such strings reveal certain regularities in the combinations of letters and digits. For example, one can obtain the probability that the following b characters are letters or digits if the preceding a characters are letters, or the probability that the following d characters are letters or digits if the preceding c characters are digits (a, b, c and d being natural numbers). Based on the idea of the first embodiment of the invention, a character-based n-tuple method is therefore proposed to solve this problem. Fig. 7 shows the corresponding flowchart, taking the triple method as the example.
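The kind of letter and digit statistics described above can be illustrated with the simplest case, a = b = c = d = 1: the probability that a letter or a digit follows a letter or a digit. The Python sketch below counts adjacent character-type pairs in a list of strings; the type labels "L", "D" and "O", the function names and the two sample strings are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def char_type(ch: str) -> str:
    """Map a character to a coarse type: D (digit), L (letter) or O (other)."""
    if ch.isdigit():
        return "D"
    if ch.isalpha():
        return "L"
    return "O"


def type_transition_probs(strings: Iterable[str]) -> Dict[Tuple[str, str], float]:
    """Estimate P(type of next character | type of current character)."""
    pair_counts: Counter = Counter()
    first_counts: Counter = Counter()
    for s in strings:
        types = [char_type(c) for c in s]
        for current, following in zip(types, types[1:]):
            pair_counts[(current, following)] += 1
            first_counts[current] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# Hypothetical training strings.
demo_probs = type_transition_probs(["ab12", "a1"])
```

In the two sample strings a letter is followed by a digit in two of three cases and a digit is always followed by a digit; longer histories (larger a, b, c, d) would be counted in the same way over longer windows.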

At first, in step S710, the character string picture with input is divided into a plurality of fragments equally.

In step S720, at first carry out OCR identification, to determine the OCR identification probability at each fragment in the character string of this letter and number mixing.Similar shown in Figure 3, can obtain a series of probability equally at this at each character.

In step S730, the OCR recognition results of each character are screened. The principle is to discard results with low recognition confidence and keep those with high recognition confidence. For example, the criterion Cer_i / Max(Cer_i) > T can be used, where i is the index of the recognition result. Taking the recognition result shown in Fig. 3 as an example, if 10 candidate characters are listed for each character, i runs from 1 to 10. Cer_i is the OCR probability of the i-th candidate character and T is a threshold, which can be set to 0.75, for example. The criterion states that if the confidence of a candidate character is too small relative to the maximum confidence among all candidate characters, the character to be recognized is considered not to be this candidate, and the candidate is ignored in subsequent calculations, which greatly reduces the amount of computation. In this way, a screening result similar to that shown on the right of Fig. 4 is obtained. Other screening schemes can of course be used, for example a fixed threshold that is independent of the maximum confidence, or simply keeping a predetermined number of candidates with the highest confidence.
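The screening rule of step S730 can be sketched in a few lines of Python. This is only an illustration: the function name `screen_candidates`, the list-of-pairs data layout and the sample confidences are assumptions, not part of the described method.

```python
def screen_candidates(candidates, threshold=0.75):
    """Keep candidates whose confidence, relative to the best one, exceeds threshold.

    candidates: list of (character, confidence) pairs for one fragment
    (hypothetical layout chosen for this sketch).
    """
    best = max(conf for _, conf in candidates)
    if best <= 0:
        return []
    # Criterion from step S730: Cer_i / Max(Cer_i) > T
    return [(ch, conf) for ch, conf in candidates if conf / best > threshold]


# Invented confidences for one fragment, for illustration only.
fragment = [("h", 0.80), ("b", 0.65), ("k", 0.30), ("n", 0.10)]
kept = screen_candidates(fragment)
print(kept)
```

With T = 0.75, only "h" and "b" survive here, because 0.65 / 0.80 is above the threshold while 0.30 / 0.80 is not.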

In step S740, all possible paths are enumerated based on the screening result. Because the character at each node of a path can have several candidate recognition results, the candidates combine into several recognized words of equal length. This yields a path graph similar to that in Fig. 4.

In step S750, the score of each node in the path graph is calculated by combining the statistical information with the OCR recognition confidence. The following formula is used:
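As a minimal sketch of the enumeration in step S740, the candidate lists that remain after screening can be combined with a Cartesian product. The helper name `enumerate_paths` and the sample candidates are assumptions made for illustration.

```python
from itertools import product


def enumerate_paths(screened):
    """Return every equal-length string that the per-fragment candidates can form.

    screened: one list of candidate characters per fragment (hypothetical layout).
    """
    return ["".join(chars) for chars in product(*screened)]


# Invented screening result for three fragments.
screened = [["h", "b"], ["a"], ["n", "h"]]
paths = enumerate_paths(screened)
print(paths)
```

The number of paths is the product of the candidate counts per fragment, which is why the screening in step S730 matters for the amount of computation.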

$$\mathrm{Score}(C_p) = \mathrm{Score}(C_{p-1}) + \log P_{nlp}(C_p \mid h_{p-1}) + \log P_{ocr}(C_p) \qquad (2)$$

The difference from formula (1) is that formula (1) is computed on the probabilities of syllables S, whereas formula (2) is computed on the probabilities of characters or character combinations C. As the formula shows, the method proposed by the present invention for recognizing mixed letter-and-digit character strings likewise uses the statistically obtained natural-language probability P_nlp, the OCR recognition probability P_ocr and the history information h_{p-1}.

Before the nodes are scored, the SRILM toolkit can again be used to train on and collect statistics over a large number of similar character strings, yielding the probability of each character occurring on its own and the probabilities of the various orderings of letters and digits. There are likewise several existing ways to obtain these probabilities, and the present invention only needs to use the statistical result directly. For the reader's convenience, however, the training and statistics process is explained below.

First, a large number of e-mail address user names are selected as the sample set. For example, the applicant used 696818 valid e-mail address user names for the statistics in the present invention. Taking mp2003@dokkyomed.ac.jp as an example, the user name "mp2003" is extracted.

Next, the following replacements are applied to all user names:

- each letter in the user name is replaced with "a"; and

- each digit in the user name is replaced with "0".

For example, "mp2003" becomes "aa0000".
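The extraction and replacement steps above can be sketched as follows. The function names are illustrative, and leaving characters other than ASCII letters and digits unchanged is an assumption, since the text does not say how such characters are treated.

```python
import re


def extract_user_name(address):
    """Return the part of an e-mail address before the first "@"."""
    return address.split("@", 1)[0]


def mask_user_name(user_name):
    """Replace every ASCII letter with "a" and every digit with "0".

    Other characters are left unchanged (an assumption of this sketch).
    """
    masked = re.sub(r"[A-Za-z]", "a", user_name)
    return re.sub(r"[0-9]", "0", masked)


user = extract_user_name("mp2003@dokkyomed.ac.jp")
print(user, mask_user_name(user))
```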

Next, the SRILM toolkit is used to perform a statistical analysis of all samples.

The statistical results include unigram, bigram and trigram statistics, and so on. Fig. 8 shows only the trigram statistics, where "<s>" denotes the beginning and "</s>" the end of a string. The statistics show, for example, that the log probability of three consecutive digits ("000") is -0.4126034, that the log probability of two digits followed by the end of the string ("00</s>") is -0.4345168, and so on.

From these statistics, conditional probabilities such as the following can be obtained: if the first character is a letter or digit, the probability that the second character is a letter or digit; if the first character is a letter or digit and the second character is a letter or digit, the probability that the third character is a letter or digit; and so on. This gives the term log P_nlp(C_p | h_{p-1}) in formula (2). As explained for formula (1), h_{p-1} does not mean that only the history of the immediately preceding character is used; the history of several preceding characters can be used. In the present method, starting from the third character, the history of the two preceding characters is used, which is why the method is called a character-based trigram method. Those skilled in the art will readily see that the history of more preceding characters can also be used; the principle remains that proposed by the present invention.

The remaining calculations according to formula (2) are similar to those according to formula (1) and are not described in detail here.

Finally, in step S760, the score of each character is obtained and the path with the highest score is determined, which gives the candidate character string with the highest probability.
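To illustrate how such conditional probabilities arise from counts, the following sketch estimates trigram probabilities over masked user names by plain relative frequency. It is not the SRILM procedure: SRILM applies discounting and back-off, so its values (such as those in Fig. 8) will differ. The function names, the sample names and the use of base-10 logarithms are assumptions of this sketch.

```python
import math
from collections import Counter


def trigram_counts(masked_names):
    """Count trigrams and their two-symbol histories, with <s> and </s> markers."""
    tri, hist = Counter(), Counter()
    for name in masked_names:
        tokens = ["<s>"] + list(name) + ["</s>"]
        for k in range(len(tokens) - 2):
            tri[tuple(tokens[k:k + 3])] += 1
            hist[tuple(tokens[k:k + 2])] += 1
    return tri, hist


def log10_cond_prob(tri, hist, history, symbol):
    """Unsmoothed estimate of log10 P(symbol | history) from the counts."""
    num, den = tri[history + (symbol,)], hist[history]
    if num == 0 or den == 0:
        return float("-inf")
    return math.log10(num / den)


# Invented masked samples, for illustration only.
tri, hist = trigram_counts(["aa0000", "aaa00", "a000"])
print(log10_cond_prob(tri, hist, ("0", "0"), "0"))
```

In these three samples the history "00" occurs six times and is followed by another digit three times, so the estimate is log10(0.5).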
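Steps S740 to S760 can be sketched together as an exhaustive search that scores every path with formula (2). Everything concrete here is an assumption made for illustration: the toy language model `toy_nlp_logprob` stands in for the trained trigram statistics, the OCR confidences are invented, both logarithms are taken to base 10, and the string start is padded with "<s>" so that the first two characters also have a history.

```python
import math
from itertools import product


def char_class(ch):
    """Map a character to its class symbol: "0" for digits, "a" otherwise."""
    return "0" if ch.isdigit() else "a"


def toy_nlp_logprob(history, cls):
    """Hypothetical stand-in for log P_nlp: staying in the same class is likelier."""
    prev = history[1]
    return math.log10(0.8) if prev in ("<s>", cls) else math.log10(0.2)


def path_score(path, nlp_logprob):
    """Accumulate formula (2) along one path of (character, confidence) pairs."""
    score, history = 0.0, ("<s>", "<s>")
    for ch, p_ocr in path:
        cls = char_class(ch)
        score += nlp_logprob(history, cls) + math.log10(p_ocr)
        history = (history[1], cls)
    return score


def best_string(screened, nlp_logprob):
    """Enumerate all paths and return the string with the highest score."""
    best = max(product(*screened), key=lambda p: path_score(p, nlp_logprob))
    return "".join(ch for ch, _ in best)


# Invented screening result: the fourth fragment is ambiguous between "o" and "0".
screened = [[("m", 0.9)], [("p", 0.9)], [("2", 0.9)],
            [("o", 0.55), ("0", 0.45)], [("0", 0.9)], [("3", 0.9)]]
result = best_string(screened, toy_nlp_logprob)
print(result)
```

In this example OCR alone slightly prefers "o", but the statistical term favors a run of digits, so the combined score selects "mp2003".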

The above embodiment uses character-based trigrams, that is, the statistically derived probabilities with which particular combinations of letters and digits occur. In other words, it exploits the naming habits of a large number of users, and thereby clearly improves the recognition accuracy for character strings that mix digits and letters. Note that the method of this embodiment only collects probabilities for combinations of generic digits and generic letters. Those skilled in the art will readily see that probabilities can also be collected for specific digits and specific letters. For example, statistics can be collected for the combinations "ab1" and "12b", rather than only for general patterns such as two letters followed by a digit or two digits followed by a letter. The principle does not depart from the idea proposed by the present invention.

The 4th embodiment

The applicant found that the separator "." occurs more and more frequently in the character strings to be recognized. With the increasingly widespread use of the Internet, both web addresses and e-mail addresses contain this separator: it separates the levels of a domain name, and it occurs frequently in e-mail addresses. Recognizing this separator is therefore becoming more and more important. According to the fourth embodiment of the present invention, a method for recognizing this separator is proposed. Fig. 9 shows the steps of this method.

In step S910, connected component (connected domain) analysis is performed on the input character string image. From this analysis, the parameters of each connected component CC can be obtained, for example the coordinates of its position, the number of pixels it contains, and so on.

In step S920, a threshold for recognizing the separator is determined. For example, the connected components with the smallest pixel counts can be selected and the mean of their pixel counts calculated, and the threshold is set from this mean. Here, for example, the three smallest connected components are used, giving the mean Av3, and the threshold is set to T1 = α·Av3. α is a parameter that can be adjusted as circumstances require in order to achieve the best recognition result; for example, α = 3 can be chosen. The threshold can of course be determined in other ways. For example, it can be determined directly from the result of training on a large number of samples, or from the result of training on samples from the current user, and so on.

Then, in step S930, it is judged whether the pixel count of each connected component is less than T1. If it is, the connected component is regarded as a candidate separator ".". Because writing styles differ between users, the size of the separator "." may vary considerably from one writer to another. Computing the threshold from the mean pixel count of the smallest connected components takes this difference in writing style into account, so that the separator "." is recognized in an adaptive manner.

Because the characters "i" and "j" also contain a dot, step S940 further judges whether the identified "." lies at the bottom of the character line. If so, the dot is determined to be a separator. This judgment can be made from the coordinate parameters of each connected component.
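Steps S920 to S940 can be sketched as follows, assuming the connected component analysis of step S910 has already produced a pixel count and a bounding box for each component. The dictionary layout, the function name and in particular the `bottom_fraction` rule for deciding what counts as the bottom of the character line are assumptions of this sketch; the text only says that the decision is based on coordinates.

```python
def find_dot_separators(components, alpha=3.0, n_smallest=3, bottom_fraction=0.3):
    """Return indices of components judged to be "." separators.

    components: list of dicts with "pixels", "top" and "bottom"
    (image coordinates, y grows downward; hypothetical layout).
    """
    # S920: adaptive threshold T1 = alpha * mean pixel count of the smallest components.
    smallest = sorted(c["pixels"] for c in components)[:n_smallest]
    t1 = alpha * sum(smallest) / len(smallest)

    # S940 (assumed rule): a dot must lie within the lowest part of the text line.
    line_top = min(c["top"] for c in components)
    line_bottom = max(c["bottom"] for c in components)
    cutoff = line_bottom - bottom_fraction * (line_bottom - line_top)

    return [idx for idx, c in enumerate(components)
            if c["pixels"] < t1 and c["top"] >= cutoff]  # S930 and S940


# Invented measurements: index 1 is the dot of an "i", indices 3 and 5 are ".".
components = [
    {"pixels": 60, "top": 12, "bottom": 40},
    {"pixels": 8, "top": 0, "bottom": 5},
    {"pixels": 90, "top": 2, "bottom": 40},
    {"pixels": 9, "top": 35, "bottom": 40},
    {"pixels": 80, "top": 15, "bottom": 40},
    {"pixels": 10, "top": 36, "bottom": 40},
]
dots = find_dot_separators(components)
print(dots)
```

Here Av3 is 9 and T1 is 27, so three components are small enough to be candidates, and the position check then rejects the dot of the "i".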

The 5th embodiment

The applicant noticed that if the character string to be recognized is known to belong to a certain database, the similarity between its candidate character strings and the character strings in the predefined database can be used to further improve the recognition accuracy. For example, information obtained from outside (such as information provided by the user or the environment in which this embodiment is applied) or the structural features of the character string itself can be used to estimate or establish that the character string belongs to a certain database (for example person names, company names, names of universities and research institutions, and so on). For instance, if the character string to be recognized is "fujitsu", and it is known by other means, for example from information provided by the user, that the string is the name of a Japanese company, a predefined database of Japanese company names can be used: the similarity between each string in the database and the candidate character strings is calculated, the string "fujitsu" in the database is found to have the highest similarity, and the input is therefore recognized as "fujitsu". This embodiment is described in detail below.

Because there are several candidate character strings for the character string to be recognized, and each character in each candidate string has a certain confidence, the method of this embodiment uses the OCR recognition confidences of the candidate character strings when calculating the similarity between a candidate character string and a character string in the database.

Fig. 10 shows the flowchart of the method according to this embodiment.

In step S1010, character strings similar to the OCR recognition result are searched for in the predefined database. Many existing techniques are available for comparing and searching character strings. For example, a TDAG (ternary directed acyclic graph) can be used here. TDAG is a method for finding character strings in a database that are similar to a given string; it is known to those skilled in the art and is not described in detail here.

In step S1020, the similarity is calculated between several results of the OCR recognition and the similar strings (for example special domain names) found in the database in step S1010. The LD (Levenshtein distance) algorithm, for example, can be used for this. For efficiency, the comparison need not cover all candidate recognition results; for example, only the two candidate character strings with the highest OCR recognition probability may be used. One character string can be turned into another by inserting, deleting and substituting characters. If character string A is converted into character string B, the minimum number of these three operations required is called the LD distance between A and B, and dividing the LD distance by the length of the character string gives the similarity of A and B. In the prior art, the standard formula for the LD distance is as follows:

$$LD(i, j) = \min \begin{cases} LD(i-1, j) + 1 \\ LD(i, j-1) + 1 \\ LD(i-1, j-1) + COST \end{cases} \qquad (3)$$

Wherein:

$$COST = \begin{cases} 0, & C(j) \in CG(i) \\ 1, & C(j) \notin CG(i) \end{cases} \qquad (4)$$

In formula (3), LD(i, j) denotes the distance between the i-th character C(i) of the first character string (which can correspond to a candidate character string of the OCR recognition), or in other words the partial string up to the i-th character, and the j-th character C(j) of the second character string (which can correspond to the compared string in the database), or in other words the partial string up to the j-th character; i and j are natural numbers. When i and j index the last characters of the first and second character strings, LD(i, j) is the distance between the two strings. The first line of formula (3) states that if the i-th character of the first string is deleted before the comparison with the second string, the distance between the first string (up to the i-th character) and the second string is LD(i-1, j) plus 1. The second line is the mirror image of the first: if the j-th character of the second string is deleted before the comparison with the first string, the distance between the second string (up to the j-th character) and the first string is LD(i, j-1) plus 1. Equivalently, the second line corresponds to inserting a character into the first string. The third line states that if the i-th character of the first string, or the j-th character of the second string, is replaced by another character before the comparison, the distance is LD(i-1, j-1) plus COST.

There are several ways of modifying the first character string to obtain the second. Each way yields a value of LD(i, j), and the minimum of these values is taken as the final LD value.

In formula (4), C(j) denotes the j-th character of the second character string, and CG(i) denotes the set of all OCR candidate characters for the image fragment corresponding to the i-th character of the first character string. Formula (4) means that if the j-th character of the compared string in the database (the second string), which is the character the i-th character of the first string would be replaced with, is in the candidate set of the image fragment corresponding to the i-th character of the first string, the substitution cost is 0; otherwise the cost is 1.

The shortcoming of this prior art is that it only considers whether a character is present in the OCR candidate set and ignores the OCR recognition confidence of that character. As a result, the OCR-recognized string may have the same distance to several strings in the database (or, conversely, several candidate strings may have the same distance to one string in the database), which makes the distance less useful for refining the recognition.

To address this problem, this embodiment of the invention improves the COST function. Specifically, the distance of the character string corresponding to a given candidate character is reduced by a value corresponding to the recognition confidence with which the respective fragment is recognized as the character at the corresponding position of the compared string in the database. For example, the COST function can be modified to:
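The recurrence of formula (3) with the set-membership cost of formula (4) can be sketched as a small dynamic program. The boundary values LD(i, 0) = i and LD(0, j) = j are the usual Levenshtein initialization and are an assumption here, since the text does not spell them out; the candidate sets are invented.

```python
def ld_distance(candidate_sets, target):
    """LD distance between an OCR result, given as one candidate set per fragment,
    and a database string, using the cost of formula (4)."""
    m, n = len(candidate_sets), len(target)
    ld = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        ld[i][0] = i  # delete all i fragments (assumed boundary condition)
    for j in range(1, n + 1):
        ld[0][j] = j  # insert all j characters (assumed boundary condition)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Formula (4): cost 0 if C(j) is among the candidates CG(i), else 1.
            cost = 0 if target[j - 1] in candidate_sets[i - 1] else 1
            ld[i][j] = min(ld[i - 1][j] + 1,         # deletion
                           ld[i][j - 1] + 1,         # insertion
                           ld[i - 1][j - 1] + cost)  # substitution
    return ld[m][n]


# Invented candidate sets for the seven fragments of a handwritten "fujitsu".
cg = [{"f", "t"}, {"u", "v"}, {"j", "i"}, {"i", "l"}, {"t", "f"}, {"s", "5"}, {"u", "v"}]
print(ld_distance(cg, "fujitsu"), ld_distance(cg, "fujita"))
```

Note that "fuiitsu" would also obtain distance 0 with these candidate sets, which is exactly the kind of tie that motivates taking the confidences into account.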

$$COST = \begin{cases} -p(C(j)), & C(j) \in CG(i) \\ 1, & C(j) \notin CG(i) \end{cases} \qquad (5)$$

The upper case of this COST function has the following meaning. (1) If the i-th character C(i) of the OCR-recognized first string, which is some candidate character CG(i, k) in CG(i) (k being a natural number not greater than |CG(i)|), is identical to the j-th character C(j) of the second string in the database, that is, C(i) = CG(i, k) = C(j), the first string is considered more likely to be the second string in the database, so the corresponding distance is reduced by the OCR recognition confidence p(CG(i, k)) of this candidate character, that is, by p(C(j)) or equivalently p(C(i)). (2) If the i-th character C(i) of the first string, which is some candidate character CG(i, k) in CG(i), differs from the j-th character of the second string in the database, that is, C(i) = CG(i, k) ≠ C(j), but the j-th character of the second string is nevertheless contained in CG(i), that is, C(j) ∈ CG(i), the corresponding distance is reduced by p(C(j)), the recognition confidence with which the respective fragment is recognized as the character at the corresponding position of the compared second string (namely C(j)). Cases (1) and (2) are both instances of C(j) ∈ CG(i), because in case (1) C(j) = CG(i, k) is itself an instance of C(j) ∈ CG(i). In both cases, the distance of the string corresponding to a given candidate character is reduced by the confidence with which the respective fragment is recognized as the character at the corresponding position of the compared string in the database.

The lower case of the COST function in formula (5) means that if the j-th character C(j) of the second string is not contained in CG(i) (in which case the i-th character C(i) of the OCR-recognized first string, being some candidate character CG(i, k) in CG(i), necessarily differs from C(j)), the cost is still calculated by the existing method.

Formulas (3) and (5) together show that the corresponding OCR recognition probability is taken into account when calculating the distance between the OCR-recognized first string and the second string in the database. The higher the OCR recognition probability with which the i-th character of the candidate first string is recognized as the character at the corresponding position of the compared second string, the more likely this character is the corresponding character of the second string, and the smaller the distance between them should be.

In another preferred embodiment, formula (5) can be refined, for example to:

$$COST = \begin{cases} 1 - p(C(j)), & C(j) \in CG(i) \\ 1, & C(j) \notin CG(i) \end{cases} \qquad (6)$$

This formula achieves the same effect as formula (5). In the lower case of formula (6), because C(j) ∉ CG(i), the OCR confidence of C(j) can be regarded as 0, so the lower case can be merged into the upper case, which gives formula (7):

$$COST = 1 - p(C(j)) \qquad (7)$$
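A minimal sketch of the distance with the confidence-weighted cost of formula (7) follows. It keeps the insertion and deletion terms of formula (3) at cost 1 and assumes the usual boundary values LD(i, 0) = i and LD(0, j) = j. The per-fragment dictionaries mapping candidate characters to confidences, and the confidences themselves, are invented for illustration.

```python
def weighted_ld(candidates, target):
    """LD distance of formula (3) with COST = 1 - p(C(j)) from formula (7).

    candidates: one dict per fragment mapping candidate character -> OCR confidence
    (hypothetical layout); a character absent from the dict has confidence 0.
    """
    m, n = len(candidates), len(target)
    ld = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        ld[i][0] = float(i)
    for j in range(1, n + 1):
        ld[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 1.0 - candidates[i - 1].get(target[j - 1], 0.0)
            ld[i][j] = min(ld[i - 1][j] + 1.0,
                           ld[i][j - 1] + 1.0,
                           ld[i - 1][j - 1] + cost)
    return ld[m][n]


# Invented confidences: only the third fragment is ambiguous ("j" 0.7, "i" 0.3).
ocr = [{"f": 0.9}, {"u": 0.9}, {"j": 0.7, "i": 0.3}, {"i": 0.9},
       {"t": 0.9}, {"s": 0.9}, {"u": 0.9}]
d_fujitsu = weighted_ld(ocr, "fujitsu")
d_fuiitsu = weighted_ld(ocr, "fuiitsu")
print(d_fujitsu, d_fuiitsu)
```

Under formula (4) both database strings would have distance 0 here. With formula (7) the distances are about 0.9 and 1.3, so the tie is broken in favor of "fujitsu".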

In another preferred embodiment, the COST function can be refined further. Specifically, when a candidate character of a fragment in the candidate character string (the i-th character of the first string, C(i) = CG(i, k)) differs from the character at the corresponding position of the compared string in the database (the j-th character C(j) of the second string), the distance of the string corresponding to this candidate character can be increased by a value p(C(i)) corresponding to the recognition confidence of this candidate character. The idea is that the higher the recognition confidence of a candidate character, the less it should, from the point of view of OCR recognition, be replaced by the corresponding character of the string in the database, and this should be reflected in a larger distance. When C(i) = C(j), the first string contains no character that differs from C(j), so the "differing character" can be regarded as empty, with an OCR recognition probability of 0. The further modified COST function is therefore:

$$COST = \begin{cases} 1 - p(C(j)) + p(C(i)), & C(i) \neq C(j) \\ 1 - p(C(j)), & C(i) = C(j) \end{cases} \qquad (8)$$

It can be rewritten as:

$$COST = 1 - p(C(j)) + p(C(i))\,|\,(C(i) \neq C(j)) \qquad (9)$$

Here p(C(i)) is the OCR recognition confidence of C(i) when C(i) in the first string differs from C(j) in the second string. The notation p(C(i))|(C(i) ≠ C(j)) means that the value p(C(i)) is added only when C(i) ≠ C(j); when C(i) = C(j), there is no C(i) that differs from C(j), and the term is 0.

Formula (3) can also be simplified further. Inserting a character into the first string and deleting a character from the first string can be interpreted, respectively, as the first string lacking a character (an "empty" character is placed at the corresponding position) and the second string lacking a character (an "empty" character is placed at the corresponding position), and for an empty character p(C(i)) or p(C(j)) in formula (9) is 0. With the COST of formula (9), the LD distance calculations for the insertion, deletion and substitution operations of formula (3) can then all be unified into formula (10):

$$LD(i, j) = LD(i-1, j-1) + COST \qquad (10)$$

In step S1030, it is judged whether the distance between the two character strings is less than a predetermined threshold. If it is, the OCR-recognized string is replaced in step S1040 with the corresponding string in the database; otherwise, the OCR recognition result is output.

Because a database containing the character string to be recognized is used for correction, the recognition accuracy for this string is further improved.
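The decision of steps S1030 and S1040 can be sketched as follows. To keep the example short it uses the plain Levenshtein distance on the top OCR string rather than the confidence-weighted variant. Normalizing by the length of the database string, the threshold value 0.3 and the sample database are all assumptions of this sketch.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance between two strings (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def correct_with_database(ocr_string, database, threshold=0.3):
    """Replace the OCR result with the closest database string if it is close enough."""
    if not database:
        return ocr_string
    best = min(database, key=lambda w: levenshtein(ocr_string, w) / max(len(w), 1))
    distance = levenshtein(ocr_string, best) / max(len(best), 1)
    return best if distance < threshold else ocr_string


# Invented company-name database, for illustration only.
database = ["fujitsu", "hitachi", "toshiba"]
print(correct_with_database("fujitsv", database))
print(correct_with_database("sony", database))
```

The first call is corrected to "fujitsu" because the normalized distance is 1/7. The second is left unchanged because no database string is close enough.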

The 6th embodiment

The first to fifth embodiments described above can be combined in any way. In a preferred approach, the methods of the above embodiments are combined so that character strings with a fixed pattern, for example e-mail addresses and web addresses, can be recognized efficiently and accurately. This applies in particular to handwritten character strings of this kind, which separators divide into a plurality of fields.

The recognition of the e-mail address "hanashiro@itto.or.jp" is described below as an example.

A valid e-mail address, for example "hanashiro@itto.or.jp", consists of three parts: the user name "hanashiro", the "@" character in the middle, and the domain name "itto.or.jp". The domain name has a hierarchical structure and can be divided into general domain names and special domain names. General domain names are broad and generic, such as "or" and "jp", whereas a special domain name represents the organization or group that owns it, such as "itto". The dot character "." usually serves as the separator between fields in the user name and the domain name.

Fig. 11 shows the method proposed in this embodiment of the invention for recognizing such an e-mail address.

Before the actual recognition, as is known to those skilled in the art, connected component (connected domain) analysis is performed on the character string image of the e-mail address. This was described above and is not repeated here.

On the basis of the connected component analysis, the separators in the e-mail address are recognized in step S1110. Besides the "." character, the "@" character is also regarded as a separator, because it separates the user name from the domain name. The "." character can be recognized with the method of the fourth embodiment of the present invention, which is not repeated here. The "@" character can be recognized with any existing character recognition method. Because of the special nature of the "@" character, however, a method for recognizing it is proposed in F. Kimura, K. Takashina, S. Tsuruoka and Y. Miyake, "Modified Quadratic Discriminant Functions and the Application to Chinese Character Recognition", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 9, no. 1, Jan. 1987, pp. 149-153. This method is summarized below.

First, fragments that may be "@" are searched for. Because "@" is relatively large, fragments that cannot be "@" are first filtered out by size: it is judged whether the width and the height of a fragment are each greater than a predetermined threshold, and if the width or the height is not greater than its threshold, the fragment is considered not to be "@". The fragments that pass are then evaluated with a modified quadratic discriminant function (MQDF) for "@", and the output values are converted into confidences (class-conditional probabilities). The modified quadratic discriminant function must of course be trained beforehand on a training sample set of "@" characters. Because an e-mail address contains exactly one "@" character, the fragment with the highest confidence is selected as the "@" character.

This processing yields the important separators "@" and "." of the e-mail address. These separators divide the e-mail address into different parts. For example, hanashiro@itto.or.jp is divided into the user name part "hanashiro", the special domain name part "itto" and the general domain name parts "or" and "jp".

Next, the parts delimited by the separators are processed individually. Each part can be recognized character by character or as a whole, and the embodiments described above can be used separately or in combination. In this processing, prior knowledge about the recognition target can be used to determine the nature of each part delimited by the separators, so that an appropriate dictionary or database and/or the corresponding statistics are used in the steps of the respective embodiments. For an e-mail address, for example, going from back to front there are normally the general domain names (usually one or two levels), then the special domain name (the user's domain name), and finally, beyond the "@", the user name. A general domain name dictionary (for example a top-level domain dictionary and/or a second-level domain dictionary), a special domain name database, a user name database, and so on can thus be used respectively. Similar rules apply to web addresses.
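The two-stage selection described above, a size filter followed by picking the most confident fragment, can be sketched as follows. The MQDF classifier itself is not implemented: `at_confidence` is a hypothetical callable standing in for the trained classifier, and the fragment measurements and confidences are invented.

```python
def find_at_fragment(fragments, min_width, min_height, at_confidence):
    """Return the index of the fragment chosen as "@", or None if none qualifies.

    fragments: list of dicts with "width" and "height" (hypothetical layout).
    at_confidence: callable returning the confidence that a fragment is "@"
    (stand-in for the MQDF output converted to a confidence).
    """
    # Size filter: both width and height must exceed their thresholds.
    eligible = [idx for idx, f in enumerate(fragments)
                if f["width"] > min_width and f["height"] > min_height]
    if not eligible:
        return None
    # Exactly one "@" is expected, so take the most confident eligible fragment.
    return max(eligible, key=lambda idx: at_confidence(fragments[idx]))


# Invented fragments with precomputed confidences in the "conf" field.
fragments = [
    {"width": 8, "height": 12, "conf": 0.20},
    {"width": 20, "height": 22, "conf": 0.91},
    {"width": 18, "height": 21, "conf": 0.35},
    {"width": 4, "height": 4, "conf": 0.60},
]
at_index = find_at_fragment(fragments, 15, 15, lambda f: f["conf"])
print(at_index)
```

The small fragment with confidence 0.60 never reaches the classifier stage because it fails the size filter.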

In one implementation, processing proceeds from back to front. In the recognition of an e-mail address, for example, the general domain names are recognized in step S1120, either as a whole or character by character.

In this step, for example, the connected components can be scanned from back to front. When the first separator "." is found, all connected components after it are recognized as a whole as one sub-image; when the second separator "." is found, the connected components between the first and second separators are recognized as a whole as one sub-image; and so on. This recognition can use, for example, the common modified quadratic discriminant function (MQDF) classifier mentioned above.

Because general domain names have a hierarchical structure (top-level domain and second-level domain) that follows certain inherent rules, these rules can be used in the recognition of general domain names according to the present invention to further improve the recognition rate and speed. For example, if the recognition result for the last domain label, that is, the first recognition result, is one of "com", "edu", "org" and "net", the recognition of general domain names ends, because under the domain naming rules the label before it must be the user's domain name. If the first recognition result is a country domain such as "jp", the next general domain name to be recognized must be one of the second-level domains "ac", "ad", "co", "ne", "or" and so on, which likewise follows from the domain naming rules. Therefore, if the probability that the next label is one of "ac", "ad", "co", "ne" and "or" is greater than a threshold Tr, for example Tr = 0.7, the recognition of general domain names ends, because the next label to be processed must be the user's domain name.
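The back-to-front rule for general domain names can be illustrated on text that has already been recognized. This sketch works on strings rather than on sub-images, so it omits the probability threshold Tr. The label sets are limited to the examples named in the text, and the fallback of treating an unknown last label as a single general domain name is an assumption.

```python
GENERIC_TLDS = {"com", "edu", "org", "net"}
JP_SECOND_LEVEL = {"ac", "ad", "co", "ne", "or"}


def split_domain(domain):
    """Split a domain into (special labels, general labels), scanning from the back."""
    labels = domain.split(".")
    last = labels[-1]
    if last in GENERIC_TLDS:
        n_general = 1  # the label before it must be the user's domain name
    elif last == "jp" and len(labels) >= 2 and labels[-2] in JP_SECOND_LEVEL:
        n_general = 2  # country domain followed by a second-level domain
    else:
        n_general = 1  # assumed fallback for labels not covered by the rules
    return labels[:-n_general], labels[-n_general:]


def split_address(address):
    """Split an e-mail address into user name, special and general domain labels."""
    user, _, domain = address.partition("@")
    special, general = split_domain(domain)
    return user, special, general


parts = split_address("hanashiro@itto.or.jp")
print(parts)
```

For the example address this yields the user name "hanashiro", the special domain name "itto" and the general domain names "or" and "jp".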

Because prior knowledge in the form of the domain naming rules is used here, the result of this whole-word recognition is clearly better than that of general recognition. Of course, character-by-character recognition can be carried out instead of whole-word recognition, and the first to sixth embodiments can likewise be used individually or in combination with the dictionary (or database) and/or the corresponding statistics selected on the basis of this prior knowledge.

Next, in step S1130, the special domain name and the user name are recognized. The methods of the first, second and third embodiments can be used for this.

Finally, in step S1140, existing databases can be used to correct the recognition result. For a Japanese e-mail address, for example, if the recognized general domain name is ".ac.jp", the special domain name before it must be the domain name of a university or research institution. A database of university and research institution domain names can therefore be used to spell-check and correct the recognition result. In addition, a database of common user names (including company names) can be used to check and correct the recognition result of the special domain name. This checking and correction can use the method of the fifth embodiment, for example by calculating the distance between the OCR-recognized string and the strings in the database. It goes without saying that this checking and correction can be applied to any part of a character string such as an e-mail address, provided a suitable database is available.

With the method according to the present invention, character strings with a fixed pattern, for example e-mail addresses and web addresses, can advantageously be recognized efficiently and accurately. This applies in particular to handwritten character strings of this kind, which separators divide into a plurality of fields.

The 7th embodiment

The seventh embodiment of the present invention is corresponding to the first embodiment of the present invention.

Figure 12 shows a character string recognition device according to the seventh embodiment of the present invention, which comprises a fragment division module 1202, an OCR recognition module 1204, a statistical information acquisition module 1206, and a first character string determination module 1208.

The fragment division module 1202 is designed to divide a character string image into a plurality of fragments. This division can be carried out using a variety of existing techniques. Taking the image of "hanashiro" as an example, segmentation and recognition can be performed based on the result of connected domain analysis. After segmentation, a plurality of image fragments "h", "a", "n", "a", ..., "o" are obtained.

The OCR recognition module 1204 is designed to perform OCR on the plurality of fragments to obtain candidate characters. Those skilled in the art will appreciate that character recognition yields only the probability that the character to be recognized is a particular character. For example, the character "h" may be recognized by OCR as "h" or as "b", but with different probabilities, also called recognition confidences. Each fragment therefore corresponds to at least one candidate character.

The statistical information acquisition module 1206 is designed to obtain statistical information about the candidate characters of the fragments and/or about character combinations formed from the candidate characters of the fragments. For example, a large vocabulary can be analyzed in advance to obtain, for the character combination "ha", the probability that a word contains this combination. Likewise, the probability that the character combination "han" occurs in the vocabulary can be obtained. The range of vocabulary used for the statistics can of course be restricted, for example to all company names or all personal names, so that the corresponding probabilities are obtained. If it is known when recognizing a character string that the string belongs to a certain field or range (see the description of the first to sixth embodiments), the corresponding probabilities can be used to obtain a better result. Note that the statistics can be compiled in advance, so that the statistical information acquisition module 1206 only needs to use the precomputed results directly.

The first character string determination module 1208 is designed to determine candidate character strings by combining the statistical information with the OCR recognition confidences of the candidate characters. For example, among several OCR candidate characters for the same fragment, the candidate with the higher probability of occurrence in the current context can be selected according to the statistical information; or the statistical probability and the OCR recognition confidence can each be given a weight and the candidate selected on the combined value; or the statistical information can be used to select among the candidates whose OCR confidence exceeds a certain threshold; and so on.
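The weighted selection mentioned above can be illustrated with a minimal Python sketch. It is not taken from the embodiment: the function name `pick_candidate`, the weights, and all numeric values are assumptions chosen only to show how a statistical probability and an OCR recognition confidence might be combined for a single fragment.

```python
def pick_candidate(ocr_conf, context_prob, w_stat=0.5, w_ocr=0.5, min_conf=0.0):
    """Pick one candidate character for a fragment (illustrative sketch).

    ocr_conf:     candidate character -> OCR recognition confidence
    context_prob: candidate character -> statistical probability in the current context
    min_conf:     candidates below this OCR confidence are ignored, unless that
                  would leave no candidate at all
    """
    eligible = {c: p for c, p in ocr_conf.items() if p >= min_conf} or dict(ocr_conf)
    # Weighted sum of the statistical probability and the OCR confidence.
    return max(
        eligible,
        key=lambda c: w_stat * context_prob.get(c, 0.0) + w_ocr * eligible[c],
    )


# Made-up numbers: OCR slightly prefers "b", the statistics clearly prefer "h".
ocr_conf = {"h": 0.55, "b": 0.60}
context_prob = {"h": 0.30, "b": 0.05}
weighted_choice = pick_candidate(ocr_conf, context_prob)
ocr_only_choice = pick_candidate(ocr_conf, context_prob, w_stat=0.0, w_ocr=1.0)
```

With these assumed values the weighted choice is "h", whereas relying on the OCR confidence alone would select "b"; setting `min_conf` corresponds to the third variant, in which the statistics are applied only to candidates above a confidence threshold.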

Because the character string recognition device of this embodiment uses, in addition to the OCR recognition confidence, statistical information about combinations of candidate characters across several fragments, it achieves higher recognition accuracy.

The 8th embodiment

The eighth embodiment of the present invention is corresponding to the second embodiment of the present invention.

Figure 13 shows a character string recognition device according to the eighth embodiment of the present invention, which comprises a fragment division module 1302, an OCR recognition module 1304, a screening module 1306, a syllable combination module 1308, a path generation module 1310, a score calculation module 1312, and a first character string determination module 1314.

The fragment division module 1302 is designed to divide a character string image into a plurality of fragments. As in the second embodiment, taking the image of the character string "hanashiro" as an example, the image is divided into a plurality of fragments; it is assumed here that the division yields exactly the image fragments corresponding to "h", "a", "n", "a", and so on.

The OCR recognition module 1304 is designed to perform OCR on each fragment, thereby determining the OCR recognition probabilities for each image fragment in the "hanashiro" image. The corresponding recognition results are shown in Figure 3. For details, see the second embodiment; they are not repeated here.

The screening module 1306 is designed to screen the OCR recognition results of each character. The principle of screening is to filter out recognition results with relatively low confidence and keep those with relatively high confidence. For example, screening can use the formula $Cer_i / \max_j(Cer_j) > T$, where i is the index of the recognition result. In the example recognition results shown in Figure 3, 10 candidate characters are listed for each character, so i runs from 1 to 10 here. $Cer_i$ is the OCR probability of the i-th candidate character, and T is a threshold that can be set, for example, to 0.75. The formula expresses that if the confidence of a candidate character is too small relative to the maximum confidence over all candidate characters, the character to be recognized is considered not to be this candidate, and the candidate is disregarded in subsequent calculations, which greatly reduces the amount of computation. In this way, the screening result shown on the right of Figure 4 is obtained. For details, see again the second embodiment. Other screening methods can of course be used: for example, screening can use a threshold that is independent of the maximum confidence, or simply keep a predetermined number of candidate characters with the highest confidence, and so on.
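The relative-threshold screening described above can be sketched in a few lines of Python. The function name `screen_candidates` and the sample confidences are assumptions for illustration; only the ratio test against the threshold T (0.75 in the example above) comes from the description.

```python
def screen_candidates(candidates, threshold=0.75):
    """Keep candidates whose confidence, relative to the best one, exceeds the threshold.

    candidates: candidate character -> OCR recognition confidence (Cer_i)
    threshold:  the value T in  Cer_i / max(Cer) > T
    """
    if not candidates:
        return {}
    best = max(candidates.values())
    if best <= 0:
        return {}
    return {c: conf for c, conf in candidates.items() if conf / best > threshold}


# Made-up confidences for one fragment.
fragment = {"h": 0.80, "b": 0.70, "k": 0.30}
kept = screen_candidates(fragment)
```

Here "b" survives because 0.70 / 0.80 = 0.875 exceeds 0.75, while "k" is dropped because 0.30 / 0.80 = 0.375 does not. Because the best candidate always has ratio 1, at least one candidate remains for any T below 1.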

The syllable combination module 1308 is designed to combine candidate characters into syllables made up of several characters (generally two or three). For this purpose, a large number of Japanese user names are analyzed in advance, yielding results such as those shown in Figure 5, where the left side shows common two-character syllables and the right side shows common three-character syllables. Again, see the second embodiment for details.

The path generation module 1310 is designed to enumerate, based on the above results, all possible paths by syllable (that is, the possible combinations of candidate characters). For each path, because the "syllable" or "character" corresponding to each node of the path can have several candidate recognition results, several candidate character strings of equal length can be formed. The left side of Figure 4 shows the graph containing all paths used to recognize the "hanashiro" image.

The score calculation module 1312 is designed to calculate a score for each node in Figure 4 by combining the statistical information with the OCR recognition confidence. The statistical information here concerns the probability of each syllable occurring on its own and the probability of certain syllables occurring together; see the description of the second embodiment for details. Because no other node precedes the first node, the score of the first node is calculated directly from the probability of its occurring on its own in the statistical sense and its OCR recognition probability. From the second node onward, each node is statistically related to the preceding node, so the recognition probability of the current node is calculated from the score of the preceding node, the probability that the current node occurs given that the preceding node has occurred, and the OCR recognition probability of the current node. Following this idea, a probability can be calculated for each node. The detailed score calculation is described in the second embodiment and is not repeated here.

The first character string determination module 1314 is designed to determine, from the above score calculation results, the candidate character string with the highest recognized probability.
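The node-by-node scoring and the final selection can be sketched as a small dynamic program. This is only an illustration under stated assumptions: the exact score formula is given in the second embodiment and is not reproduced here, so the sketch simply multiplies the preceding score, the conditional probability, and the OCR confidence. The names `best_string`, `unigram`, `bigram`, and `floor`, and all numbers, are hypothetical.

```python
def best_string(nodes, unigram, bigram, floor=1e-6):
    """Return (score, text) of the highest-scoring path through the nodes.

    nodes:   list of dicts, each mapping a candidate unit (syllable or character)
             to its OCR recognition confidence
    unigram: unit -> probability of the unit occurring on its own
    bigram:  (previous unit, unit) -> probability of unit given the previous unit
    floor:   assumed small probability for combinations missing from the statistics
    """
    if not nodes:
        return (0.0, "")
    # First node: no predecessor, so use the stand-alone probability and the OCR confidence.
    scores = {u: (unigram.get(u, floor) * conf, u) for u, conf in nodes[0].items()}
    for node in nodes[1:]:
        new_scores = {}
        for unit, conf in node.items():
            # Later nodes: preceding score * P(unit | previous unit) * OCR confidence.
            new_scores[unit] = max(
                (prev_score * bigram.get((prev, unit), floor) * conf, text + unit)
                for prev, (prev_score, text) in scores.items()
            )
        scores = new_scores
    return max(scores.values())


# Made-up statistics and confidences for the beginning of "hanashiro".
nodes = [{"ha": 0.6, "ba": 0.7}, {"na": 0.9}, {"shi": 0.8, "sni": 0.5}]
unigram = {"ha": 0.05, "ba": 0.02}
bigram = {("ha", "na"): 0.2, ("ba", "na"): 0.1, ("na", "shi"): 0.3}
score, text = best_string(nodes, unigram, bigram)
```

With these assumed values the path "ha", "na", "shi" wins even though OCR alone prefers "ba" for the first node, which is the effect the statistical information is meant to produce.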

Because the character string recognition device of this embodiment divides the character string to be recognized into syllables and uses statistically derived probabilities of how these syllables are associated with one another in natural language, recognition accuracy is greatly improved.

The 9th embodiment

The ninth embodiment of the present invention is corresponding to the third embodiment of the present invention.

Figure 14 shows a character string recognition device according to the ninth embodiment of the present invention, which comprises a fragment division module 1402, an OCR recognition module 1404, a screening module 1406, a path generation module 1408, a score calculation module 1410, and a first character string determination module 1412. As described for the third embodiment, this device is particularly suited to character strings in which letters and digits are mixed, for example "hanashiro123".

The fragment division module 1402 is designed to divide the input character string image into a plurality of fragments.

The OCR recognition module 1404 is designed to perform OCR on each fragment of this mixed letter-and-digit character string to determine the OCR recognition probabilities. As in Figure 3, a series of probabilities is likewise obtained for each character.

The screening module 1406 is designed to screen the OCR recognition results of each character. The principle of screening is to filter out recognition results with relatively low confidence and keep those with relatively high confidence. For example, screening can use the formula $Cer_i / \max_j(Cer_j) > T$, where i is the index of the recognition result; if, as in the example recognition results shown in Figure 3, 10 candidate characters are listed for each character, i runs from 1 to 10. $Cer_i$ is the OCR probability of the i-th candidate character, and T is a threshold that can be set, for example, to 0.75. In this way, a screening result similar to that shown on the right of Figure 4 is obtained. Other screening methods can of course be used: for example, screening can use a threshold that is independent of the maximum confidence, or simply keep a predetermined number of candidate characters with the highest confidence, and so on.

The path generation module 1408 is designed to enumerate all possible paths based on the screening result. For each path, because the "character" corresponding to each node of the path can have several candidate recognition results, several recognized words of equal length can be formed. A path graph similar to that in Figure 4 is thus obtained.
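Enumerating the equal-length candidate strings from the screened candidates amounts to taking a Cartesian product over the fragments. The following Python sketch assumes the screening result is available as one list of candidate characters per fragment; the function name `enumerate_strings` and the sample candidates are hypothetical.

```python
from itertools import product


def enumerate_strings(screened):
    """List every candidate string obtainable by picking one candidate per fragment.

    screened: list of lists; screened[k] holds the candidate characters kept
              for fragment k after screening
    """
    return ["".join(chars) for chars in product(*screened)]


# Made-up screening result for a three-fragment image mixing letters and digits.
screened = [["l", "1"], ["2"], ["3", "s"]]
strings = enumerate_strings(screened)
```

For this input the sketch yields four strings of length three: "l23", "l2s", "123", and "12s". The number of strings grows multiplicatively with the number of candidates kept per fragment, which is why the screening step matters for the amount of computation.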

The score calculation module 1410 is designed to calculate the score of each node in the path graph by combining the statistical information with the OCR recognition confidence. The statistical information here concerns in particular the probability of occurrence of the various combinations of letters and digits. For how this statistical information is obtained and how the score is calculated, see the third embodiment; the details are not repeated here.

The first character string determination module 1412 is designed to find the path with the highest score from the scores obtained for each character, thereby determining the candidate character string with the highest recognized probability.

Because the character string recognition device of this embodiment uses statistically derived probabilities of particular combinations of letters and digits, that is, information about the naming habits of a large number of users, recognition accuracy for character strings mixing digits and letters is clearly improved.

The tenth embodiment

The tenth embodiment of the present invention is corresponding to the fourth embodiment of the present invention.

Figure 15 shows a separator recognition module according to the tenth embodiment of the present invention, which comprises a connected domain analysis unit 1502 and a separator determination unit 1504.

The connected domain analysis unit 1502 is designed to perform connected domain analysis on the input character string image. From this analysis, the parameters of each connected domain CC can be obtained, for example the coordinates giving the position of the connected domain, the number of pixels in the connected domain, and so on.

The separator determination unit 1504 is designed to determine separators based on the result of the connected domain analysis. To recognize separators, the separator determination unit 1504 first determines a threshold. For example, the connected domains with the fewest pixels can be selected according to their pixel counts, the mean pixel count of these connected domains calculated, and the threshold set according to this mean. Here, for example, the first three of these connected domains are used for the calculation, giving the mean Av3. The threshold is then set to T1 = α·Av3, where α is a parameter that can be chosen and adjusted as circumstances require in order to achieve the best recognition result; for example, α = 3 can be chosen. The threshold can of course be determined in other ways: it can be determined directly from the result of training on a large number of samples, or from the result of training on samples from the current user, and so on.

The separator determination unit 1504 then checks whether the pixel count of each connected domain is less than T1; if so, the connected domain is regarded as a candidate separator ".". Because writing styles differ between users, the size of the separator "." can vary considerably from one user's handwriting to another's. Calculating the mean pixel count from the several smallest connected domains, as described above, takes this difference in writing style (and hence in the size of the dots written) into account, so that the separator "." is recognized in an "adaptive" manner.

Because the characters "i" and "j" also contain a dot, the separator determination unit 1504 further checks whether the recognized "." lies at the bottom of the character row; if so, the dot is determined to be a separator. This check can be made using the coordinate parameters of each connected domain.
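The adaptive threshold and the position check can be sketched as follows. The sketch assumes the connected domain analysis has already produced, for each connected domain, a pixel count and the top and bottom row coordinates of its bounding box (with y increasing downward). The `Component` type, the function name `find_separators`, and in particular the rule that "at the bottom of the character row" means "vertical centre within the lowest third of the row" are assumptions made for illustration, not details given in the description.

```python
from dataclasses import dataclass


@dataclass
class Component:
    pixels: int   # number of foreground pixels in the connected domain
    top: int      # topmost row of the bounding box (y grows downward)
    bottom: int   # bottommost row of the bounding box


def find_separators(components, alpha=3.0, n_smallest=3, bottom_fraction=1 / 3):
    """Return the connected domains regarded as "." separators."""
    if not components:
        return []
    # Threshold T1 = alpha * mean pixel count of the n smallest connected domains.
    smallest = sorted(c.pixels for c in components)[:n_smallest]
    t1 = alpha * sum(smallest) / len(smallest)
    # Assumed position test: the centre must lie in the lowest part of the row.
    row_top = min(c.top for c in components)
    row_bottom = max(c.bottom for c in components)
    cutoff = row_bottom - bottom_fraction * (row_bottom - row_top)
    return [
        c for c in components
        if c.pixels < t1 and (c.top + c.bottom) / 2 >= cutoff
    ]


# Made-up connected domains: two letters, the stem and the dot of an "i", two periods.
row = [
    Component(120, 20, 50),  # letter
    Component(10, 5, 9),     # dot of "i" (small, but at the top of the row)
    Component(60, 15, 50),   # stem of "i"
    Component(12, 44, 50),   # period
    Component(90, 20, 50),   # letter
    Component(11, 45, 50),   # period
]
separators = find_separators(row)
```

In this example Av3 is 11 and T1 is 33, so the dot of the "i" and both periods pass the size test, and only the two periods pass the position test. If a row contains fewer small dots than `n_smallest`, larger components enter the mean and raise the threshold; this is one reason the description also mentions thresholds obtained by training.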

The 11th embodiment

The eleventh embodiment of the present invention corresponds to the sixth embodiment of the present invention.

Figure 16 shows a character string recognition device according to the eleventh embodiment of the present invention. This device can, for example, efficiently and accurately recognize a character string that is divided into a plurality of fields by separators and in which at least some of the fields follow a fixed pattern, such as an e-mail address or a web address, and in particular can do so for handwritten character strings of this kind. The device comprises a separator recognition module 1602, a general domain name recognition module 1604, a special domain name and user name recognition module 1606, and a second character string determination module 1608.

On the basis of the result of the connected domain analysis, the separator recognition module 1602 is designed to recognize the separators in the character string. For an e-mail address, the "@" character is regarded as a separator in addition to the "." character. The recognition of the "." character is described in the fourth embodiment of the present invention and the recognition of the "@" character in the sixth embodiment; they are not repeated here.

These separators divide the e-mail address into different parts. For example, hanashiro@itto.or.jp is divided into the user name part "hanashiro", the special domain name part "itto", and the general domain name parts "or" and "jp".
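The field structure can be illustrated on plain text (the device itself works on image fragments, so this is only an analogy). The following Python sketch splits an address at "@" and "." and treats the trailing labels found in a small set as the general domain name; the function name `split_address` and the contents of `GENERAL_LABELS` are assumptions for illustration, not a list taken from the description.

```python
# Hypothetical set of labels that may appear in a general domain name.
GENERAL_LABELS = {"jp", "or", "ac", "co", "com", "net", "org"}


def split_address(address, general_labels=GENERAL_LABELS):
    """Split an e-mail address into user name, special and general domain name parts."""
    user, _, domain = address.partition("@")
    labels = domain.split(".") if domain else []
    # Walk from the right: trailing labels found in the set form the general domain name.
    i = len(labels)
    while i > 0 and labels[i - 1] in general_labels:
        i -= 1
    return {"user": user, "special": labels[:i], "general": labels[i:]}


parts = split_address("hanashiro@itto.or.jp")
```

For the example address this gives the user name "hanashiro", the special domain name part "itto", and the general domain name parts "or" and "jp", matching the division described above.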

The general domain name recognition module 1604 is designed to recognize the general domain name. The general domain name can be recognized as a whole or character by character. Because general domain names have a hierarchical structure with certain inherent rules, these rules can be exploited in the recognition of general domain names according to the present invention to improve the recognition rate and the recognition speed. For details of general domain name recognition, see the sixth embodiment of the present application; they are not repeated here.

The special domain name and user name recognition module 1606 is designed to recognize the special domain name and the user name. The methods of the first, second, and third embodiments described above can be used for this; again, they are not repeated here.

The second character string determination module 1608 is designed to check and correct the recognition result using existing libraries. For example, for a Japanese e-mail address, if the recognized general domain name is ".ac.jp", the preceding special domain name must be the domain name of a university or research institution. A library of university and research-institution domain names can therefore be used to spell-check and correct the recognition result. In addition, for common special domain names, such as "fujitsu", or common user names, the corresponding libraries can be used to check and correct the recognition result. This checking and correction can use the method of the fifth embodiment described above, for example by calculating the distance between the OCR-recognized character string and the character strings in the library; the details are not repeated here.
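A confidence-weighted distance of this kind might be sketched as follows. The fifth embodiment's actual calculation is not reproduced in this section, so the sketch is an assumption based only on the general idea that a mismatch adds the confidence of the mismatching candidate and that the confidence with which the fragment is recognized as the library character is subtracted. The names `confidence_distance` and `correct`, the restriction to strings of equal length, and all numbers are hypothetical.

```python
def confidence_distance(fragments, entry):
    """Distance between an OCR result and one library string (illustrative sketch).

    fragments: list of non-empty dicts, candidate character -> OCR confidence
    entry:     library string; only strings of the same length are compared here
    """
    if len(fragments) != len(entry):
        return float("inf")
    distance = 0.0
    for candidates, target in zip(fragments, entry):
        top = max(candidates, key=candidates.get)
        if top != target:
            distance += candidates[top]              # penalty: confidence of the mismatch
            distance -= candidates.get(target, 0.0)  # credit: confidence of the library character
    return distance


def correct(fragments, library):
    """Return the library string closest to the OCR result."""
    return min(library, key=lambda entry: confidence_distance(fragments, entry))


# Made-up OCR output whose top candidates spell "fujltsu".
fragments = [
    {"f": 0.9}, {"u": 0.6, "v": 0.5}, {"j": 0.8}, {"l": 0.55, "i": 0.5},
    {"t": 0.9}, {"s": 0.9}, {"u": 0.7, "v": 0.6},
]
corrected = correct(fragments, ["fujitsu", "fujifilm", "toshiba"])
```

With these assumed values the only mismatch against "fujitsu" is the fourth fragment, where the penalty of 0.55 is almost cancelled by the credit of 0.5, so "fujitsu" is by far the closest library entry.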

With the character string recognition device according to this embodiment, a character string that is divided into a plurality of fields by separators and in which at least some of the fields follow a fixed pattern, such as an e-mail address or a web address, can advantageously be recognized efficiently and accurately; in particular, handwritten character strings of this kind can be recognized efficiently and accurately.

It should also be understood that the examples and embodiments described herein are illustrative and that the invention is not limited to them. In this specification, expressions such as "first" and "second" serve only to distinguish the described features verbally, so as to describe the invention clearly. They should therefore not be regarded as having any limiting meaning.

The modules and units of the above devices can be configured by software, firmware, hardware, or a combination thereof. The specific means or manner of configuration are well known to those skilled in the art and are not described here. Where implementation is by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure (for example the general-purpose computer 1700 shown in Figure 17), which can perform various functions when the various programs are installed.

In Figure 17, a central processing unit (CPU) 1701 performs various kinds of processing according to a program stored in a read-only memory (ROM) 1702 or a program loaded from a storage section 1708 into a random-access memory (RAM) 1703. The RAM 1703 also stores, as needed, the data required when the CPU 1701 performs the various kinds of processing. The CPU 1701, the ROM 1702, and the RAM 1703 are connected to one another via a bus 1704. An input/output interface 1705 is also connected to the bus 1704.

The following components are connected to the input/output interface 1705: an input section 1706 (including a keyboard, a mouse, and the like), an output section 1707 (including a display, such as a cathode ray tube (CRT) or liquid crystal display (LCD), and a loudspeaker), the storage section 1708 (including a hard disk and the like), and a communication section 1709 (including a network interface card such as a LAN card, a modem, and the like). The communication section 1709 performs communication processing via a network such as the Internet. A drive 1710 can also be connected to the input/output interface 1705 as needed. A removable medium 1711, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 1710 as needed, so that a computer program read from it is installed into the storage section 1708 as needed.

Where the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1711.

Those skilled in the art will understand that this storage medium is not limited to the removable medium 1711 shown in Figure 17, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 1711 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disc (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1702, a hard disk included in the storage section 1708, or the like, in which the program is stored and which is distributed to the user together with the device containing it.

The present invention also proposes a program product storing machine-readable instruction code. When the instruction code is read and executed by a machine, the above method according to the embodiments of the invention can be carried out.

Accordingly, a storage medium carrying the above program product storing machine-readable instruction code is also included in the disclosure of the present invention. The storage medium includes, but is not limited to, a floppy disk, an optical disc, a magneto-optical disc, a memory card, a memory stick, and the like.

Finally, it should also be noted that the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Furthermore, in the absence of further restrictions, an element defined by the statement "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device comprising that element.

Although embodiments of the present invention have been described in detail above with reference to the drawings, it should be understood that the embodiments described above serve only to illustrate the invention and do not limit it. Those skilled in the art can make various changes and modifications to the above embodiments without departing from the spirit and scope of the invention. The scope of the invention is therefore defined only by the appended claims and their equivalents.

It should be noted that Japanese user names are used as examples in the above embodiments; user names are not limited to these, however, and can be user names in any language. As can be seen from the above description, embodiments of the present invention provide the following schemes:

Remark 1. A method for recognizing a character string, comprising the following steps:

dividing a character string image into a plurality of fragments;

performing OCR on the plurality of fragments to obtain candidate characters, wherein each fragment corresponds to at least one candidate character;

obtaining statistical information about the candidate characters of the fragments and/or about character combinations formed from the candidate characters of the fragments; and

determining candidate character strings by combining the statistical information with the OCR recognition confidences of the candidate characters.

Remark 2. The method according to Remark 1, wherein the statistical information comprises: the probability that a character, a character type, and/or a character combination occurs together with at least one predetermined character, character type, and/or character combination.

Remark 3. The method according to Remark 1 or 2, wherein the character combination is a combination of characters forming a syllable or a combination of characters of the same type.

Remark 4. The method according to Remark 1 or 2, wherein the character string comprises a separator, and wherein the method further comprises recognizing the separator in the character string.

Remark 5. The method according to Remark 4, wherein the step of recognizing the separator in the character string comprises:

performing connected domain analysis on the character string image to obtain connected domains of foreground pixels; and

determining the separator according to the pixel counts of the connected domains.

Remark 6. The method according to Remark 5, wherein the step of determining the separator according to the pixel counts of the connected domains comprises: determining a threshold from the pixel counts of the several connected domains with the fewest pixels, and determining that a connected domain is a separator when its pixel count is less than this threshold and the connected domain lies at the bottom of the character string image.

Remark 7. The method according to Remark 1, further comprising calculating, based on the OCR recognition confidences, the distance between the candidate character string and the character strings in a predefined database, and determining the candidate character string accordingly.

Remark 8. The method according to Remark 7, wherein the distance of the character string corresponding to a certain candidate character is reduced by a value corresponding to the recognition confidence with which the respective fragment is recognized as the character at the corresponding position in the compared character string in the database.

Remark 9. The method according to Remark 7 or 8, wherein, when any candidate character of a certain fragment in the candidate character string differs from the character at the corresponding position in the compared character string in the database, the distance of the character string corresponding to this candidate character is increased by a value corresponding to the recognition confidence of this candidate character.

Remark 10. A character string recognition device, comprising:

a fragment division module for dividing a character string image into a plurality of fragments;

an OCR recognition module for performing OCR on the plurality of fragments to obtain candidate characters, wherein each fragment corresponds to at least one candidate character;

a statistical information acquisition module for obtaining statistical information about the candidate characters of the fragments and/or about character combinations formed from the candidate characters of the fragments; and

a first character string determination module for determining candidate character strings by combining the statistical information with the OCR recognition confidences of the candidate characters.

Remark 11. The character string recognition device according to Remark 10, wherein the statistical information comprises: the probability that a character, a character type, and/or a character combination occurs together with at least one predetermined character, character type, and/or character combination.

Remark 12. The character string recognition device according to Remark 10 or 11, wherein the character combination is a combination of characters forming a syllable or a combination of characters of the same type.

Remark 13. The character string recognition device according to Remark 10 or 11, wherein the character string comprises a separator, and wherein the character string recognition device further comprises a separator recognition module for recognizing the separator in the character string.

Remark 14. The character string recognition device according to Remark 13, wherein the separator recognition module comprises:

a connected domain analysis unit for performing connected domain analysis on the character string image to obtain connected domains of foreground pixels; and

a separator determination unit for determining the separator according to the pixel counts of the connected domains.

Remark 15. The character string recognition device according to Remark 14, wherein the separator determination unit is configured, when determining the separator according to the pixel counts of the connected domains, to determine a threshold from the pixel counts of the several connected domains with the fewest pixels, and to determine that a connected domain is a separator when its pixel count is less than this threshold and the connected domain lies at the bottom of the character row.

Remark 16. The character string recognition device according to Remark 10, further comprising a second character string determination module for calculating, based on the OCR recognition confidences, the distance between the candidate character string and the character strings in a predefined database, and determining the candidate character string accordingly.

Remark 17. The character string recognition device according to Remark 16, wherein the second character string determination module is configured to reduce the distance of the character string corresponding to a certain candidate character by a value corresponding to the confidence with which the respective fragment is recognized as the character at the corresponding position in the compared character string in the database.

Remark 18. The character string recognition device according to Remark 16 or 17, wherein the second character string determination module is configured such that, when any candidate character of a certain fragment in the candidate character string differs from the character at the corresponding position in the compared character string in the database, the distance of the character string corresponding to this candidate character is increased by a value corresponding to the recognition confidence of this candidate character.

19. 1 kinds of program products of remarks, this program product comprises the executable instruction of machine, when carrying out described instruction on messaging device, described instruction makes described messaging device carry out as remarks 1 described method.

Remark 20. A storage medium comprising machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the method according to Remark 1.
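The distance described in Remarks 16 to 18 can be illustrated with a minimal Python sketch. It is not the implementation of the invention. It assumes that each fragment is aligned one-to-one with a position in the database character string, that confidences are plain floating-point scores, and that the database character string with the smallest distance is selected. The names `string_distance` and `closest_string` are hypothetical.

```python
from typing import List, Optional, Tuple

# OCR output for one fragment: (candidate character, recognition confidence).
Candidates = List[Tuple[str, float]]


def string_distance(fragments: List[Candidates], reference: str) -> float:
    """Confidence-weighted distance between OCR candidates and one database string.

    Assumption: fragment i corresponds to reference[i].
    """
    if len(fragments) != len(reference):
        raise ValueError("fragments and reference must have the same length")
    distance = 0.0
    for candidates, ref_char in zip(fragments, reference):
        for char, confidence in candidates:
            if char == ref_char:
                # Remark 17: a match lowers the distance by the confidence.
                distance -= confidence
            else:
                # Remark 18: a mismatch raises the distance by the confidence.
                distance += confidence
    return distance


def closest_string(fragments: List[Candidates], database: List[str]) -> Optional[str]:
    """Return the database string with the smallest distance, or None.

    Assumption: only database strings with one character per fragment are compared.
    """
    comparable = [s for s in database if len(s) == len(fragments)]
    if not comparable:
        return None
    return min(comparable, key=lambda s: string_distance(fragments, s))
```

For example, with two fragments whose candidates are `[("A", 0.9), ("4", 0.3)]` and `[("B", 0.6), ("8", 0.5)]`, the database character string "AB" matches the higher-confidence candidate in both positions and therefore receives a smaller distance than "A8" or "48".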

Claims

1. A method for recognizing a character string, comprising the following steps:

dividing a character string image into a plurality of fragments;

2. A character string recognition device, comprising:

3. The character string recognition device according to claim 2, wherein the statistical information comprises: the probability that a character, character type and/or character combination occurs together with at least one predetermined character, character type and/or character combination.

4. The character string recognition device according to claim 2 or 3, wherein the character combination is a combination of characters forming a syllable or a combination of characters of the same type.

5. The character string recognition device according to claim 2 or 3, wherein the character string comprises a separator, and wherein the character string recognition device further comprises a separator recognition module for recognizing the separator in the character string.

6. The character string recognition device according to claim 5, wherein the separator recognition module comprises:

7. The character string recognition device according to claim 6, wherein the separator determination unit is configured, when determining a separator from the pixel counts of the connected domains, to determine a threshold from the pixel counts of the several connected domains having the fewest pixels, and to determine that a connected domain is a separator when its pixel count is less than this threshold and it is located at the bottom of the character row.

8. The character string recognition device according to claim 2, further comprising a second character string determination module for determining candidate character strings by calculating, based on the OCR recognition confidence, the distance between a candidate character string and the character strings in a predefined database.

9. The character string recognition device according to claim 8, wherein the second character string determination module is configured to reduce the distance of the character string corresponding to a given candidate character by a value corresponding to the confidence with which the fragment corresponding to that candidate character is recognized as the character at the corresponding position in the compared database character string.

10. The character string recognition device according to claim 8 or 9, wherein the second character string determination module is configured so that, when any candidate character of a fragment in the candidate character string differs from the character at the corresponding position in the compared database character string, the distance of the character string corresponding to that candidate character is increased by a value corresponding to the recognition confidence of that candidate character.