CN103902993A - Document image identification method and device - Google Patents

Document image identification method and device

Publication number
CN103902993A
Authority
CN
China
Prior art keywords
character
language
character string
string unit
similar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210583676.9A
Other languages
Chinese (zh)
Inventor
李建杰
李献
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc
Priority to CN201210583676.9A
Publication of CN103902993A

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention discloses a document image identification method and device. The method which is used for identifying a document image with mixed main language and secondary language letters comprises the steps of segmentation which is used for segmenting the document image into at least one long character string, extraction which is used for extracting a character string unit from each long character string according to special characters comprised by the long character string, and identification which is used for identifying the document image based on each identified character string unit.

Description

Document image recognition method and device
Technical field
The present invention relates to a method and device for recognizing a document image. In particular, the present invention relates to a method and device that recognize a document image by distinguishing the language of each part of a long character string segmented from the document image.
Background art
In the field of optical character recognition (OCR), most OCR systems frequently have to handle document images in which several languages are mixed. Many techniques have been developed for distinguishing or categorizing such mixed-language documents. Some of them classify a document containing different languages before OCR is performed on it.
For example, documents 1 and 2, which are incorporated herein by reference in their entirety, disclose a method for distinguishing such documents. In this method, a system was developed that can identify 23 Latin-based languages (English, French, etc.) and three Han-based languages (Chinese, Japanese and Korean). First, the system uses upward concavity analysis to distinguish Latin-based scripts from Han-based scripts. Then, the system identifies the Han-based languages by analyzing the optical density of character cells. The Latin-based languages are identified by means of word shape tokens.
Documents 3 and 4, which are incorporated herein by reference in their entirety, disclose another classification method. In this method, a system was developed that can identify Arabic, ideographic and Latin scripts. The system classifies these three main scripts by using various attributes (height distribution, character density and horizontal projection of lines).
The drawback of these methods is that they cannot distinguish languages that contain characters with the same or similar shapes, for example simplified and traditional Chinese among the Han-based languages, or Russian (Greek) and Latin-based languages. The reason for incorrect results is explained below, taking a document image with mixed Russian (Greek)/Latin letters as an example, where the expression "mixed Russian (Greek)/Latin letters" refers to mixed Russian and Latin letters or to mixed Greek and Latin letters. To recognize the characters in such a document image, an OCR system has to perform recognition based on both the Russian (Greek) character set and the Latin character set. The Latin character set consists of two parts: the ASCII character set and an extended character set. In Latin-based languages, the characters defined in the ASCII part are generally used in words together with the characters defined in the extended part. In the Russian and Greek character sets, by contrast, the characters defined in the ASCII part are not used in Russian (Greek) words. The Russian (Greek) character set contains many characters whose shapes are similar to those of Latin letters but whose codes differ. For example, the Russian character with code 0xB0 in ISO-8859-5 (the Cyrillic character set) has the same shape as the Latin character 'A', which is defined as 0x41 in ISO-8859. An OCR system cannot distinguish such similar characters defined in the Russian (Greek) and Latin alphabets on the basis of their shape features. Consequently, the recognition results for these similar characters are often incorrect.
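The code clash between look-alike characters can be checked with Python's standard codecs. The sketch below is only an illustration and not part of the disclosed method; it assumes the built-in `iso8859_5` (Cyrillic) and `iso8859_1` (Latin-1) codecs as stand-ins for the character sets discussed.

```python
import unicodedata

# Decode the two byte codes with Python's built-in ISO-8859 codecs.
cyrillic_a = bytes([0xB0]).decode("iso8859_5")  # code 0xB0 in the Cyrillic set
latin_a = bytes([0x41]).decode("iso8859_1")     # code 0x41 in the Latin set

# The glyphs look alike, but the characters are different.
print(unicodedata.name(cyrillic_a))
print(unicodedata.name(latin_a))
print(cyrillic_a == latin_a)
```

Because the two characters differ only in code and not in shape, a shape-based classifier has no feature left to separate them, which is the failure mode the paragraph above describes.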
To handle document images that contain characters with the same or similar shapes but different codes, a classification method based on OCR results is disclosed in Chinese patent application No. 200810108571.1, which is incorporated herein by reference in its entirety. The method is used to distinguish simplified Chinese from traditional Chinese. First, recognition confidence ranges for simplified and traditional Chinese are generated by training on a large number of documents of known language. Then, a document of unknown language is recognized separately with simplified Chinese OCR and traditional Chinese OCR. Based on the recognition confidence ranges, special characters are selected from the simplified or traditional Chinese recognition results. The method determines the language of the document by comparing the average recognition confidences of these special characters in simplified and traditional Chinese.
The drawback of this method is that it distinguishes the language of the whole document but cannot distinguish words of different languages within a sentence of the document. Fig. 1A shows an exemplary sentence of this kind, which contains many words of different languages mixed together. For such a sentence, determining the language of the sentence from the special characters of the different languages may be unreliable.
The method disclosed in document 5, which is incorporated herein by reference in its entirety, determines the language in a mixed-alphabet document by computing a score for each candidate of every character in a word in order to decide whether the character should be replaced by a character of the other language. To replace similar characters, the method generates a Cyrillic-Latin mapping table. The drawback of this method is that it requires character trigram frequencies and the Levenshtein distance for each candidate. The procedure is therefore time-consuming and carries a large overhead. In addition, the generation of these data depends on a large corpus, possibly a large subset of the Internet. Therefore, the method disclosed in document 5 cannot recognize mixed-alphabet document images quickly, efficiently and accurately.
US Patent 3988715, which is incorporated herein by reference in its entirety, discloses another method, also based on OCR recognition results, for classifying different languages that contain characters with the same or similar shapes. The patent proposes a method for handling documents in which several languages and digits are mixed. First, a character recognition engine recognizes the characters and outputs n channel results for each character, where each channel corresponds to one language or to digits. For a character string, the method computes the product of the joint conditional probabilities of the recognition results of the i-th channel. By comparing the products of the joint conditional probabilities of the character string over the n channels, the method identifies the language or data type of the whole character string, where the joint conditional probability is the joint probability of the recognition results in the other n-1 channels given that the output of the i-th channel is correct.
This method does not use the recognition confidence of the results, which carries more useful information, and it assumes that a character string contains only one language. Therefore, if a character string contains parts in different languages because of a word segmentation error, the result of the method will be unreliable, as shown in Fig. 1B.
Therefore, when a long character string contains parts in two or more languages, the prior art cannot determine the language of the whole string accurately. None of the prior-art techniques for distinguishing multiple languages discloses a step or device for handling such character strings. In other words, the prior art cannot reliably and accurately recognize a character string that contains parts in different languages.
Although similar characters defined in different alphabets always have the same or similar shapes, the prior art does not disclose any step or device for distinguishing multiple languages based on such similar characters.
Nor does the prior art disclose any step or device for handling words that consist entirely of similar characters.
List of cited documents
[1] P. Sibun and A. L. Spitz. Language Determination: Natural Language Processing from Scanned Document Images. In Proceedings of the Fourth Conference on Applied Natural Language Processing, pp. 423-433, Las Vegas, April 1995.
[2] L. Spitz. Determination of the Script and Language Content of Document Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 3, pp. 235-245, 1997.
[3] Y. Suen, S. Bergler, N. Nobile, B. Waked, C. P. Nadal and A. Bloch. Categorizing Document Image Into Script and Language Classes. In Proceedings of the International Conference on Advances in Pattern Recognition, 23-25 November 1998, Plymouth, UK, pp. 297-306.
[4] N. Nobile, S. Bergler, C. Y. Suen and S. Khoury. Language Identification of On-Line Documents Using Word Shapes. In Proceedings of the Fourth International Conference on Document Analysis and Recognition, August 1997, Ulm, Germany, pp. 258-262.
[5] Christoph Ringlstetter, Klaus U. Schulz, Stoyan Mihov and Katerina Louka. The Same is Not The Same - Postcorrection of Alphabet Confusion Errors in Mixed-Alphabet OCR Recognition. Proceedings of the 2005 Eighth International Conference on Document Analysis and Recognition (ICDAR'05).
Summary of the invention
The present invention is intended to solve the above problems. An object of the present invention is to provide a method and system that solve any of the above problems.
An object of the present invention is to provide a method and device for reliably and accurately recognizing character strings in a document image that contain parts in different languages.
Another object of the present invention is to provide a method and device for reliably and accurately recognizing character strings in a document image that consist entirely of similar characters.
In one aspect of the present invention, there is provided a method for recognizing a document image having mixed main-language and secondary-language letters, comprising: a segmentation step of segmenting the document image into at least one long character string; an extraction step of extracting character string units from each of the at least one long character string according to the special characters contained in the long character string; and a recognition step of recognizing the document image based on each identified character string unit.
In another aspect of the present invention, there is provided a device for recognizing a document image having mixed main-language and secondary-language letters, comprising: a segmentation unit configured to segment the document image into at least one long character string; an extraction unit configured to extract character string units from each of the at least one long character string according to the special characters contained in the long character string; and a recognition unit configured to recognize the document image based on each identified character string unit.
Further features of the present invention will become apparent from the following description of exemplary embodiments with reference to the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention. In the drawings, like reference numerals indicate like items.
Figs. 1A and 1B show two possible words having mixed main-language and secondary-language letters.
Fig. 2 is a block diagram of the arrangement of a computing device for implementing document image recognition.
Fig. 3 is a flowchart showing the document image recognition method of the first embodiment.
Fig. 4 is a block diagram showing the document image recognition device of the first embodiment.
Fig. 5 is a flowchart showing the process of the extraction step in the first embodiment.
Fig. 6 is a block diagram showing the extraction unit.
Figs. 7A and 7B are explanatory diagrams of special symbols.
Fig. 8 schematically illustrates the image geometry features of a character.
Figs. 9A and 9B are flowcharts showing the document image recognition method of the second embodiment.
Figs. 10A and 10B show the similar character tables for Russian and Latin.
Figs. 11A and 11B show the similar character tables for Greek and Latin.
Fig. 12 is a flowchart showing the process of the code-based determining step.
Fig. 13 is a flowchart showing the process of the first determining step of the second embodiment.
Fig. 14 is a flowchart showing the process of the third determining step of the second embodiment.
Fig. 15 shows a comparison of the first three candidates of non-similar characters.
Fig. 16 is a flowchart showing the process of the correction step.
Fig. 17 is a flowchart showing the document image recognition method of the third embodiment.
Fig. 18 is a flowchart showing the process of the confidence-based determining step.
Fig. 19 shows the confidences of the first three candidates of the non-similar characters in a character string unit.
Fig. 20 is a block diagram showing the document image recognition device of an embodiment of the present invention.
Figs. 21A to 21C show Example 1.
Fig. 22 shows a comparative example for Example 1.
Figs. 23A and 23B show Example 2.
Fig. 24 shows a comparative example for Example 2.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
To facilitate a thorough and proper understanding of the present invention, the terms used in the specification and claims are explained first.
In the specification and claims of the present application, particularly when used with respect to a document image, the term "main language" (or "host language") denotes the language that accounts for the larger proportion of words in the document image, and the term "secondary language" (or "minor language") denotes the language that accounts for the smaller proportion of words. For example, in a document written mainly in Russian, Russian is the main language, and another language, such as a Latin-script language (English, German, French, etc.), may be the secondary language.
In the present invention, the description takes Russian or Greek as the main language and a Latin-script language as the secondary language by way of example. For convenience, the steps of the following methods are described mainly in the context of recognizing documents with mixed Russian/Latin text, and the special data and thresholds below are generated for a Russian/Latin mixture. Because distinguishing and correcting proceed through the same steps for Russian/Latin and for Greek/Latin, and the special data or thresholds actually adopted depend only on the languages mixed in the document, the following steps are equally applicable to recognizing document images with mixed Greek and Latin text.
In this embodiment, the codes of Russian characters are defined in ISO-8859-5 (codes 0xA0 to 0xFF), and the codes of Greek characters are defined in ISO-8859-7 (codes 0xA0 to 0xFF). Latin characters are defined in ISO-8859-1, -2 and -4.
However, the present invention is not limited to this and is applicable to the recognition of any other document in which a main language and a secondary language are mixed.
The term "candidate" denotes a preliminary recognition result obtained when a character recognition engine recognizes a character using an OCR dictionary that contains both main-language and secondary-language characters; a candidate is a character that the preliminarily recognized character may be. In general, a character can have multiple candidates, which can be sorted in order of confidence regardless of their language type. The confidence of a candidate is the confidence obtained by recognition with the OCR dictionary that contains main-language and secondary-language characters.
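As a minimal sketch of how candidates might be represented, the following Python snippet ranks the candidates of one character by confidence regardless of language. The `Candidate` class, its field names and the confidence values are hypothetical and chosen only for illustration; the source does not prescribe a data structure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    code: int          # character code in its own character set
    language: str      # hypothetical label: "main" or "secondary"
    confidence: float  # recognition confidence reported by the OCR engine

def rank_candidates(candidates):
    """Sort candidates by descending confidence, ignoring their language."""
    return sorted(candidates, key=lambda c: c.confidence, reverse=True)

# Made-up preliminary results for one character image.
raw = [
    Candidate(code=0x41, language="secondary", confidence=0.81),
    Candidate(code=0xB0, language="main", confidence=0.86),
    Candidate(code=0xC4, language="secondary", confidence=0.40),
]
ranked = rank_candidates(raw)
first_candidate = ranked[0]
```

Here the main-language and secondary-language candidates compete in a single ranking, which is what allows a look-alike from the other alphabet to end up as the first candidate.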
The term "similar character" denotes a character that coincides with one of a pair of corresponding characters, one in the main language and one in the secondary language, that have similar shapes. In particular, a similar character may be a character in the main language or its counterpart in the secondary language; the two have the same or similar shapes but different codes in their respective character sets.
The term "similar character table" denotes a table made up of similar characters (that is, similar characters in the main language and the corresponding similar characters in the secondary language). A similar character table therefore generally comprises two sub-tables, one made up of the similar characters in the main language and the other of the similar characters in the secondary language. A character is regarded as a similar character as long as it is contained in either sub-table of the similar character table.
The term "non-similar character" denotes a character that is not in the similar character table.
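The two-sub-table structure can be sketched as follows. The pairs listed are a small hypothetical excerpt of look-alike Cyrillic and Latin capitals chosen for illustration; they are not the contents of the tables in Figs. 10A and 10B, and Unicode strings are used here instead of ISO-8859 codes for readability.

```python
# Hypothetical excerpt of (main-language, secondary-language) look-alike pairs.
SIMILAR_PAIRS = [
    ("\u0410", "A"),  # Cyrillic A / Latin A
    ("\u0412", "B"),  # Cyrillic Ve / Latin B
    ("\u0415", "E"),  # Cyrillic Ie / Latin E
    ("\u041e", "O"),  # Cyrillic O / Latin O
    ("\u0420", "P"),  # Cyrillic Er / Latin P
    ("\u0421", "C"),  # Cyrillic Es / Latin C
]

# One sub-table per language, as described in the definition above.
MAIN_SUBTABLE = {main for main, _ in SIMILAR_PAIRS}
SECONDARY_SUBTABLE = {secondary for _, secondary in SIMILAR_PAIRS}

def is_similar(ch):
    """A character is similar if it appears in either sub-table."""
    return ch in MAIN_SUBTABLE or ch in SECONDARY_SUBTABLE

def is_non_similar(ch):
    """A non-similar character is one that is in neither sub-table."""
    return not is_similar(ch)
```

Keeping the sub-tables as sets makes the membership test constant-time, which matters because it is applied to every character of every character string unit.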
The term "special symbol" denotes a character that is one of the comma (','), the full stop ('.'), the hyphen ('-') and the like. In a multilingual document, words of different languages may be linked by special symbols to form one long character string.
Fig. 2 is a block diagram showing the arrangement of a computing device that implements the document image recognition process according to an embodiment. For simplicity, the process is shown as built into a single computing device. However, the process is effective whether it is built into a single computing device or into multiple computing devices forming a network system.
As shown in Fig. 2, the computing device 100 is used to implement the document image recognition process. The computing device 100 may comprise a CPU 101, a chipset 102, a RAM 103, a storage controller 104, a display controller 105, a hard disk drive 106, a CD-ROM drive 107 and a display 108. The computing device 100 may further comprise a signal line 111 connected between the CPU 101 and the chipset 102, a signal line 112 connected between the chipset 102 and the RAM 103, a peripheral bus 113 connected between the chipset 102 and various peripheral devices, a signal line 114 connected between the storage controller 104 and the hard disk drive 106, a signal line 115 connected between the storage controller 104 and the CD-ROM drive 107, and a signal line 116 connected between the display controller 105 and the display 108.
A client device 120 may be connected to the computing device 100 directly or via a network 130. The client device 120 may, for example, send the computing device 100 the instructions and/or parameters required to perform document image recognition, and the computing device 100 may return information to the client device 120 or display information on the display 108.
[First embodiment]
The first embodiment of the present invention is described with reference to Figs. 3 and 4, where Fig. 3 is a flowchart of the method according to this embodiment for recognizing a document image having mixed main-language and secondary-language letters.
In step S301 (segmentation step), the document image is segmented into at least one (usually several) long character string. As a common segmentation technique, the document image with mixed main-language and secondary-language words is preliminarily recognized by a character recognition engine using an OCR dictionary that contains the main and secondary languages, where the recognition process includes, but is not limited to, line segmentation, character separation and single-character recognition; long character strings are then segmented from the recognition result thus obtained according to the space characters in it. Note that the above technique is merely exemplary and the present invention is not limited to it.
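The last part of this step, splitting the preliminary recognition result at space characters, can be sketched as below. The function name and the input format (one first-candidate string per recognized character) are assumptions made for illustration; line segmentation, character separation and recognition itself are outside the sketch.

```python
def segment_long_strings(chars):
    """Group recognized characters into long character strings at spaces.

    chars: list of first-candidate strings, one per recognized character.
    Returns a list of long character strings, each a list of characters.
    """
    strings, current = [], []
    for ch in chars:
        if ch == " ":
            if current:              # close the string in progress
                strings.append(current)
                current = []
        else:
            current.append(ch)
    if current:                      # flush the trailing string
        strings.append(current)
    return strings

# A recognized line with a hyphen-linked string and a repeated space.
line = list("ab-cd  ef")
long_strings = segment_long_strings(line)
```

Note that the hyphen-linked string stays in one piece at this stage; dividing it is the job of the extraction step.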
In step S302 (extraction step), character string units are extracted from each of the at least one long character string according to the special characters contained in the long character string. In a document with mixed main-language and secondary-language words, a long character string can in most cases contain several words belonging to different languages, and the words of different languages within a long character string are always linked by special symbols such as hyphens. The extraction step can therefore divide a long, possibly mixed-language character string into shorter character string units, each belonging to a single language. This makes it easier to determine the language of each part of the long character string, so the long character string can be recognized more reliably and accurately instead of being treated as belonging to one language. The operation of the extraction step is described below.
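A simplified sketch of the division into character string units follows. It assumes that the special symbols in the string have already been confirmed and uses a three-symbol list taken from the examples in the text; the function name and the sample string are hypothetical.

```python
# Assumed special symbol list, based on the examples given in the text.
SPECIAL_SYMBOLS = {",", ".", "-"}

def extract_units(long_string):
    """Divide a long character string into units at special symbols."""
    units, current = [], []
    for ch in long_string:
        if ch in SPECIAL_SYMBOLS:
            if current:                      # close the unit in progress
                units.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                              # flush the trailing unit
        units.append("".join(current))
    return units

# Latin "USB" linked by a hyphen to a Russian word.
mixed = "USB-\u043a\u0430\u0431\u0435\u043b\u044c"
units = extract_units(mixed)
```

Each resulting unit can then be assigned a language on its own, instead of forcing one language onto the whole hyphenated string.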
In step S303 (recognition step), the document image is recognized based on each extracted character string unit. Based on the result of the extraction step, the long character strings, and hence the whole document image, can be recognized reliably and accurately. The recognition technique is not specifically limited.
Fig. 4 is a block diagram showing the device of the first embodiment for recognizing a document image. The device 400 may comprise a segmentation unit 401 configured to segment the document image into at least one (usually several) long character string, an extraction unit 402 configured to extract character string units from each of the at least one long character string according to the special characters contained in the long character string, and a recognition unit 403 configured to recognize the document image based on each extracted character string unit.
The extraction step is described in detail with reference to Fig. 5. In step S501 (match determining step), a character having at least one candidate that is a symbol and that matches a special symbol contained in a predetermined list of special symbols is determined according to the codes of the candidates of each character contained in the long character string. More specifically, the match determination first selects the characters in the long character string whose first candidate is a secondary-language symbol (that is, a Latin symbol), and then compares the codes of this first candidate and of the subsequent candidates of the selected character with the codes in the special symbol list. If the code of a candidate of the character is found in the special symbol list, the candidate matches the special symbol corresponding to that code, and the character may be that special symbol. However, a character may have several candidates, and, owing to the limited precision of preliminary OCR, the codes of several of them may all be found in the special symbol list; that is, the character may match several different special symbols.
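The code comparison in step S501 can be sketched as follows. The special symbol list below contains only the three symbols named in the text, and `is_latin_symbol` is an assumed stand-in (printable ASCII punctuation) for the test of whether a first candidate is a secondary-language symbol.

```python
# Assumed special symbol list: code -> symbol.
SPECIAL_SYMBOL_CODES = {0x2C: ",", 0x2D: "-", 0x2E: "."}

def is_latin_symbol(code):
    """Assumption: printable ASCII punctuation counts as a Latin symbol."""
    return 0x21 <= code <= 0x7E and not chr(code).isalnum()

def matching_special_symbols(candidate_codes):
    """Return the special symbols matched by one character's candidates.

    candidate_codes: candidate codes of the character, best candidate first.
    An empty result means the character is not a possible special symbol.
    """
    if not candidate_codes or not is_latin_symbol(candidate_codes[0]):
        return []
    return [SPECIAL_SYMBOL_CODES[code]
            for code in candidate_codes
            if code in SPECIAL_SYMBOL_CODES]

# First and second candidates both match, so the result is ambiguous.
ambiguous = matching_special_symbols([0x2D, 0x2E, 0x5F])
```

When more than one symbol is returned, the candidate codes alone cannot settle the question, which is why a further check on the character image is needed.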
Figs. 7A and 7B are explanatory diagrams of special symbols, where Fig. 7A shows an example of a long character string containing a special symbol and Fig. 7B shows the candidates of the character of that special symbol. As shown in Figs. 7A and 7B, both the first and the second candidate of the character match symbols in the special symbol list.
If there is any candidate that matches a symbol in the predetermined list of special symbols, the process advances to step S502 to determine more precisely which special symbol the character actually is; otherwise, the long character string is treated as a character string unit belonging to a single language.
In step S502 (detecting step), the special symbol to which the character with matching candidates corresponds is determined by comparing the image geometry features of the character with those of each of the matched special symbols: when the image geometry features of the character lie within the threshold ranges of the image geometry features of a special symbol, the character is regarded as that special symbol. More specifically, for the candidates that match symbols in the special symbol list, it is determined whether the image geometry features of the character to which the matching candidates belong are consistent with the image geometry features of any of those symbols in the special symbol list.
If the image geometry features of the character with matching candidates are consistent with those of any of the matched special symbols in the special symbol list, the matched character in the long character string is regarded as a special symbol, and the long character string can be divided into character string units at that special symbol. Otherwise, the long character string is treated as a character string unit belonging to a single language.
The image geometry features of a character may include the width/height ratio of the character image, the distance from the bottom of the character image to a first reference line, and the distance from the top of the character image to a second reference line. Note that the image geometry features are not limited to these.
In the above image geometry features, the first and second reference lines may be related to the character string that contains the character. For example, the first reference line may be the bottom boundary line of the character string and the second reference line may be its top boundary line, but the first and second reference lines are not limited to these.
Fig. 8 schematically illustrates the image geometry features of a character, showing the image geometry features of a possible special symbol on a character string image.
The image geometry features of the symbols in the special symbol list are determined in advance. The width/height ratio ranges of '-' and '.' are set to [1.5, 5.0] and [0.7, 1.3], respectively. For '-' (character code 0x2D as defined in ISO-8859), the threshold range of the distance from the bottom of the character image to the bottom boundary of the line is set to [line height * 0.350, line height * 0.691]. For '.' (character code 0x2E as defined in ISO-8859), the threshold range of the distance from the top of the character image to the top boundary of the line is set to [line height * 0.580, line height * 0.912]. Here, the line height in the above threshold ranges is the actual height of the line.
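The threshold checks above can be written out as a short sketch. It assumes image coordinates in which y grows downward, bounding boxes given by their top and bottom y values plus a width, and a character box of positive height; the function names and the sample boxes are hypothetical, while the numeric ranges are the ones stated in the text.

```python
def looks_like_hyphen(char_top, char_bottom, char_width, line_top, line_bottom):
    """Check the '-' thresholds: width/height ratio and distance to line bottom."""
    line_height = line_bottom - line_top
    ratio = char_width / (char_bottom - char_top)
    bottom_gap = line_bottom - char_bottom   # character bottom to line bottom
    return (1.5 <= ratio <= 5.0
            and 0.350 * line_height <= bottom_gap <= 0.691 * line_height)

def looks_like_full_stop(char_top, char_bottom, char_width, line_top, line_bottom):
    """Check the '.' thresholds: width/height ratio and distance to line top."""
    line_height = line_bottom - line_top
    ratio = char_width / (char_bottom - char_top)
    top_gap = char_top - line_top            # character top to line top
    return (0.7 <= ratio <= 1.3
            and 0.580 * line_height <= top_gap <= 0.912 * line_height)

# Made-up boxes on a line spanning y = 0 .. 100.
hyphen_box = dict(char_top=45, char_bottom=55, char_width=30)
stop_box = dict(char_top=88, char_bottom=100, char_width=12)
hyphen_ok = looks_like_hyphen(line_top=0, line_bottom=100, **hyphen_box)
stop_ok = looks_like_full_stop(line_top=0, line_bottom=100, **stop_box)
```

A wide, flat mark in the middle of the line passes only the hyphen test, and a small, roughly square mark sitting on the baseline passes only the full stop test, which is how the geometry resolves a character whose candidates match both symbols.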
Fig. 6 illustrates the configuration of the extraction element 402 in the first embodiment.Extraction element 402 can comprise coupling determining device 601 and pick-up unit 602, this coupling determining device 601 is configured to determine the character with at least one following candidate according to the candidate's of the each character comprising in long character string code, each candidate's is-symbol in this at least one candidate and mating with the special symbol comprising in the predetermined list of the special symbol comprising in this length character string, this pick-up unit 602 is configured to by which special symbol is the image geometry feature of character of the candidate with coupling compared to the character of determining this candidate with coupling corresponding to each the image geometry feature in corresponding coupling special symbol.
[good result]
By said method, can reliably and exactly identify and there is the main language of mixing and the file and picture of less important language letter.
In general,, in multilingual document, the word of some different languages can be linked to form a long character string by special symbol.And these new long character strings are always identified as a word and are not divided.Therefore, whole long character string will be identified based on a kind of language, in the case some characters of this length character string then this length character string will be identified improperly.As a comparison, the method of the first embodiment can be extracted the character string unit comprising in each in multiple long character strings according to special symbol, therefore identifying object is confined to respectively belong to monolingual character string unit, and can comprises that the each several part of this length character string of the part of two or more language can be identified exactly with corresponding language.Therefore, so long character string can reliably and exactly be identified.
[Second embodiment]
A second embodiment of the present invention is described with reference to Figs. 9A to 16. The second embodiment differs from the first embodiment in the identification step, that is, the process of recognizing the document image based on the character string units. Steps and parts of the second embodiment that are similar to those of the first embodiment are therefore omitted and not described in detail.
In general, similar characters in different alphabets have the same or similar shapes, and the recognition result of the document image deteriorates if they are not identified correctly. The prior art does not disclose any step or apparatus for distinguishing languages on the basis of similar characters. By contrast, the method of the second embodiment determines the language of a character string unit from the judgment results for the similar characters in that unit. The language of the character string unit can therefore be determined quickly and efficiently, so that the long character string containing the unit, and hence the whole document image, can be recognized accurately and efficiently.
Figs. 9A and 9B are flowcharts illustrating the document image recognition method of the second embodiment; Fig. 9B shows that the language determining step of Fig. 9A includes a code-based determining step.
In step S901 (judgment step) shown in Fig. 9A, each character in the character string unit is judged to be a similar character or a non-similar character, based on the candidate codes of the character and a similar character table. A character is regarded as a similar character when its first candidate that is neither a symbol nor a digit is contained in the similar character table; otherwise it is a non-similar character. The similar character table is described in detail below.
In step S902 (language determining step), the language of the character string unit is determined based on the result of the judgment step. Step S902 is described in detail below.
The similar character table is now described with reference to Figs. 10A and 10B and Figs. 11A and 11B. The table is used to determine whether a character in a character string unit is a similar character, that is, a character for which one or more characters of the other language have the same or a similar shape. The structure of the table is shown in Figs. 10A, 10B, 11A and 11B. As can be seen, the similar character table is in fact a pair of mutually corresponding sub-tables: one sub-table consists of the similar characters of the main language, and the other consists of the similar characters of the secondary language, in one-to-one correspondence with those of the main language.
For example, Figs. 10A and 10B show the similar character table for the case where the main language is Russian and the secondary language is Latin. Fig. 10A shows the Russian similar character sub-table Rus[] and presents the codes and shapes of the similar characters in Russian. Fig. 10B shows the Latin similar character sub-table Latin_Rus[] and presents the codes and shapes of the corresponding similar characters in Latin. Each character in Latin_Rus[] corresponds to a character in Rus[], and two corresponding characters have similar shapes but different codes.
Figs. 11A and 11B show the similar character table for the case where the main language is Greek and the secondary language is Latin. Fig. 11A shows the Greek similar character sub-table Grk[], and Fig. 11B shows the Latin similar character sub-table Latin_Grk[]. These two sub-tables likewise list the similar characters of Greek and Latin.
The similar character table is generated by examining the characters in the alphabets of the main language (such as Russian or Greek) and the secondary language (such as Latin) and selecting the character pairs that have similar or identical shapes. In addition, the characters in the similar character table can be adjusted based on recognition results for document images that mix the main and secondary languages in their commonly used fonts.
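The judgment step and the paired sub-tables can be sketched as follows. This is an illustration under stated assumptions, not the tables of Figs. 10A and 10B: the contents of `RUS` and `LATIN_RUS` are a hand-picked set of look-alike Cyrillic and Latin letters, Unicode strings stand in for the single-byte codes, `str.isalpha` is used as an approximation of "neither a symbol nor a digit", and the function names are hypothetical.

```python
# Illustrative sub-tables (assumed contents; the real ones are in Figs. 10A/10B).
# The two strings correspond position by position.
RUS = "АВЕКМНОРСТХаеорсух"        # similar characters of the main language (Cyrillic)
LATIN_RUS = "ABEKMHOPCTXaeopcyx"  # corresponding secondary-language (Latin) characters

def first_letter_candidate(candidates):
    """Return the first candidate that is neither a symbol nor a digit, or None."""
    return next((c for c in candidates if c.isalpha()), None)

def is_similar(candidates):
    """Judge one character from its ranked list of candidates."""
    first = first_letter_candidate(candidates)
    return first is not None and (first in RUS or first in LATIN_RUS)
```

A character whose ranked candidates begin with a digit or a symbol is judged on the first letter candidate that follows, which mirrors the rule stated for step S901.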
The processing of step S902 is now described in detail with reference to Fig. 9B. As shown in Fig. 9B, the language determining step S902 includes step S902-1 (code-based determining step), in which the language of the character string unit is determined from the candidate codes of the characters judged in the judgment step.
The process of step S902-1 is described with reference to Fig. 12, which is a flowchart of the code-based determining step S902-1.
In step S1201 (first determining step), when the first non-symbol, non-digit candidate of every character in the character string unit is judged to be a similar character, the language of the character string unit is determined by using a secondary-language word dictionary (minor language word lexicon).
More specifically, the first candidate that is neither a symbol nor a digit is first selected for each character in the character string unit. If every selected first candidate is a similar character according to the similar character table, it is difficult to determine the language of the character string unit from the candidate codes alone. To recognize such a word, a dictionary of common secondary-language words is used to determine whether the character string unit, all of whose first candidates are similar characters, is a common secondary-language word. Where the main language is Russian or Greek and the secondary language is Latin, the secondary-language word dictionary consists of Latin words. For example, if Russian is the main language, the dictionary contains the common Latin words that occur in Russian documents. The dictionary-based determination of the language of a character string unit and the construction of the secondary-language word dictionary are described in detail below.
In step S1202 (second determining step), when the first non-symbol, non-digit candidate of every character in the character string unit is judged to be a non-similar character of the main language, the language of the character string unit is determined to be the main language. For example, if every selected first candidate is a non-similar Russian character according to the similar character table, the language of the character string unit is determined to be Russian.
In step S1203 (third determining step), when the first non-symbol, non-digit candidates of the characters in the character string unit are neither all similar characters nor all non-similar characters of the main language, the language of the character string unit is determined based only on the non-similar characters contained in the unit. Because the number of non-similar characters is usually small, the computational cost is greatly reduced.
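The choice among the three determining steps can be sketched as a simple dispatch on the first letter candidates. This sketch assumes Russian as the main language, an illustrative `RUS`/`LATIN_RUS` table that is not taken from the figures, and Unicode characters in place of single-byte codes; `choose_step` and the helper names are hypothetical.

```python
# Assumed look-alike tables (illustrative, not the tables of Figs. 10A/10B).
RUS = "АВЕКМНОРСТХаеорсух"
LATIN_RUS = "ABEKMHOPCTXaeopcyx"

def is_similar(ch):
    """True if the character appears in either sub-table."""
    return ch in RUS or ch in LATIN_RUS

def is_nonsimilar_main(ch):
    """True for a Cyrillic letter that has no look-alike in the table."""
    return "\u0400" <= ch <= "\u04ff" and ch not in RUS

def choose_step(first_candidates):
    """Return which determining step applies to the unit's first letter candidates."""
    if all(is_similar(c) for c in first_candidates):
        return "S1201"  # all similar: consult the secondary-language word dictionary
    if all(is_nonsimilar_main(c) for c in first_candidates):
        return "S1202"  # all non-similar main-language characters: main language
    return "S1203"      # mixed: decide from the non-similar characters only
```

For instance, a unit whose first candidates spell "или" falls under S1202, while a unit such as "кот" mixes similar and non-similar characters and falls under S1203.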
Next, the generation of the common secondary-language word dictionary is described for the case where Russian is the main language. The Latin words in the dictionary for Russian documents are generated as follows. First, the Latin words that occur in a prepared set of Russian documents and consist entirely of similar characters in Latin_Rus[] are collected, and their numbers of occurrences are recorded. Second, for each collected Latin word, the corresponding similar characters in Rus[] are selected to form the corresponding Russian word, and the number of times that word occurs in the same Russian documents is also recorded. If the number of occurrences of the Latin word is greater than that of the corresponding Russian word and is greater than or equal to a predetermined threshold TH, the Latin word is included in the common secondary-language word dictionary for Russian. The threshold TH may be 5, but any other value may be used.
For example, the Latin word "PM" (codes 0x50, 0x4d in ISO-8859) is found 5 times in the prepared Russian documents. Based on this Latin word, the corresponding similar Russian characters in Rus[] (codes 0xc0, 0xbc in ISO-8859-5) are selected to form the corresponding Russian word "РМ", and its number of occurrences in the same Russian documents is recorded. Because the Russian word "РМ" occurs 0 times, which is fewer than the 5 occurrences of the Latin word, and 5 reaches the threshold TH = 5, the word "PM" is stored in the secondary-language word dictionary for Russian.
Similarly, for Greek documents, in which Greek is the main language, the common secondary-language word dictionary for Greek/Latin can be generated from Grk[] and Latin_Grk[] by the same steps.
The generation of the secondary-language word dictionary applies equally to other combinations of main language and secondary language.
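The dictionary generation described above can be sketched with a word counter. This is a minimal illustration, assuming the prepared documents are already tokenized into a flat list of Unicode words; the table contents, the function name `build_lexicon` and the sample corpus are hypothetical.

```python
from collections import Counter

# Assumed look-alike tables (illustrative, not the tables of Figs. 10A/10B).
RUS = "АВЕКМНОРСТХаеорсух"
LATIN_RUS = "ABEKMHOPCTXaeopcyx"
TO_RUS = str.maketrans(LATIN_RUS, RUS)  # Latin similar character -> Russian one

def build_lexicon(document_words, threshold=5):
    """Collect Latin words made only of similar characters that are common enough."""
    counts = Counter(document_words)
    lexicon = set()
    for word, latin_count in counts.items():
        if not all(ch in LATIN_RUS for ch in word):
            continue  # not composed entirely of Latin similar characters
        # Counter returns 0 for a word that never occurs.
        russian_count = counts[word.translate(TO_RUS)]
        if latin_count > russian_count and latin_count >= threshold:
            lexicon.add(word)
    return lexicon

# Hypothetical corpus: "PM" occurs 5 times, its Cyrillic look-alike never.
lexicon = build_lexicon(["PM"] * 5 + ["или"] * 12)
```

With the default threshold of 5, the word "PM" is kept because it occurs 5 times while its Cyrillic counterpart occurs 0 times, matching the worked example in the text.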
The determination of the language of a character string unit based on the secondary-language word dictionary is now described with reference to Fig. 13, which is a flowchart of the first determining step of the second embodiment.
In step S1301, the first non-symbol, non-digit candidate of each character in the character string unit is replaced with its corresponding secondary-language similar character to form a secondary-language word. For example, in the Russian/Latin case, if a character in the character string unit is a similar Russian character, it is replaced with the corresponding similar character in the sub-table Latin_Rus[], so that a replaced character string unit consisting entirely of secondary-language characters (such as Latin characters) is obtained. If the first non-symbol, non-digit candidate of a character is already a similar character of the secondary language, that candidate is used without replacement.
In step S1302, the secondary-language word obtained by the replacement is compared with the words in the secondary-language word dictionary to determine whether it matches any of them.
If the replaced secondary-language word matches a word in the secondary-language word dictionary, the language of the character string unit is determined to be the secondary language; otherwise, the language of the character string unit is determined to be the main language.
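Steps S1301 and S1302 can be sketched as a translation followed by a set lookup. The sketch assumes an illustrative look-alike table and Unicode characters in place of single-byte codes; `language_by_lexicon` is a hypothetical name, and the dictionary passed in is assumed to have been prepared beforehand.

```python
# Assumed look-alike tables (illustrative, not the tables of Figs. 10A/10B).
RUS = "АВЕКМНОРСТХаеорсух"
LATIN_RUS = "ABEKMHOPCTXaeopcyx"
TO_LATIN = str.maketrans(RUS, LATIN_RUS)  # Russian similar character -> Latin one

def language_by_lexicon(first_candidates, lexicon):
    """Decide the language of a unit whose first candidates are all similar characters."""
    # S1301: replace main-language similar characters with their Latin counterparts;
    # candidates that are already Latin pass through unchanged.
    replaced = "".join(first_candidates).translate(TO_LATIN)
    # S1302: compare the replaced word with the secondary-language word dictionary.
    return "secondary" if replaced in lexicon else "main"
```

Because the dictionary holds only a few words, a hash-set lookup keeps the cost of this step negligible, in line with the small-dictionary argument made later in the text.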
The third determining step is now described with reference to Fig. 14. The third determining step may include a step (step S1401) of determining the language of the character string unit based on a plurality of non-symbol, non-digit candidates of each non-similar character. When at least the first three non-symbol, non-digit candidates of a non-similar character are all non-similar characters of the main language, the language of the character string unit is determined to be the main language.
More specifically, if the language of the character string unit cannot be determined from the selected first candidates alone, a plurality of candidates of each non-similar character, comprising the selected first candidate and its subsequent candidates, are checked to determine whether, for at least one character, the selected first candidate and its subsequent candidates are all non-similar main-language characters. If at least one character can be found whose candidates are all non-similar characters of the main language (such as Russian), the language of the character string unit is set to the main language.
The number of candidates used to determine the language is not specifically limited, but is usually not less than 3. That is, the candidates of a non-similar character used to determine the language usually include its first three non-symbol, non-digit candidates, starting with the first.
For example, as shown in Fig. 15, all the candidates (for example, the first three candidates) of the last character are non-similar Russian characters, so the language of the character string unit is set to Russian.
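Step S1401 can be sketched as a scan over the ranked candidates of each non-similar character. The sketch assumes Russian as the main language, an illustrative look-alike table, and a depth of three candidates; the function names are hypothetical, and returning `None` when no character qualifies is an assumption about how an undecided result might be represented.

```python
# Assumed table of Russian similar characters (illustrative only).
RUS = "АВЕКМНОРСТХаеорсух"

def is_nonsimilar_main(ch):
    """True for a Cyrillic letter that has no look-alike in the table."""
    return "\u0400" <= ch <= "\u04ff" and ch not in RUS

def language_by_candidates(nonsimilar_chars, depth=3):
    """Return "main" if some character's first `depth` letter candidates are all
    non-similar main-language characters; otherwise return None (undecided)."""
    for candidates in nonsimilar_chars:
        # Keep only candidates that are neither symbols nor digits.
        letters = [c for c in candidates if c.isalpha()][:depth]
        if len(letters) == depth and all(is_nonsimilar_main(c) for c in letters):
            return "main"
    return None
```

Only the non-similar characters of the unit are passed in, each as its ranked candidate list, so the amount of work stays proportional to their usually small number.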
As shown in Fig. 16, the identification step further includes a step (correction step) of correcting the characters contained in the character string unit according to the similar character table and the determined language. When the language of the first non-symbol, non-digit candidate of a similar character in the character string unit is inconsistent with the determined language, the similar character is replaced with the corresponding similar character of the determined language contained in the similar character table; otherwise, the character is left unchanged.
More specifically, if the first non-symbol, non-digit candidate of a character in the character string unit does not belong to the determined language of the unit, it is then determined whether this character is a similar character. If so, the similar character corresponding to the determined language is looked up in the similar character table according to the code of the selected first candidate, and the character is replaced with the similar character found.
If the selected first candidate is a non-similar character, it is kept in the character string unit and no operation is performed.
The character string unit is thus finalized in the appropriate language, so that the long character string containing it, and hence the document image, can be recognized properly.
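The correction step can be sketched as a table-driven substitution. The sketch assumes an illustrative look-alike table and Unicode characters in place of single-byte codes; `correct_unit` and `TABLES` are hypothetical names.

```python
# Assumed look-alike tables (illustrative, not the tables of Figs. 10A/10B).
RUS = "АВЕКМНОРСТХаеорсух"
LATIN_RUS = "ABEKMHOPCTXaeopcyx"

# One translation table per determined language.
TABLES = {
    "main": str.maketrans(LATIN_RUS, RUS),       # Latin look-alike -> Russian
    "secondary": str.maketrans(RUS, LATIN_RUS),  # Russian look-alike -> Latin
}

def correct_unit(first_candidates, language):
    """Replace similar characters of the wrong language; leave all others as they are."""
    return "".join(first_candidates).translate(TABLES[language])
```

Characters that already belong to the determined language, and non-similar characters of either language, are absent from the keys of the chosen table and therefore pass through unchanged, as the text requires.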
[Advantageous effects]
In addition to the long character strings described above, which contain parts in different languages joined by special symbols, the method of the second embodiment can also be applied effectively and accurately at least to character string units composed entirely of distinctive (non-similar) characters of a single alphabet or entirely of similar characters.
For a character string unit composed entirely of similar characters, the method uses the similar character table and the common secondary-language word dictionary to determine the language of the unit.
Unlike an ordinary contextual dictionary, the secondary-language word dictionary used in this method does not need to contain a large number of words of the language concerned. It contains only the secondary-language words, composed entirely of similar characters, that occur frequently in main-language documents. For example, for a Russian (Greek) document Latin is the secondary language, and for a Latin document Russian (Greek) is the secondary language. Because the dictionary contains only these few special secondary-language words, it is small enough that the search time is negligible.
The similar character table used in this embodiment records only the codes of the similar characters and contains no additional information, such as occurrence frequencies, which are limited by the range and scale of the training data. No statistical information about how often the characters occur in documents is kept. The computational cost and storage cost are therefore greatly reduced.
Because the similar character table and the dictionary are small, the time cost of searching them is low.
The method can therefore determine, quickly, efficiently and at little computational cost, the language of a word composed entirely of distinctive characters of a single alphabet or entirely of similar characters, based on the similar character table or the common secondary-language word dictionary.
In summary, the method of this embodiment reduces the time cost, mainly because the language is determined from the similar character table and the secondary-language word dictionary: no numerical calculation is involved, only searches in two small tables and/or a small dictionary.
[Third embodiment]
A third embodiment of the present invention is described with reference to Figs. 17 to 19. The third embodiment differs from the second embodiment in the language determining step. More specifically, the language determining step in the method of the third embodiment may further include determining the language based on the confidences of the non-similar characters in the character string unit. Steps and parts of the third embodiment that are similar to those of the second embodiment are therefore omitted and not described in detail again.
For a character string unit composed of both similar and non-similar characters, the prior art cannot determine the language of the unit efficiently and reliably. By contrast, the method of the third embodiment uses only the confidences of the non-similar characters of the character string unit to determine its language. Because the method concentrates on the non-similar characters, which are usually few, the time cost is reduced, and a determination based on the confidences of non-similar characters is reliable. The method of the third embodiment can therefore determine the language of the character string unit efficiently and reliably.
Fig. 17 is a flowchart of the language determining step according to the third embodiment. As shown in Fig. 17, the language determining step further includes step S902-2 (confidence-based determining step), in which the language of the character string unit is determined from the confidences of the characters judged in the judgment step.
The process of step S902-2 is now described in detail with reference to Fig. 18.
The confidence-based determining step may include step S1802 (confidence sum calculation step) and step S1803. In step S1802, the sum of the maximum main-language confidences of the non-similar characters in the character string unit and the sum of the maximum secondary-language confidences of those characters are calculated. In step S1803, the ratio of the sum of the maximum secondary-language confidences to the sum of the maximum main-language confidences is compared with a first threshold to determine the language of the character string unit. When the ratio is less than the first threshold, the language of the character string unit is determined to be the main language; otherwise, it is determined to be the secondary language.
In step S1802, for each non-similar character in the character string unit, the maximum confidence among all its Russian (main language) candidates is obtained and these maxima are summed; likewise, the maximum confidence among all its Latin (secondary language) candidates is obtained and these maxima are summed. The two sums can be computed in parallel. The confidence sums of the non-similar characters are calculated as follows:
$$\mathrm{Sum}_{\mathrm{LatinConfidence}} = \sum_{i=1}^{n} \max_{1 \le j \le m} \mathrm{LatinConf}_{ij}$$

$$\mathrm{Sum}_{\mathrm{RussianConfidence}} = \sum_{i=1}^{n} \max_{1 \le k \le m} \mathrm{RusConf}_{ik}$$

$$\mathrm{Sum}_{\mathrm{GreekConfidence}} = \sum_{i=1}^{n} \max_{1 \le l \le m} \mathrm{GrkConf}_{il}$$

Here, $n$ is the number of non-similar characters in the character string unit, and $m$ is the number of candidates of each non-similar character. $\mathrm{LatinConf}_{ij}$ is the recognition confidence of the $j$-th Latin candidate of the $i$-th non-similar character in the character string unit. $\mathrm{RusConf}_{ik}$ is the recognition confidence of the $k$-th Russian candidate of the $i$-th non-similar character. $\mathrm{GrkConf}_{il}$ is the recognition confidence of the $l$-th Greek candidate of the $i$-th non-similar character.
Fig. 19 shows an example of the image of a character string together with the corresponding candidate codes and candidate confidences. As shown in Fig. 19, there are two non-similar characters, whose first candidates are 0xd8 and 0xdb, respectively. Fig. 19 shows the first three candidates of each character sorted by confidence; the first two candidates are Russian characters and the third candidate is a Latin character. The first candidate therefore has the maximum main-language confidence, and the third candidate has the maximum secondary-language confidence. For the non-similar character whose first candidate is 0xd8, the maximum main-language (Russian) confidence is 21 and the maximum secondary-language (Latin) confidence is 5. For the other non-similar character, the maximum main-language (Russian) confidence is 32 and the maximum secondary-language (Latin) confidence is 8. The confidence sums for Latin and Russian are therefore calculated as follows:

$$\mathrm{Sum}_{\mathrm{LatinConfidence}} = 5 + 8 = 13$$

$$\mathrm{Sum}_{\mathrm{RussianConfidence}} = 21 + 32 = 53$$
If a non-similar character has no Russian candidate or no Latin candidate, its maximum Russian confidence or maximum Latin confidence, respectively, is set to 0.
In step S1803, the sum of the main-language confidences is compared with the sum of the secondary-language confidences to determine the language of the character string unit. More specifically, the ratio of the sum of the maximum secondary-language confidences to the sum of the maximum main-language confidences is compared with a threshold to determine the language of the character string unit.
Two thresholds are estimated in advance to determine the language of the character string unit. The preferred value of the threshold for distinguishing Latin from Russian is set to TH1 (for example, TH1 = 7.0), and the preferred value of the threshold for distinguishing Latin from Greek is set to TH2 (for example, TH2 = 2.0).
Accordingly, where the main language is Russian and the secondary language is Latin, the ratio of $\mathrm{Sum}_{\mathrm{LatinConfidence}}$ to $\mathrm{Sum}_{\mathrm{RussianConfidence}}$ is calculated for the non-similar characters in the character string unit. If the ratio is greater than TH1, the language of the character string unit is set to Latin; otherwise, it is set to Russian. In the example above, the ratio of the confidence sums is 13/53 ≈ 0.25 < TH1, so the language of the character string unit is Russian.
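Steps S1802 and S1803 can be sketched as follows, using the confidences quoted for Fig. 19. The data layout (a list of `(language, confidence)` pairs per character) and the function names are hypothetical, and the confidences 17 and 25 given to the second candidates are invented placeholders, since the text states only the maxima. The comparison is written as a product to avoid dividing by a zero main-language sum, and the boundary case (ratio equal to the threshold) follows the rule "less than the threshold means main language".

```python
def max_confidence_sums(nonsimilar_chars):
    """S1802: sum, over the non-similar characters, the maximum confidence per language.
    A character with no candidate in a language contributes 0 for that language."""
    sums = {"main": 0, "secondary": 0}
    for candidates in nonsimilar_chars:
        for language in sums:
            sums[language] += max(
                (conf for lang, conf in candidates if lang == language), default=0
            )
    return sums

def language_by_confidence(nonsimilar_chars, threshold):
    """S1803: main language if secondary_sum / main_sum < threshold, else secondary."""
    sums = max_confidence_sums(nonsimilar_chars)
    # Equivalent to the ratio test, but safe when the main-language sum is 0.
    return "main" if sums["secondary"] < threshold * sums["main"] else "secondary"

# Fig. 19 example; the second-candidate confidences (17, 25) are placeholders.
FIG19 = [
    [("main", 21), ("main", 17), ("secondary", 5)],
    [("main", 32), ("main", 25), ("secondary", 8)],
]
sums = max_confidence_sums(FIG19)                        # main: 53, secondary: 13
language = language_by_confidence(FIG19, threshold=7.0)  # "main" (Russian)
```

With TH1 = 7.0 the ratio 13/53 is well below the threshold, so the sketch reaches the same conclusion as the text: the unit is Russian.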
Where the main language is Greek and the secondary language is Latin, the process can be carried out in the same way, and the language of the character string unit can be determined similarly.
As described above, the language of the extracted character string unit is determined based on the non-similar characters in the unit. More specifically, it is determined from the ratio of the sum of the maximum secondary-language confidences of the non-similar characters to the sum of their maximum main-language confidences.
The language of the character string unit can therefore be determined reliably and efficiently.
Thereafter, the character string unit can be subjected to the correction step so that it is finally expressed in the appropriate language, and the long character string containing the character string unit, and hence the document image, can be recognized properly.
It should be noted that the confidence-based determining step can also be combined with the code-based determining step, so that the language of the character string unit is determined quickly, efficiently and reliably.
As an example, the confidence-based determining step of the third embodiment can be performed when the language cannot be determined from the candidate codes of the characters in the character string unit alone.
More specifically, the language of the extracted character string unit is first determined from the candidate codes of the unit, based on the similar character table and the common secondary-language word dictionary. If the language cannot be determined from the candidate codes, it is determined only from the confidences of the non-similar characters in the character string unit, that is, from the ratio of the sum of the maximum secondary-language confidences of the non-similar characters to the sum of their maximum main-language confidences.
The language of the character string unit can therefore be determined rapidly, efficiently and reliably.
[Advantageous effects]
In addition to the long character strings described above, which contain parts in different languages joined by special symbols, and the character string units composed entirely of distinctive (non-similar) characters of a single alphabet or entirely of similar characters, the method of the third embodiment can further be applied effectively and accurately at least to character string units composed of a combination of similar and non-similar characters.
For a character string unit composed of a combination of similar and non-similar characters, the method uses only the confidences of the non-similar characters to determine the language of the unit, because a non-similar character generally has a high confidence value for the correct alphabet and a low confidence value for the incorrect alphabet. Determining the language from the sum of the confidences of all the non-similar characters keeps the decision correct even when one or more characters are recognized in the wrong language. Determining the language of such character string units from the confidence sums is therefore highly accurate.
In summary, with the method of this embodiment, determining the language based on the non-similar characters makes the determination more reliable, because the confidence sums of the non-similar characters yield a more accurate final language decision.
[Example 1]
Example 1 and comparative example 1 are described below with reference to Figs. 21A to 21C and Fig. 22. Figs. 21A to 21C show an example recognized by the method of the first embodiment, and Fig. 22 shows comparative example 1.
As shown in Fig. 21A, a part of the document image is divided into a plurality of long character strings at the space characters. One of the long character strings may contain words of different languages joined by a special symbol, which in this figure is a hyphen.
As shown in Fig. 21B, the long character string is divided into character string units according to the special symbol.
As shown in Fig. 21C, each of the character string units obtained by the division can be recognized accurately.
By contrast, the method of comparative example 1 treats a long character string containing words of different languages as a single word, does not divide it, and recognizes it as a whole. As a result, the long character string cannot be recognized accurately, as shown in Fig. 22. The character codes marked with a background in the figure are recognized incorrectly, because this method determines the language of the whole long character string to be Latin.
Therefore, the method for this embodiment can be determined the language of the various piece that can link by special symbol in long word symbol string location pin-point accuracy, and therefore can identify reliably long character string and identify reliably thus file and picture.
[Example 2]
Example 2 and comparative example 2 are described below with reference to Figs. 23A and 23B and Fig. 24. Figs. 23A and 23B show an example recognized by the method of the second embodiment of the present invention, and Fig. 24 shows comparative example 2.
For a word composed entirely of similar characters, conventional methods have no step for determining its specific language.
Fig. 23A shows the image of a word composed of two similar characters.
As shown in Fig. 23B, the method of the present application correctly determines the language of this word to be Latin, based on the common secondary-language word dictionary (where the secondary language is Latin).
By contrast, as shown in Fig. 24, comparative example 2 determines the language of the word to be Russian, because the word was detected in a Russian document. In fact, however, the language of this word is Latin.
Fig. 20 is a block diagram illustrating the overall exemplary configuration of a document image recognition apparatus according to the embodiments.
As shown in Fig. 20, the document image recognition apparatus 2000 may comprise a segmentation device 2001 configured to segment the document image into at least one long character string, an extraction device 2002 configured to extract character string units from each of the at least one long character string according to the special symbols contained in that long character string, and a recognition device 2003 configured to recognize the document image based on each character string unit.
Preferably, the extraction device 2002 may comprise a matching determination device 2002-1 and a detection device 2002-2. The matching determination device 2002-1 is configured to determine, from the candidate codes of each character in a long character string, the characters that have at least one candidate which is a symbol and which matches a special symbol in the predetermined list of special symbols. The detection device 2002-2 is configured to determine which special symbol such a character is, by comparing the image geometric features of the character with the image geometric features corresponding to each of the matching special symbols.
Preferably, the recognition device 2003 may comprise a judgment device 2004 and a language determining device 2005. The judgment device 2004 is configured to judge whether each character in a character string unit is a similar character or a non-similar character, based on the candidate codes of the character and the similar character table. The language determining device 2005 is configured to determine the language of the character string unit based on the result obtained by the judgment device 2004.
Preferably, this language determining device 2005 can comprise the determining device 2005A based on code, the code that this determining device 2005A based on code is configured to the candidate of the character in the character string based on being determined by decision maker 2004 is determined the language of character string unit, and the determining device 2005A based on code can comprise the first determining device 2005-1, the second determining device 2005-2 and the 3rd determining device 2005-3, the first is-not symbol and nonnumeric candidate that this first determining device 2005-1 is configured to the each character comprising in character string unit is judged as similar character, determine the language of character string unit by utilizing less important language word dictionary, the first is-not symbol and nonnumeric candidate that this second determining device 2005-2 is configured to the each character comprising in character string unit is judged as the non-similar character in main language, the language of determining this character string unit is main language, and the first is-not symbol and nonnumeric candidate that the 3rd determining device 2005-3 is configured to the each character comprising in character string unit is all not the non-similar character in similar character or main language, only the non-similar character based on comprising in character string unit is determined the language of character string unit.
Preferably, the first determining device 2005-1 may comprise a replacement device 2005-11 and a comparison device 2005-12. The replacement device 2005-11 is configured to replace the first non-symbol, non-digit candidate of each character included in the character string unit with its corresponding secondary-language similar character so as to form a secondary-language word. The comparison device 2005-12 is configured to compare the secondary-language word obtained by the replacement with the words in the secondary-language word dictionary to determine whether they match.
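The replacement and comparison can be sketched as below, following the rule of claim 7 that a dictionary match yields the secondary language and a mismatch yields the main language. The Cyrillic/Latin pairing, the tiny dictionary `SECONDARY_WORDS` and the function names are assumptions for illustration, as is the case-insensitive lookup.

```python
# Assumed mapping from main-language similar characters (Cyrillic) to their
# secondary-language look-alikes (Latin).
MAIN_TO_SECONDARY = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0441": "c", "\u0445": "x",
}

# Hypothetical secondary-language word dictionary.
SECONDARY_WORDS = {"cope", "pace", "ape", "expo", "coax"}


def to_secondary_word(first_candidates):
    """Replacement: map every similar character to its secondary-language form."""
    return "".join(MAIN_TO_SECONDARY.get(c, c) for c in first_candidates)


def language_of_all_similar_unit(first_candidates):
    """Comparison: a dictionary match means the secondary language,
    otherwise the unit is taken to be in the main language."""
    word = to_secondary_word(first_candidates).lower()
    return "secondary" if word in SECONDARY_WORDS else "main"
```

For example, a unit recognized as Cyrillic es, Latin o, Cyrillic er, Latin e is rewritten to "cope", which is found in the assumed dictionary.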
Preferably, the third determining device 2005-3 may comprise a device 2005-31 configured to determine the language of the character string unit based on multiple non-symbol, non-digit candidates of each non-similar character in the character string unit.
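A hedged sketch of this multi-candidate check follows. Per claim 8, a unit is determined to be in the main language when at least the first three non-symbol, non-digit candidates of a non-similar character are all non-similar characters of the main language. Applying the test to every non-similar character, returning `None` when the test fails, and assuming Cyrillic as the main language are choices made only for this illustration.

```python
# Assumed main-language (Cyrillic) letters that have Latin look-alikes.
SIMILAR_MAIN = {"\u0430", "\u0435", "\u043e", "\u0440", "\u0441", "\u0445"}


def is_main_non_similar(ch):
    """True for a Cyrillic letter that is not in the similar character set."""
    return "\u0400" <= ch <= "\u04ff" and ch not in SIMILAR_MAIN


def letter_candidates(candidates, n=3):
    """First n candidates that are neither symbols nor digits."""
    return [c for c in candidates if c.isalpha()][:n]


def decide_by_top_candidates(non_similar_chars, n=3):
    """Return "main" when the first n letter candidates of every non-similar
    character are main-language non-similar characters, otherwise None
    (undecided in this sketch)."""
    if not non_similar_chars:
        return None
    for candidates in non_similar_chars:
        top = letter_candidates(candidates, n)
        if len(top) < n or not all(is_main_non_similar(c) for c in top):
            return None
    return "main"
```

An undecided result could, for instance, be passed on to a confidence-based determination; that hand-off is an assumption and not something this paragraph specifies.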
Preferably, the language determining device 2005 may also comprise a confidence-based determining device 2005B, which may be configured to determine the language of the character string unit based on the confidences of the characters in the character string unit judged by the judgment device 2004. The confidence-based determining device 2005B may further comprise a confidence sum determining device 2005-32 and a confidence sum comparison device 2005-33. The confidence sum determining device 2005-32 is configured to calculate the sum of the main-language maximum confidences of the non-similar characters in the character string unit and the sum of the secondary-language maximum confidences of those non-similar characters. The confidence sum comparison device 2005-33 is configured to compare the ratio of the sum of the secondary-language maximum confidences to the sum of the main-language maximum confidences with a first threshold in order to determine the language of the character string unit.
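The confidence-based determination can be sketched as follows, using the rule of claim 10 that a ratio below the first threshold yields the main language and any other ratio yields the secondary language. The data layout (each non-similar character as a list of `(language, confidence)` candidates), the default threshold of 1.0 and the handling of a zero main-language sum are assumptions, not values taken from the patent.

```python
def max_confidence(candidates, language):
    """Highest confidence among a character's candidates in one language,
    or 0.0 when the character has no candidate in that language."""
    return max((conf for lang, conf in candidates if lang == language),
               default=0.0)


def decide_by_confidence(non_similar_chars, first_threshold=1.0):
    """Compare sum(secondary max confidences) / sum(main max confidences)
    with the first threshold."""
    main_sum = sum(max_confidence(c, "main") for c in non_similar_chars)
    secondary_sum = sum(max_confidence(c, "secondary") for c in non_similar_chars)
    if main_sum == 0.0:
        # Assumption: with no main-language evidence, choose the secondary language.
        return "secondary"
    ratio = secondary_sum / main_sum
    return "main" if ratio < first_threshold else "secondary"
```

With the two characters `[("main", 0.9), ("secondary", 0.4)]` and `[("main", 0.8), ("secondary", 0.5), ("secondary", 0.6)]`, the sums are about 1.7 and 1.0, so the ratio is roughly 0.59 and the result depends on where the first threshold is set.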
It should be noted that although Figure 20 shows the language determining device 2005 as comprising both the code-based determining device 2005A and the confidence-based determining device 2005B, this is merely exemplary. The two devices may be included in the language determining device 2005 independently or in combination; that is, the language determining device 2005 may comprise the code-based determining device 2005A alone, the confidence-based determining device 2005B alone, or both of them.
Preferably, the recognition device 2003 may further comprise a correcting device 2006, which is configured to correct the similar characters included in the character string unit according to the similar character table and the determined language.
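The correction can be sketched as a table lookup applied after the language of the unit is known: a similar character whose first candidate does not belong to the determined language is swapped for its counterpart in that language. The Cyrillic/Latin table and the names `MAIN_TO_SECONDARY` and `correct_unit` are assumptions for illustration.

```python
# Assumed similar character table (Cyrillic main language, Latin secondary).
MAIN_TO_SECONDARY = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0441": "c", "\u0445": "x",
}
SECONDARY_TO_MAIN = {latin: cyr for cyr, latin in MAIN_TO_SECONDARY.items()}


def correct_unit(first_candidates, language):
    """Replace each similar character that is in the wrong language with the
    corresponding similar character of the determined language. Characters
    already in that language, and non-similar characters, are left unchanged."""
    table = SECONDARY_TO_MAIN if language == "main" else MAIN_TO_SECONDARY
    return "".join(table.get(c, c) for c in first_candidates)
```

For a unit determined to be in the secondary language, the mixed sequence Cyrillic es, Latin o, Latin p, Latin e is corrected to the purely Latin word "cope".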
The devices described above are exemplary preferred modules for implementing the process described above. The modules for implementing the individual steps have not been described exhaustively. However, wherever there is a step for performing a certain process, there is a corresponding functional module or device for performing the same process.
In addition, the method and system of the present invention can be carried out in many ways, for example by software, hardware, firmware, or any combination thereof. The order of the steps of the method described above is merely illustrative, and unless otherwise specifically stated, the steps of the method of the present invention are not limited to the order specifically described above. Furthermore, in some embodiments, the present invention may also be embodied as a program recorded in a recording medium, comprising machine-readable instructions for implementing the method according to the present invention. Accordingly, the present invention also covers a recording medium storing a program for implementing the method according to the present invention.
Although the present invention has been described with reference to example embodiments, it should be understood that the invention is not limited to the disclosed example embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

Claims (22)

1. A method for recognizing a document image in which letters of a main language and a secondary language are mixed, comprising:
a segmentation step of segmenting the document image into at least one long character string;
an extraction step of extracting character string units from each of the at least one long character string according to the special symbols included in the long character string; and
a recognition step of recognizing the document image based on each extracted character string unit.
2. The method according to claim 1, wherein the extraction step comprises:
a match determining step of determining, according to the codes of the candidates of each character included in the long character string, a character having at least one candidate each of which is a symbol and matches a special symbol included in a predetermined list of the special symbols included in the long character string; and
a detection step of determining which special symbol the character having the matching candidate corresponds to, by comparing the image geometric feature of that character with the image geometric feature of each of the corresponding matching special symbols,
wherein, when the image geometric feature of the character is within the threshold range of the image geometric feature of a special symbol, the character is that special symbol, and the character string unit is extracted based on the special symbol.
3. The method according to claim 2, wherein
the image geometric feature is one selected from a group comprising the width/height ratio of the character image, the distance from the bottom of the character image to a first reference line, and the distance from the top of the character image to a second reference line.
4. The method according to any one of claims 1-3, wherein the recognition step comprises:
a judgment step of judging, based on the codes of the candidates of the characters in the character string unit and a similar character table, whether each of one or more characters included in the character string unit is a similar character or a non-similar character; and
a language determining step of determining the language of the character string unit based on the result obtained by the judgment step,
wherein, when the code of the first non-symbol, non-digit candidate of a character is included in the similar character table, the character is a similar character; otherwise the character is a non-similar character.
5. The method according to claim 4, wherein the language determining step comprises:
a code-based determining step of determining the language of the character string unit based on the codes of the candidates of the characters in the character string unit judged in the judgment step.
6. The method according to claim 5, wherein the code-based determining step comprises:
a first determining step of, in the case where the first non-symbol, non-digit candidate of each character included in the character string unit is a similar character, determining the language of the character string unit by using a secondary-language word dictionary;
a second determining step of, in the case where the first non-symbol, non-digit candidate of each character included in the character string unit is a non-similar character of the main language, determining that the language of the character string unit is the main language; and
a third determining step of, in the case where the first non-symbol, non-digit candidates of the characters included in the character string unit are not all similar characters or non-similar characters of the main language, determining the language of the character string unit based only on the non-similar characters included in the character string unit.
7. The method according to claim 6, wherein the first determining step comprises:
a replacement step of replacing the first non-symbol, non-digit candidate of each character included in the character string unit with a secondary-language similar character so as to form a secondary-language word; and
a comparison step of comparing the secondary-language word obtained by the replacement with the words in the secondary-language word dictionary to determine whether they match,
wherein, if they match, the language of the character string unit is determined to be the secondary language; otherwise the language of the character string unit is determined to be the main language.
8. The method according to claim 6, wherein the third determining step comprises:
a step of determining the language of the character string unit based on multiple non-symbol, non-digit candidates of each non-similar character in the character string unit,
wherein, in the case where at least the first three non-symbol, non-digit candidates of the non-similar character are all non-similar characters of the main language, the language of the character string unit is determined to be the main language.
9. The method according to claim 4, wherein the language determining step further comprises:
a confidence-based determining step of determining the language of the character string unit based on the confidences of the characters in the character string unit judged in the judgment step.
10. The method according to claim 9, wherein the confidence-based determining step comprises:
a confidence sum calculation step of calculating the sum of the main-language maximum confidences of the non-similar characters in the character string unit and the sum of the secondary-language maximum confidences of those non-similar characters; and
a confidence sum comparison step of comparing the ratio of the sum of the secondary-language maximum confidences to the sum of the main-language maximum confidences with a first threshold, thereby determining the language of the character string unit,
wherein, when the ratio is less than the first threshold, the language of the character string unit is determined to be the main language; otherwise it is determined to be the secondary language.
11. The method according to claim 4, wherein the recognition step further comprises:
a correcting step of correcting the similar characters included in the character string unit according to the similar character table and the determined language,
wherein, in the case where the first non-symbol, non-digit candidate of a similar character in the character string unit does not belong to the determined language, the similar character is replaced with the corresponding similar character in the similar character table that belongs to the determined language.
12. An apparatus for recognizing a document image in which letters of a main language and a secondary language are mixed, comprising:
a segmentation device configured to segment the document image into at least one long character string;
an extraction device configured to extract character string units from each of the at least one long character string according to the special symbols included in the long character string; and
a recognition device configured to recognize the document image based on each extracted character string unit.
13. The apparatus according to claim 12, wherein the extraction device comprises:
a match determining device configured to determine, according to the codes of the candidates of each character included in the long character string, a character having at least one candidate each of which is a symbol and matches a special symbol included in a predetermined list of the special symbols included in the long character string; and
a detection device configured to determine which special symbol the character having the matching candidate corresponds to, by comparing the image geometric feature of that character with the image geometric feature of each of the corresponding matching special symbols,
wherein, when the image geometric feature of the character is within the threshold range of the image geometric feature of a special symbol, the character is that special symbol, and the character string unit is extracted based on the special symbol.
14. The apparatus according to claim 13, wherein
the image geometric feature is one selected from a group comprising the width/height ratio of the character image, the distance from the bottom of the character image to a first reference line, and the distance from the top of the character image to a second reference line.
15. The apparatus according to any one of claims 12-14, wherein the recognition device comprises:
a judgment device configured to judge, based on the codes of the candidates of the characters in the character string unit and a similar character table, whether each of one or more characters included in the character string unit is a similar character or a non-similar character; and
a language determining device configured to determine the language of the character string unit based on the result obtained by the judgment device,
wherein, when the code of the first non-symbol, non-digit candidate of a character is included in the similar character table, the character is a similar character; otherwise the character is a non-similar character.
16. The apparatus according to claim 15, wherein the language determining device comprises:
a code-based determining device configured to determine the language of the character string unit based on the codes of the candidates of the characters in the character string unit judged by the judgment device.
17. The apparatus according to claim 16, wherein the code-based determining device comprises:
a first determining device configured to, in the case where the first non-symbol, non-digit candidate of each character included in the character string unit is a similar character, determine the language of the character string unit by using a secondary-language word dictionary;
a second determining device configured to, in the case where the first non-symbol, non-digit candidate of each character included in the character string unit is a non-similar character of the main language, determine that the language of the character string unit is the main language; and
a third determining device configured to, in the case where the first non-symbol, non-digit candidates of the characters included in the character string unit are not all similar characters or non-similar characters of the main language, determine the language of the character string unit based only on the non-similar characters included in the character string unit.
18. The apparatus according to claim 17, wherein the first determining device comprises:
a replacement device configured to replace the first non-symbol, non-digit candidate of each character included in the character string unit with a secondary-language similar character so as to form a secondary-language word; and
a comparison device configured to compare the secondary-language word obtained by the replacement with the words in the secondary-language word dictionary to determine whether they match,
wherein, if they match, the language of the character string unit is determined to be the secondary language; otherwise the language of the character string unit is determined to be the main language.
19. The apparatus according to claim 17, wherein the third determining device comprises:
a device configured to determine the language of the character string unit based on multiple non-symbol, non-digit candidates of each non-similar character in the character string unit,
wherein, in the case where at least the first three non-symbol, non-digit candidates of the non-similar character are all non-similar characters of the main language, the language of the character string unit is determined to be the main language.
20. The apparatus according to claim 15, wherein the language determining device further comprises:
a confidence-based determining device configured to determine the language of the character string unit based on the confidences of the characters in the character string unit judged by the judgment device.
21. The apparatus according to claim 20, wherein the confidence-based determining device comprises:
a confidence sum calculation device configured to calculate the sum of the main-language maximum confidences of the non-similar characters in the character string unit and the sum of the secondary-language maximum confidences of those non-similar characters; and
a confidence sum comparison device configured to compare the ratio of the sum of the secondary-language maximum confidences to the sum of the main-language maximum confidences with a first threshold in order to determine the language of the character string unit,
wherein, when the ratio is less than the first threshold, the language of the character string unit is determined to be the main language; otherwise it is determined to be the secondary language.
22. The apparatus according to claim 15, wherein the recognition device further comprises:
a correcting device configured to correct the similar characters included in the character string unit according to the similar character table and the determined language,
wherein, in the case where the first non-symbol, non-digit candidate of a similar character in the character string unit does not belong to the determined language, the similar character is replaced with the corresponding similar character in the similar character table that belongs to the determined language.
CN201210583676.9A 2012-12-28 2012-12-28 Document image identification method and device Pending CN103902993A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210583676.9A CN103902993A (en) 2012-12-28 2012-12-28 Document image identification method and device

Publications (1)

Publication Number Publication Date
CN103902993A true CN103902993A (en) 2014-07-02

Family

ID=50994305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210583676.9A Pending CN103902993A (en) 2012-12-28 2012-12-28 Document image identification method and device

Country Status (1)

Country Link
CN (1) CN103902993A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6047251A (en) * 1997-09-15 2000-04-04 Caere Corporation Automatic language identification system for multilingual optical character recognition
CN1808468A (en) * 2005-01-17 2006-07-26 佳能信息技术(北京)有限公司 Optical character recognition method and system
US20070081179A1 (en) * 2005-10-07 2007-04-12 Hirobumi Nishida Image processing device, image processing method, and computer program product
CN101751567A (en) * 2008-12-12 2010-06-23 汉王科技股份有限公司 Quick text recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
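Liu Juan et al.: "A character segmentation technique suitable for mixed Chinese and English text" (一种适合中英文混排的字符分割技术), Proceedings of the 2008 China National Computer Congress (《2008中国计算机大会论文集》) *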

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10296583B2 (en) 2014-10-15 2019-05-21 Microsoft Technology Licensing Llc Construction of a lexicon for a selected context
CN106462579A (en) * 2014-10-15 2017-02-22 微软技术许可有限责任公司 Construction of lexicon for selected context
CN106462579B (en) * 2014-10-15 2019-09-27 微软技术许可有限责任公司 Dictionary is constructed for selected context
CN104376304B (en) * 2014-11-18 2018-07-17 新浪网技术(中国)有限公司 A kind of recognition methods of text advertisements image and device
CN104376304A (en) * 2014-11-18 2015-02-25 新浪网技术(中国)有限公司 Identification method and device for text advertisement image
CN106339726A (en) * 2015-07-17 2017-01-18 佳能株式会社 Method and device for handwriting recognition
CN110088770A (en) * 2016-12-28 2019-08-02 欧姆龙健康医疗事业株式会社 Terminal installation
CN110088770B (en) * 2016-12-28 2023-07-07 欧姆龙健康医疗事业株式会社 Terminal device
CN109214381A (en) * 2017-07-03 2019-01-15 发那科株式会社 Numerical control program conversion equipment
CN107992484A (en) * 2017-11-23 2018-05-04 网易有道信息技术(北京)有限公司 A kind of method, equipment and the storage medium of the performance for evaluating and testing OCR system
CN107992484B (en) * 2017-11-23 2022-01-21 网易有道信息技术(北京)有限公司 Method, device and storage medium for evaluating performance of OCR system
CN109582972A (en) * 2018-12-27 2019-04-05 信雅达系统工程股份有限公司 A kind of optical character identification error correction method based on natural language recognition
CN109582972B (en) * 2018-12-27 2023-05-16 信雅达科技股份有限公司 Optical character recognition error correction method based on natural language recognition
CN111507250A (en) * 2020-04-16 2020-08-07 北京世纪好未来教育科技有限公司 Image recognition method, device and storage medium
CN111507350A (en) * 2020-04-16 2020-08-07 腾讯科技(深圳)有限公司 Text recognition method and device
CN111507350B (en) * 2020-04-16 2024-01-05 腾讯科技(深圳)有限公司 Text recognition method and device

Similar Documents

Publication Publication Date Title
CN103902993A (en) Document image identification method and device
Shafait et al. Table detection in heterogeneous documents
US7623715B2 (en) Holistic-analytical recognition of handwritten text
KR100248917B1 (en) Pattern recognizing apparatus and method
US8131087B2 (en) Program and apparatus for forms processing
CN111274239B (en) Test paper structuring processing method, device and equipment
CN101329731A (en) Automatic recognition method pf mathematical formula in image
KR20010093764A (en) Retrieval of cursive chinese handwritten annotations based on radical model
US12051256B2 (en) Entry detection and recognition for custom forms
Anh et al. A hybrid method for table detection from document image
KR101118628B1 (en) Iamge Data Recognition and Managing Method for Ancient Documents using Intelligent Recognition Library and Management Tool
CN115311666A (en) Image-text recognition method and device, computer equipment and storage medium
US9811726B2 (en) Chinese, Japanese, or Korean language detection
CN113673294B (en) Method, device, computer equipment and storage medium for extracting document key information
Villena Toro et al. Optical character recognition on engineering drawings to achieve automation in production quality control
CN103729638A (en) Text row arrangement analytical method and device for text area recognition
US20210182549A1 (en) Natural Language Processing (NLP) Pipeline for Automated Attribute Extraction
Singh et al. Document layout analysis for Indian newspapers using contour based symbiotic approach
Robertson Optical character recognition for classical philology
CN115147846A (en) Multi-language bill identification method, device, equipment and storage medium
CN116229497A (en) Layout text recognition method and device and electronic equipment
Rathnasena et al. Summarization based approach for old sinhala text archival search and preservation
Puri et al. Sentence detection and extraction in machine printed imaged document using matching technique
Nisa et al. Annotation of struck-out text in handwritten documents
Rakshit et al. Development of a multi-user handwriting recognition system using Tesseract open source OCR engine

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20180724