Fig. 1 shows the processing flow of general Chinese character document recognition. First, in step 10, an image pickup device such as a common scanner converts the text image of a document into an electronic signal. In practical applications the document may contain both printed and handwritten text, so the character spacing is not necessarily uniform. The preprocessing of step 20 then separates graphics from text and segments the text, producing a series of Chinese character box images. In step 30, statistical or structural features are extracted from each character box image to compute the feature values of each character image. In step 40, these feature values are compared with the parameter models of the recognition character set, obtained by prior training, to find the one or more candidates with the highest similarity together with their similarity scores; these form the candidate matrix (step 50). Steps 10 to 50 constitute the ordinary character recognition stage, whose result is the candidate matrix. To reach the document recognition stage, post-processing by a language model is still required.
Take the two-character word "crow" (乌鸦) as an example. In actual character recognition it may be misread as "bird crow" (鸟鸦), and the resulting candidate matrix resembles the following:
鸟 "bird" (20)    鸦 "crow" (17)
乌 "crow" (22)    雅 "refined" (30)
The number to the right of each candidate is its similarity score; the smaller the value, the more closely the candidate resembles the original character image (that is, the smaller the difference). As shown above, "bird crow" scores as more similar than the correct "crow". The post-processing of step 60 therefore uses a language model to correct such possible recognition errors, for example by using a dictionary to select "crow" rather than "bird crow". The language model score may use any well-known statistical scoring, such as inter-character adjacency tables, inter-word adjacency tables, word-cluster or part-of-speech adjacency tables, or dictionary-based word-length and word-frequency scores, expressed as probabilities or score values. Finally, step 70 outputs the highest-scoring candidate string as the result.
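The role of the post-processing step can be illustrated with a minimal sketch. It assumes the glosses above correspond to the characters 鸟, 乌, 鸦 and 雅, and it stands in a one-word dictionary for the language model; the names `best_string` and `LEXICON` and the tie-breaking rule are hypothetical and are not taken from the described system.

```python
from itertools import product

# Candidate matrix from the example: one list per character position,
# each entry is (candidate, similarity score); a lower score is more similar.
candidate_matrix = [
    [("鸟", 20), ("乌", 22)],   # "bird", "crow"
    [("鸦", 17), ("雅", 30)],   # "crow", "refined"
]

# Hypothetical one-word dictionary standing in for the language model.
LEXICON = {"乌鸦"}

def best_string(matrix, lexicon):
    """Pick the candidate string preferred after dictionary post-processing."""
    def key(path):
        text = "".join(ch for ch, _ in path)
        total = sum(score for _, score in path)
        # Dictionary words are preferred; ties fall back to the image score.
        return (text not in lexicon, total)
    best = min(product(*matrix), key=key)
    return "".join(ch for ch, _ in best)

print(best_string(candidate_matrix, LEXICON))  # 乌鸦
print(best_string(candidate_matrix, set()))    # 鸟鸦 (image scores alone)
```

Without the dictionary, the image scores alone favor the misrecognized string (20 + 17 = 37 against 22 + 17 = 39), which is the situation the post-processing is meant to correct.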
In document recognition, errors such as the confusion between "crow" (乌) and "bird" (鸟) are generally called substitution errors; they arise in the feature extraction and feature comparison steps. A second kind of error, the character segmentation error, arises in the character segmentation step of preprocessing. Segmentation errors generally comprise split-type errors, in which for example "institute" (所) is recognized as "family, jin" (户斤) or "crow" (鸦) is recognized as "tooth, bird" (牙鸟), and merge-type errors, in which for example "capital, outstanding" (京尤) is recognized as "just" (就). For documents written on ruled manuscript paper with visible or faint grids, segmentation errors are not a serious problem; but for closely spaced Chinese documents, or natural handwriting input without such grids, segmentation errors are quite pronounced.
Currently known error detection and error correction techniques are all confined to substitution errors, for example Taiwan patents 81104438, 80102492, 80107315 and 83103817. For segmentation errors, existing products and laboratory systems only provide manual correction tools, which is clearly not an effective solution in practical applications.
Segmentation errors generally arise in the preprocessing step of document recognition. The automatic segmentation error correction method of the present invention, before the post-processing step is carried out, expands the candidate matrix into an expanded candidate matrix according to the possible splits and merges, so that segmentation errors are corrected automatically.
The glyph structure of a Chinese character can be classified, according to the relative positions of its connected components, into types such as top-bottom separation (for example "call together", 召), left-right separation (for example "institute", 所), partial enclosure (for example "ask", 问) and full enclosure (for example "return", 回). When the document recognition system segments text during preprocessing, it generally cuts by horizontal or vertical scanning, depending on the writing direction. Segmentation errors therefore occur most readily in top-bottom characters when the text is written vertically, and in left-right characters when the text is written horizontally. By cause, segmentation errors can be divided into split-type errors and merge-type errors. However, when the result of a split or a merge is not a normal character, the character recognition stage may mistake it for some entirely unrelated normal character, which makes processing very difficult.
Accordingly, in vertically written documents, the characters that may suffer segmentation errors and that this embodiment is intended to handle satisfy the following conditions:
(1) The character can be separated vertically into two or more consecutive parts, and each part is itself a normal character.
(2) Characters whose separated parts form a frequently occurring character sequence are excluded, for example "two" (二) ←→ "one by one" (一一).
Likewise, in horizontally written documents, the characters that may suffer segmentation errors and that this embodiment is intended to handle satisfy the following conditions:
(1) The character can be separated horizontally into two or more consecutive parts, and each part is itself a normal character.
(2) Characters whose separated parts form a frequently occurring character sequence are excluded, for example "good" (好) ←→ "woman" (女子).
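The two conditions can be expressed as a simple filter. The following is a minimal sketch under stated assumptions: the function name `is_handled`, the small character set and the list of frequent sequences are hypothetical illustrations, and the characters shown are assumed readings of the glosses in the text.

```python
def is_handled(parts, normal_chars, frequent_sequences):
    """Return True if a decomposition meets conditions (1) and (2)."""
    # (1) two or more consecutive parts, each itself a normal character
    if len(parts) < 2 or any(p not in normal_chars for p in parts):
        return False
    # (2) the parts must not spell a frequently occurring character sequence
    return "".join(parts) not in frequent_sequences

# Hypothetical data: a tiny character set and two common sequences.
normal_chars = {"户", "斤", "女", "子", "一"}
frequent_sequences = {"女子", "一一"}

print(is_handled(("户", "斤"), normal_chars, frequent_sequences))  # True
print(is_handled(("女", "子"), normal_chars, frequent_sequences))  # False
print(is_handled(("一", "一"), normal_chars, frequent_sequences))  # False
```

Excluding decompositions such as 女子 avoids merging two characters that legitimately appear side by side in ordinary text.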
In this embodiment, the example takes the characters in the BIG-5 5401-character set (the first character set) whose parts, after separation, still belong to the BIG-5 13051-character set (the second character set). Among them, 397, 14 and 1 characters can be separated vertically into two, three and four consecutive parts respectively, and 1570 and 38 characters can be separated horizontally into two and three consecutive parts respectively. Fig. 4 lists some examples of left-right separation and top-bottom separation. The first and second character sets may be adjusted to suit actual conditions, and the two sets may of course be identical.
From the correspondences described above, a vertical glyph structure table and a horizontal glyph structure table can be built, for use in correcting vertically written and horizontally written documents respectively. A glyph structure table may be represented as a tabular structure or as a network structure; the two differ slightly in how the data are expressed. Taking "paste" (糊) as an example, it can be separated left to right into "rice, ancient, moon" (米古月) or "rice, hu" (米胡). The tabular structure lists each combination separately, whereas the network structure represents them by hierarchical subdivision.
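The difference between the two representations can be sketched as follows. This is an illustrative assumption about how such tables might be stored, not the layout used in the described system; the names `TABULAR`, `NETWORK` and `decompositions` are hypothetical, and the network is assumed to contain no cycles.

```python
from itertools import product

# Tabular form: every decomposition of a character is listed in full.
TABULAR = {
    "糊": [("米", "古", "月"), ("米", "胡")],
}

# Network form: each entry records one level of subdivision only;
# finer decompositions are reached by following the parts.
NETWORK = {
    "糊": [("米", "胡")],
    "胡": [("古", "月")],
}

def decompositions(char, network):
    """All ways to split `char` by following the network level by level."""
    found = set()
    for parts in network.get(char, []):
        # Each part either stays whole or is replaced by one of its own splits.
        options = [[(p,)] + sorted(decompositions(p, network)) for p in parts]
        for choice in product(*options):
            found.add(tuple(c for group in choice for c in group))
    return found

print(decompositions("糊", NETWORK) == set(TABULAR["糊"]))  # True
```

The tabular form makes lookup a single step at the cost of repeating shared parts, while the network form stores each subdivision once and derives the full list on demand.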
Using the vertical and horizontal glyph structure tables, both split-type and merge-type segmentation errors can be handled. Fig. 2 shows the flow chart of the automatic segmentation error correction method. The flow up to and including the character recognition stage is unchanged; that is, the candidate matrix serves as the input. Depending on the format in which the document is written, vertically written and horizontally written documents are handled separately (step 52). For a vertically written document, vertical character split/merge processing (step 54) expands the N × M candidate matrix into an expanded candidate matrix, where N is the number of input characters and M is the number of candidates for each input character. In the vertical character split/merge processing, the L candidates with the highest similarity are split character by character and merged where possible, so that all possible segmentation errors are examined; L is a positive integer not greater than M. How the similarity scores are adjusted can be set according to actual needs. In this embodiment, L = 1. When a character is split (C → C1, C2), C(SC) → C1(SC), C2(0); when characters are merged (C1, C2 → C), C1(SC1), C2(SC2) → C(SC1+SC2+15), where SC, SC1 and SC2 denote the similarity scores of the corresponding characters. Post-processing with a language model (step 60) then finds the highest-scoring string among the various combinations. By this procedure, segmentation errors are corrected automatically and the correct result is output (step 70). Horizontally written documents are processed in the same way, so the description is not repeated here.
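The two score adjustment rules of this embodiment can be written down directly. The sketch below is illustrative only: the function names are hypothetical, and assigning a score of 0 to every part after the first when a character has more than two parts is an assumption, since the text states the rule for two parts only.

```python
MERGE_PENALTY = 15  # constant used in this embodiment

def split_candidate(score, parts):
    """C(SC) -> C1(SC), C2(0): the first part keeps the score.

    Assumption: any further parts (C3, ...) also receive 0.
    """
    return [(parts[0], score)] + [(p, 0) for p in parts[1:]]

def merge_candidates(scored_parts, whole):
    """C1(SC1), C2(SC2) -> C(SC1 + SC2 + 15)."""
    return (whole, sum(score for _, score in scored_parts) + MERGE_PENALTY)

print(split_candidate(40, ("户", "斤")))                  # [('户', 40), ('斤', 0)]
print(merge_candidates([("户", 52), ("斤", 55)], "所"))   # ('所', 122)
```

Since a smaller similarity score means a closer match, the added constant on merging makes a merged candidate look slightly worse than its parts, so the language model must supply the evidence in its favor.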
The character splitting, character merging, string combination and language model string scoring described above may be carried out in an interleaved manner or in a batch manner, for example: original string combination → scoring → character splitting → string combination → scoring → character merging → string combination → scoring; or: character splitting → character merging → string combination → scoring. In addition, both merging and splitting operate on the input candidate matrix; that is, the results of splitting are not merged again, and the results of merging are not split again.
This embodiment is now illustrated with an example. The input document fragment is:
"Tokyo especially be exactly electricity logical target" (东京尤其就是电通所的目标)
The candidate matrix obtained from the character recognition stage is (one row per segmented character image, three candidates per row, each followed by its similarity score):
east (东) 34           card 34             bundle 35
capital (京) 47        cooked 64           (gloss lost) 64
outstanding (尤) 35    in particular 48    art 58
its (其) 35            dustpan 51          calculate 54
capital (京) 52        cooked 58           (gloss lost) 65
outstanding (尤) 43    in particular 52    art 59
be (是) 29             fixed 42            foot 43
electricity (电) 35    hail 37             secondary rainbow 37
logical (通) 39        suitable 48         nearby 53
family (户) 52         table tennis 61     Yin 67
jin (斤) 55            liter 58            row 74
of (的) 43             about 63            hook 63
order (目) 32          month 48            times 60
mark (标) 35           stupefied 41        (gloss lost) 43
Here the number to the right of each candidate is its similarity score; the smaller the value, the higher the similarity. Split processing expands the above candidate matrix as follows:
of (的) 43    about 63    hook 63
→ white (白) 43
  spoon (勺) 0
mark (标) 35    stupefied 41    (gloss lost) 43
→ wood (木) 35
  ticket (票) 0
Merge processing then gives:
capital (京) 47        cooked 64           (gloss lost) 64
outstanding (尤) 35    in particular 48    art 58
→ just (就) 97
capital (京) 52        cooked 58           (gloss lost) 65
outstanding (尤) 43    in particular 52    art 59
→ just (就) 110
family (户) 52         table tennis 61     Yin 67
jin (斤) 55            liter 58            row 74
→ institute (所) 122
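The expansion of this example can be reproduced with a short scan over the top candidates. This is a sketch under stated assumptions: `STRUCTURE` is a hypothetical four-entry excerpt of a horizontal glyph structure table limited to two-part characters, the characters are assumed readings of the glosses, and `expansions` is an illustrative name. In line with the text, splits and merges are both derived from the input matrix only.

```python
MERGE_PENALTY = 15  # constant used in this embodiment

# Top candidate of each row of the example matrix (L = 1), in reading order.
top = [("东", 34), ("京", 47), ("尤", 35), ("其", 35), ("京", 52), ("尤", 43),
       ("是", 29), ("电", 35), ("通", 39), ("户", 52), ("斤", 55), ("的", 43),
       ("目", 32), ("标", 35)]

# Hypothetical excerpt of a horizontal glyph structure table.
STRUCTURE = {"就": ("京", "尤"), "所": ("户", "斤"),
             "的": ("白", "勺"), "标": ("木", "票")}

def expansions(top, structure):
    """Return (splits, merges) derived from the input matrix only."""
    whole_of = {parts: whole for whole, parts in structure.items()}
    splits, merges = [], []
    for i, (char, score) in enumerate(top):
        if char in structure:                      # C(SC) -> C1(SC), C2(0)
            first, second = structure[char]
            splits.append([(first, score), (second, 0)])
        if i + 1 < len(top):                       # C1, C2 -> C(SC1+SC2+15)
            nxt, nxt_score = top[i + 1]
            whole = whole_of.get((char, nxt))
            if whole:
                merges.append((whole, score + nxt_score + MERGE_PENALTY))
    return splits, merges

splits, merges = expansions(top, STRUCTURE)
print(merges)  # [('就', 97), ('就', 110), ('所', 122)]
print(splits)  # [[('白', 43), ('勺', 0)], [('木', 35), ('票', 0)]]
```

The merged scores match the values listed above: 47 + 35 + 15 = 97, 52 + 43 + 15 = 110 and 52 + 55 + 15 = 122.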
The top five strings obtained by string combination scoring on the original candidate matrix are, in order:
1 [2132] Tokyo especially the capital be the target of the logical family of electricity jin especially
2 [2127] Tokyo especially the capital be the target of the logical family of electricity jin especially
3 [2123] Tokyo especially the capital be the target of the logical family of secondary rainbow jin especially
4 [2121] Tokyo especially the capital be the target of the logical family of electricity jin especially
5 [2120] Tokyo especially the ancestor be the target of the logical family of electricity jin especially
Among these, the highest score belongs to number 1 (score 2132). For the new string combinations obtained from the expanded candidate matrix, the following numerical examples are given:
A [2105] Tokyo especially the capital be the logical family of electricity jin white peony root target especially
B [2099] Tokyo especially the capital be the order wood ticket of the logical family of electricity jin especially
C [2113] east is the target of the logical family of electricity jin especially with regard to its capital
D [2143] Tokyo especially is exactly the target of the logical family of electricity jin
E [2160] Tokyo especially be exactly electricity logical target
String A replaces "of" (的) with "white, spoon" (白勺), and the score falls. String B replaces "mark" (标) with "wood, ticket" (木票), and the score falls. String C replaces the first "capital, outstanding" (京尤) with "just" (就), and the score falls. String D replaces the second "capital, outstanding" (京尤) with "just" (就), and the score rises. String E replaces the second "capital, outstanding" (京尤) with "just" (就) and "family, jin" (户斤) with "institute" (所); its score not only rises but is the highest of all (2160), so this string is output as the correct result, and the segmentation errors are corrected automatically at the same time.
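The comparison can be checked numerically from the scores quoted above. This small sketch only restates those figures; the variable names are illustrative.

```python
# Best string from the original candidate matrix (number 1).
baseline = 2132

# Language model scores of the strings built from the expanded matrix.
new_scores = {"A": 2105, "B": 2099, "C": 2113, "D": 2143, "E": 2160}

# Change of each new string relative to the original best string.
changes = {label: score - baseline for label, score in new_scores.items()}
print(changes)  # {'A': -27, 'B': -33, 'C': -19, 'D': 11, 'E': 28}

# The output is the highest-scoring string over both groups.
all_scores = {"1": baseline, **new_scores}
best = max(all_scores, key=all_scores.get)
print(best, all_scores[best])  # E 2160
```

Only D and E improve on the original best string, and E, which applies both merges, is selected.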
Fig. 3 is a block diagram of the automatic segmentation error correction apparatus. The vertical/horizontal character splitting and merging device 80 applies appropriate split or merge processing to the input candidate matrix to produce the corresponding expanded candidate matrix; the language model scoring device 82 then scores the resulting strings and selects the highest-scoring one as the post-processing result. The vertical/horizontal character splitting and merging device 80 and the language model scoring device 82 may both be implemented as computer programs.
Although the present invention has been disclosed above by way of a specific embodiment, the embodiment is not intended to limit the invention. Any person skilled in the art may make minor modifications and refinements without departing from the spirit and scope of the invention, so the scope of protection of the invention shall be as defined by the appended claims.