CN1084503C - Method for automatically correcting truncating error of document and device thereof

Info

Publication number: CN1084503C
Application number: CN96100537A
Authority: CN
Inventors: 张照煌
Original assignee: Industrial Technology Research Institute ITRI
Current assignee: Transpacific IP Pte Ltd.
Priority date: 1996-04-09
Filing date: 1996-04-09
Publication date: 2002-05-08
Anticipated expiration: 2016-04-09
Also published as: CN1162158A

Abstract

The present invention relates to a method for automatically correcting character segmentation errors in document recognition, and to a device implementing the method, so that errors introduced when characters are cut are corrected automatically. A vertical character structure table and a horizontal character structure table, listing the characters liable to segmentation errors, are established in advance. Depending on whether the document is written vertically or horizontally, the candidate character matrix is expanded into an expanded candidate character matrix using the corresponding table. The character strings that can be combined from the expanded candidate character matrix are scored by a language model, and the highest-scoring string is selected, whereby the segmentation error is corrected automatically.

Description

Method and device for automatically correcting character segmentation errors in document recognition

The present invention relates to an error correction method and device for document recognition, and in particular to a method and device for automatically correcting character segmentation errors in the recognition of Chinese character documents. Its applications include Chinese form readers, printed/handwritten Chinese text recognition systems, on-line handwritten Chinese recognition in pen-based computing environments, manuscript-paper readers, and other Chinese character document recognition systems.

Fig. 1 shows the processing flow of conventional Chinese character document recognition. First, in step 10, an image capture device such as a common scanner converts the text image of the document into an electronic signal. In practice the document may contain both printed and handwritten text, so character spacing is not necessarily uniform. The preprocessing of step 20 then separates graphics from text and cuts the text, producing a series of Chinese character images. In step 30, statistical or structural features are extracted from each character image to compute its feature values. In step 40, these feature values are matched against the pre-trained parameter models of the recognition character set to find the one or more candidates with the highest similarity, together with their similarity scores, which form the candidate matrix (step 50). Steps 10 to 50 constitute the recognition stage for individual characters, and their result is the candidate matrix; to reach the document recognition stage, post-processing by a language model is still required.

Taking the two-character word "crow" as an example, actual character recognition might read it as "bird crow", with a candidate matrix of roughly the following form:

Bird (20) crow (17)
crow (22) refined (30)

The number to the right of each candidate is its similarity score; the smaller the value, the more similar the candidate is to the original character image (that is, the smaller the difference). As described above, "bird crow" ranks as more similar than "crow". The post-processing of step 60 therefore uses a language model to correct such recognition errors, for example by using a dictionary to select "crow" rather than "bird crow". Language model scoring can use well-known statistical scores, such as character frequency tables, word frequency tables, inter-character succession tables, inter-word succession tables, part-of-speech or word-cluster succession tables, or dictionary-based scoring by word length and word frequency, expressed as probabilities or as scores. Finally, step 70 outputs the highest-scoring candidate string as the result.

In document recognition, errors such as that between "crow" and "bird" are generally called substitution errors; they arise in the feature extraction and feature matching steps. A second kind, the character segmentation error, arises in the character cutting step of preprocessing. Segmentation errors generally comprise split errors, as when "institute" is recognized as "family jin" or "crow" as "tooth bird", and merge errors, as when "capital is outstanding" is recognized as "just". For manuscript-paper documents with a fixed visible or hidden grid, segmentation errors are not a serious problem; but for closely spaced Chinese documents, or for natural handwriting input without a visible or hidden grid, segmentation errors are quite noticeable.

Currently known error detection and correction techniques are all confined to substitution errors, for example Taiwan patents 81104438, 80102492, 80107315 and 83103817. For segmentation errors, current products and laboratory systems only provide manual correction tools, which is clearly not an effective solution in practice.

The main object of the present invention is to provide a method for automatically correcting character segmentation errors in document recognition, so as to resolve segmentation errors in character recognition effectively and improve recognition accuracy.

Another object of the present invention is to provide a device for automatically correcting character segmentation errors in document recognition, which can produce a highly accurate recognition result from the candidate matrix obtained by character recognition.

To achieve these objects, the present invention provides a method for automatically correcting character segmentation errors in document recognition, for correcting segmentation errors according to a candidate matrix of a vertically written document, the candidate matrix being produced by character recognition. Using a vertical/horizontal character structure table representing the characters that can be split and merged vertically/horizontally, a vertical/horizontal character split-merge device expands the candidate matrix into an expanded candidate matrix. A language model then scores the strings obtained by combining the expanded candidate matrix, and the highest-scoring string is selected, whereby the segmentation errors are corrected automatically.

In addition, the present invention provides a device for automatically correcting character segmentation errors in document recognition, for correcting segmentation errors according to a candidate matrix of a vertically written document, the candidate matrix being produced by character recognition. The device comprises: a vertical character split-merge device that receives the candidate matrix and, according to a vertical character structure table, expands it into an expanded candidate matrix representing the character splits and character merges possible in the candidate matrix; and a language model scoring device that scores the strings obtained by combining the expanded candidate matrix and selects the highest-scoring string, so that the segmentation errors are corrected automatically.

To make the above objects, features and advantages of the present invention more readily understood, a specific embodiment is described in detail below with reference to the accompanying drawings:

Brief description of the drawings:

Fig. 1 is a flowchart of a conventional document recognition method.

Fig. 2 is a flowchart of the automatic segmentation error correction method of the present invention.

Fig. 3 is a block diagram of the automatic segmentation error correction device of the present invention.

Fig. 4 is a table of examples (partial) of characters that separate left-right and top-bottom, according to the present invention.

Segmentation errors generally arise in the preprocessing step of document recognition. The automatic correction method of the present invention acts before the post-processing step: it expands the candidate matrix into an expanded candidate matrix according to the possible splits and merges, so that segmentation errors can be corrected automatically.

The structure of a Chinese character can be classified, according to the relative positions of its connected components, into types such as top-bottom separated (for example "calling together"), left-right separated (for example "institute"), partially enclosed (for example "asking") and fully enclosed (for example "returning"). When the document recognition system cuts characters during preprocessing, it generally segments by horizontal or vertical scanning according to the writing direction. Segmentation errors therefore occur most readily in top-bottom separated characters in vertical writing, and in left-right separated characters in horizontal writing. By cause, segmentation errors can be divided into split errors and merge errors. However, when the result of a split or a merge is not a normal character, the recognition stage may mistake it for another, entirely unrelated normal character, which makes handling very difficult.

Accordingly, the characters in a vertically written document that are liable to segmentation errors, and that the present embodiment is intended to handle, satisfy the following conditions:

(1) The character can be separated top to bottom into two or more successive parts, each of which is itself a normal character.

(2) Characters whose separated parts form a frequently occurring character sequence are excluded, for example "two" ↔ "one by one".

Similarly, the characters in a horizontally written document that are liable to segmentation errors, and that the present embodiment is intended to handle, satisfy the following conditions:

(1) The character can be separated left to right into two or more successive parts, each of which is itself a normal character.

(2) Characters whose separated parts form a frequently occurring character sequence are excluded, for example "good" ↔ "woman".

In the present embodiment, the example taken is characters of the BIG-5 5401-character set (the first character set) whose separated parts still belong to the BIG-5 13051-character set (the second character set). Among them, 397, 14 and 1 characters can be separated top to bottom into two, three and four successive parts respectively, and 1570 and 38 characters can be separated left to right into two and three successive parts respectively. Fig. 4 lists some examples of left-right and top-bottom separation. The first and second character sets may be adjusted to suit actual conditions, and the first character set may of course be identical to the second.
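The selection conditions above can be sketched as a small filter. The following Python is only an illustration under stated assumptions, not the patented implementation: the function name `build_structure_table`, the ASCII labels standing in for characters ("X", "X1", and so on) and the toy character sets are all hypothetical.

```python
def build_structure_table(decompositions, first_set, second_set, frequent_sequences):
    """Keep only decompositions that satisfy conditions (1) and (2).

    decompositions: iterable of (character, parts) pairs, parts being a tuple.
    All labels are hypothetical ASCII stand-ins for real characters.
    """
    table = {}
    for char, parts in decompositions:
        if char not in first_set:
            continue  # only characters of the first character set are listed
        if len(parts) < 2 or any(p not in second_set for p in parts):
            continue  # condition (1): two or more parts, each a normal character
        if parts in frequent_sequences:
            continue  # condition (2): the parts form a frequent character sequence
        table.setdefault(char, []).append(parts)
    return table


first_set = {"X", "two", "Y"}
second_set = {"X1", "X2", "one", "Y1"}
decompositions = [
    ("X", ("X1", "X2")),      # kept
    ("two", ("one", "one")),  # dropped by condition (2)
    ("Y", ("Y1", "stroke")),  # dropped: "stroke" is not in the second set
]
table = build_structure_table(
    decompositions, first_set, second_set, frequent_sequences={("one", "one")}
)
print(table)
```

The same filter would be run once with top-bottom decompositions and once with left-right decompositions to obtain the two tables.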

From the correspondences described above, a vertical character structure table and a horizontal character structure table can be established, for use in correcting the recognition of vertically and horizontally written documents respectively. A character structure table may be represented as a tabular structure or as a network structure; the two differ slightly in how the data is expressed. Take "paste" as an example, which can be separated left to right into "Mi Guyue" or "rice recklessly": the tabular structure lists each combination as a separate entry, whereas the network structure represents them by hierarchical subdivision.
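The difference between the two representations can be sketched as follows. This is a minimal illustration with assumed data: the labels `rice`, `recklessly`, `gu` and `yue` are hypothetical ASCII stand-ins, and it is assumed that "Mi" and "rice" denote the same first part and that "recklessly" subdivides into the remaining two parts.

```python
from itertools import product

# Tabular form: every decomposition of a character is listed explicitly.
TABULAR = {"paste": [("rice", "recklessly"), ("rice", "gu", "yue")]}

# Network form: only one level of subdivision is stored per character.
NETWORK = {"paste": [("rice", "recklessly")], "recklessly": [("gu", "yue")]}


def expand(network, char):
    """Return every decomposition of char, including (char,) itself."""
    results = [(char,)]
    for parts in network.get(char, []):
        # Combine the decompositions of each part in order.
        for combo in product(*(expand(network, p) for p in parts)):
            results.append(tuple(x for piece in combo for x in piece))
    return results


def flatten(network):
    """Convert a network table into the equivalent tabular table."""
    return {c: sorted(set(expand(network, c)) - {(c,)}) for c in network}


print(flatten(NETWORK)["paste"])
```

The network form stores each subdivision once and derives the longer decompositions on demand, while the tabular form trades space for a direct lookup.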

Using the vertical and horizontal character structure tables, both split and merge segmentation errors can be handled. Fig. 2 shows the flowchart of the automatic segmentation error correction method. The flow up to the end of the character recognition stage is unchanged; that is, the candidate matrix serves as the input. Depending on the writing format of the document, vertically written and horizontally written documents are handled separately (step 52). For a vertically written document, vertical character split-merge processing (step 54) expands the N × M candidate matrix into an expanded candidate matrix, where N is the number of input characters and M is the number of candidates for each input character. In vertical split-merge processing, the top L candidates by similarity are examined character by character for possible splits and merges, so as to cover all likely segmentation errors, where L is a positive integer not greater than M. The adjustment of the similarity scores can be set according to actual needs. In the present embodiment L = 1, and the scores are assigned as follows. When a character is split (C → C1, C2): C (SC) → C1 (SC), C2 (0). When characters are merged (C1, C2 → C): C1 (SC1), C2 (SC2) → C (SC1 + SC2 + 15). Here SC, SC1 and SC2 are the similarity scores of the corresponding characters. Post-processing with a language model (step 60) then finds the highest-scoring string among the various combinations. In this way segmentation errors are corrected automatically and the correct result is output (step 70). Horizontally written documents are processed in the same way, so the description is not repeated here.
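The two score-adjustment rules of the embodiment can be written out directly. The sketch below is illustrative only; the function names and the sample scores 40 and 30 are assumptions, while the rules themselves and the constant 15 come from the text above.

```python
MERGE_PENALTY = 15  # constant added on merging in the embodiment


def split_scores(parts, sc):
    """Split rule: C (SC) -> C1 (SC), C2 (0)."""
    c1, c2 = parts
    return [(c1, sc), (c2, 0)]


def merge_score(whole, sc1, sc2):
    """Merge rule: C1 (SC1), C2 (SC2) -> C (SC1 + SC2 + 15)."""
    return (whole, sc1 + sc2 + MERGE_PENALTY)


print(split_scores(("C1", "C2"), 40))
print(merge_score("C", 40, 30))
```

Because a smaller similarity score means a closer match, the added constant makes a merged candidate slightly less favored than its two parts taken together, leaving the language model to decide whether the merge is justified.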

The character split, character merge, string combination and language model scoring steps may be carried out in an interleaved or a batch manner, for example: original string combination → scoring → character split → string combination → scoring → character merge → string combination → scoring; or character split → character merge → string combination → scoring. In addition, both merge processing and split processing operate on the input candidate matrix; that is, the results of split processing are not merged again, and the results of merge processing are not split again.

The present embodiment is now illustrated with an example. The input document fragment is:

" Tokyo especially be exactly electricity logical target "

The candidate matrix obtained from the character recognition stage is:

East (34)  cards (34)  bundles (35)
Capital (47)  is cooked (64)  ? (64)
Outstanding (35)  In-particular (48)  arts (58)
Its (35)  dustpan (51)  calculates (54)
Capital (52)  is cooked (58)  ? (65)
Outstanding (43)  In-particular (52)  arts (59)
Be (29)  fixed (42)  foots (43)
Electricity (35)  hails (37)  secondary rainbows (37)
Logical (39)  suitable (48)  is near by (53)
Family (52)  table tennis (61)  Yin (67)
Jin (55)  liter (58)  row (74)
of (43)  about (63)  hooks (63)
Order (32)  months (48)  times (60)
Mark (35)  stupefied (41)  ? (43)

The number to the right of each candidate is its similarity score; the smaller the value, the higher the similarity. Split processing expands the candidate matrix as follows:

of (43)  about (63)  hooks (63)
→ white (43)
   spoon (0)

Mark (35)  stupefied (41)  ? (43)
→ wood (35)
   ticket (0)

Merge processing then gives:

Capital (47)  is cooked (64)  ? (64)
Outstanding (35)  In-particular (48)  arts (58)
→ just (97)

Capital (52)  is cooked (58)  ? (65)
Outstanding (43)  In-particular (52)  arts (59)
→ just (110)

Family (52)  table tennis (61)  Yin (67)
Jin (55)  liter (58)  row (74)
→ institute (122)
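The three merged scores above follow from the rule SC1 + SC2 + 15 applied with L = 1, that is, to the top candidate of each pair of adjacent positions. The sketch below reproduces them; the ASCII labels are hypothetical stand-ins for the characters, the label `of` for the twelfth position is an assumption, and `MERGE_TABLE` is a toy structure table containing only the two merges used in this example.

```python
MERGE_PENALTY = 15  # constant added on merging in the embodiment

# Toy structure table (assumed): adjacent parts -> merged character.
MERGE_TABLE = {("capital", "outstanding"): "just", ("family", "jin"): "institute"}

# Top candidate (L = 1) and similarity score for each input position.
TOP = [
    ("east", 34), ("capital", 47), ("outstanding", 35), ("its", 35),
    ("capital", 52), ("outstanding", 43), ("be", 29), ("electricity", 35),
    ("logical", 39), ("family", 52), ("jin", 55), ("of", 43),
    ("order", 32), ("mark", 35),
]


def merge_entries(top, table):
    """Return (position, merged character, score) for each possible merge."""
    found = []
    for i in range(len(top) - 1):
        (c1, s1), (c2, s2) = top[i], top[i + 1]
        whole = table.get((c1, c2))
        if whole is not None:
            found.append((i, whole, s1 + s2 + MERGE_PENALTY))
    return found


print(merge_entries(TOP, MERGE_TABLE))
```

Each entry found this way is added to the expanded candidate matrix alongside the original candidates, so the language model can compare strings with and without the merge.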

The top five strings obtained by combining and scoring the original candidate matrix are, in order:

1[2132] Tokyo especially the capital be the target of the logical family of electricity jin especially

2[2127] Tokyo especially the capital be the target of the logical family of electricity jin especially

3[2123] Tokyo especially the capital be the target of the logical family of secondary rainbow jin especially

4[2121] Tokyo especially the capital be the target of the logical family of electricity jin especially

5[2120] Tokyo especially the ancestor be the target of the logical family of electricity jin especially

The highest-scoring of these is number 1 (score 2132). The scores of new strings combined from the expanded candidate matrix are illustrated by the following examples:

A[2105] Tokyo especially the capital be the logical family of electricity jin white peony root target especially

B[2099] Tokyo especially the capital be the order wood ticket of the logical family of electricity jin especially

C[2113] east is the target of the logical family of electricity jin especially with regard to its capital

D[2143] Tokyo especially is exactly the target of the logical family of electricity jin

E[2160] Tokyo especially be exactly electricity logical target

String A applies the split "of" → "white peony root", and its score falls; string B applies the split "mark" → "wooden ticket", and its score falls; string C applies the merge "capital is outstanding" → "just" to the first occurrence, and its score falls; string D applies the merge "capital is outstanding" → "just" to the second occurrence, and its score rises; string E applies both the second "capital is outstanding" → "just" and "family jin" → "institute". Its score not only rises but is the highest (2160), so this string is the correct output, and the segmentation errors have been corrected automatically.
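The final selection is simply the maximum over all scored strings, original and expanded. A minimal sketch using the scores quoted above (the identifiers are the string numbers and letters from the example; how the language model arrives at each score is not modeled here):

```python
# (string id, language-model score) as listed in the example.
scored = [
    ("1", 2132), ("2", 2127), ("3", 2123), ("4", 2121), ("5", 2120),
    ("A", 2105), ("B", 2099), ("C", 2113), ("D", 2143), ("E", 2160),
]

original_best = max(s for name, s in scored if name.isdigit())
best = max(scored, key=lambda item: item[1])
improved = [name for name, s in scored if s > original_best]

print(best)      # the selected output string and its score
print(improved)  # expanded strings that beat the best original string
```

Only strings D and E score above the best original string, and E is selected.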

Fig. 3 is a block diagram of the automatic segmentation error correction device. The vertical/horizontal character split-merge device 80 applies the appropriate split or merge processing to the input candidate matrix to produce the corresponding expanded candidate matrix, which is scored by the language model scoring device 82; the highest-scoring string is selected as the post-processing result. The vertical/horizontal character split-merge device 80 and the language model scoring device 82 can both be implemented as computer programs.

Although the present invention has been disclosed above by way of a specific embodiment, the embodiment is not intended to limit the invention. Any person skilled in the art may make minor modifications and refinements without departing from the spirit and scope of the invention, so the scope of protection of the invention is that defined by the appended claims.

Claims

1. A method for automatically correcting character segmentation errors in document recognition, for correcting segmentation errors according to a candidate matrix of a vertically written document, the candidate matrix being produced by character recognition, characterized in that:

using a vertical character structure table representing the characters liable to split and merge segmentation errors, a vertical character split-merge device expands the candidate matrix into an expanded candidate matrix; a language model then scores the strings obtained by combining the expanded candidate matrix, and the highest-scoring string is selected, whereby the segmentation errors are corrected automatically.
2. The method as claimed in claim 1, wherein the vertical character structure table is a table of correspondences established between characters of a first character set and their vertically separated parts, each of which is still a character of a second character set.
3. The method as claimed in claim 2, wherein the vertical character structure table is represented as a tabular structure.
4. The method as claimed in claim 2, wherein the vertical character structure table is represented as a network structure.
5. The method as claimed in claim 2, wherein the first character set may be identical to the second character set.
6. The method as claimed in claim 1, wherein the vertical character split-merge device uses the vertical character structure table to perform character merge processing or character split processing on the first L columns of higher probability in the candidate matrix, producing the expanded candidate matrix, L being a positive integer not greater than the total number of columns of the candidate matrix.
7. The method as claimed in claim 6, wherein the character split processing, character merge processing, combination processing and scoring processing are carried out in an interleaved manner to select the highest-scoring string.
8. The method as claimed in claim 6, wherein the character split processing, character merge processing, combination processing and scoring processing are carried out in a batch manner to select the highest-scoring string.
9. A device for automatically correcting character segmentation errors in document recognition, for correcting segmentation errors according to a candidate matrix of a vertically written document, the candidate matrix being produced by character recognition, characterized in that it comprises:

a vertical character split-merge device, which receives the candidate matrix and, according to a vertical character structure table, expands it into an expanded candidate matrix representing the character splits and character merges possible in the candidate matrix; and

a language model scoring device, which scores the strings obtained by combining the expanded candidate matrix and selects the highest-scoring string, so that the segmentation errors are corrected automatically.
10. A method for automatically correcting character segmentation errors in document recognition, for correcting segmentation errors according to a candidate matrix of a horizontally written document, the candidate matrix being produced by character recognition, characterized in that:

using a horizontal character structure table representing the characters liable to split and merge segmentation errors, a horizontal character split-merge device expands the candidate matrix into an expanded candidate matrix; a language model then scores the strings obtained by combining the expanded candidate matrix, and the highest-scoring string is selected, whereby the segmentation errors are corrected automatically.
11. The method as claimed in claim 10, wherein the horizontal character structure table is a table of correspondences established between characters of a first character set and their horizontally separated parts, each of which is still a character of a second character set.
12. The method as claimed in claim 11, wherein the horizontal character structure table is represented as a tabular structure.
13. The method as claimed in claim 11, wherein the horizontal character structure table is represented as a network structure.
14. The method as claimed in claim 11, wherein the first character set may be identical to the second character set.
15. The method as claimed in claim 10, wherein the horizontal character split-merge device uses the horizontal character structure table to perform character merge processing or character split processing, from left to right, on the first L columns of higher probability in the candidate matrix, producing the expanded candidate matrix, L being a positive integer not greater than the total number of columns of the candidate matrix.
16. The method as claimed in claim 15, wherein the character split processing, character merge processing, combination processing and scoring processing are carried out in an interleaved manner to select the highest-scoring string.
17. The method as claimed in claim 15, wherein the character split processing, character merge processing, combination processing and scoring processing are carried out in a batch manner to select the highest-scoring string.
18. The method as claimed in claim 10, wherein the horizontal character split-merge device uses the horizontal character structure table to perform character merge processing or character split processing, from right to left, on the first L columns of higher probability in the candidate matrix, producing the expanded candidate matrix, L being a positive integer not greater than the total number of columns of the candidate matrix.
19. The method as claimed in claim 18, wherein the character split processing, character merge processing, combination processing and scoring processing are carried out in an interleaved manner to select the highest-scoring string.
20. The method as claimed in claim 18, wherein the character split processing, character merge processing, combination processing and scoring processing are carried out in a batch manner to select the highest-scoring string.
21. A device for automatically correcting character segmentation errors in document recognition, for correcting segmentation errors according to a candidate matrix of a horizontally written document, the candidate matrix being produced by character recognition, characterized in that it comprises:

a horizontal character split-merge device, which receives the candidate matrix and, according to a horizontal character structure table, expands it into an expanded candidate matrix representing the character splits and character merges possible in the candidate matrix; and

a language model scoring device, which scores the strings obtained by combining the expanded candidate matrix and selects the highest-scoring string, so that the segmentation errors are corrected automatically.