Many businesses and government agencies must process preprinted forms that are filled in by hand, and there are many ways to extract, process, and store the data on them. For example, an image scanner and optical character recognition (OCR) techniques can be used to extract printed or handwritten data from a form. The form image itself can be photographed to produce microfiche or microfilm, or optically scanned to produce an image stored on a computer hard disk or other electronic storage medium. Well-known companies such as Toshiba, Sanyo, Hitachi, and Panasonic have all released form-reading systems that combine image scanning with an OCR device to process Japanese and alphanumeric data.
One type of form commonly used with OCR devices is an A8 or A4 size form laid out with a dark grid. Fig. 1 shows an example of such a form. Labels on the form are preprinted at specified field positions, and the text to be filled in is written in the fields marked by the dark grid, with spacing between characters. The labels do not need to be separated by hidden lines (grid cells).
The dark-grid form 20 shown in Fig. 1 has fields 22, 24, 26, 28 in which text can be filled in. In the medical insurance form used as an example, these include the insured's name 22, the patient's name 24, the employer's name 26, and the patient's and insured's name 28. The corresponding text is entered in boxes 30 containing hidden lines 32; only a Chinese character or an English alphanumeric character may be written in each cell 34 defined by the hidden lines. Registration marks 36 are printed on form 20. In a preferred embodiment, the marks 36 are located at the four corners of the form and are used to correct the skew and offset of the form during scanning.
Fig. 2 shows an enlarged portion of form 20. The preprinted text, for example "insured's name" 38 and "patient's name" 39, is not separated by hidden lines, whereas the handwritten characters in fields 40 and 42 (shown as dotted-line fields in Fig. 2) are written in cells 34. The cells 34 are formed by the hidden lines 32 located in fields 44, 46.
A character recognition system often cannot guarantee error-free recognition; particularly for handwritten characters, recognition errors are unavoidable. Manual correction (by an operator) is therefore essential. Typical character recognition systems often reject illegible or illegal characters. As the number of rejected and misrecognized characters grows, the speed of correction becomes more important for an automated system than for a conventional manual data entry system. An OCR system should therefore provide an efficient method for correcting characters that cannot be recognized.
The result of recognizing a form falls into one of three cases:
1. Entirely correct: every character in the form is recognized and every field passes the post-processing checks, for example a dictionary check (whether the recognized field matches a word in the dictionary) or a grammar check (whether the recognized field conforms to a preset grammar). No manual correction is needed after recognition; any error that remains is a hidden error of the system and cannot be corrected (for example, a misrecognized character that nevertheless passes the post-processing checks). For a system to be practical, its hidden error rate must be lower than that of a manual entry system.
2. Manual correction: the form must be corrected manually on screen after recognition. When some characters are rejected, or when all characters in a field are recognized but the field fails the integrity check, the form must be partially corrected by hand.
3. Entire form rejected: when too many characters in the form cannot be recognized (for example, because the scan quality is too poor, the wrong form was used, or the handwriting is too careless), the form is rejected, and all characters on the form must then be entered manually.
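The three cases above can be illustrated with a minimal sketch. It assumes that a rejected character is represented by `None`, that each field yields one pass/fail post-processing result, and that a form is rejected once the share of rejected characters exceeds a threshold; the name `classify_form` and the value of `max_reject_ratio` are hypothetical and do not come from the described system.

```python
from enum import Enum
from typing import List, Optional


class FormStatus(Enum):
    ENTIRELY_CORRECT = 1
    MANUAL_CORRECTION = 2
    REJECTED = 3


def classify_form(chars: List[Optional[str]], field_checks: List[bool],
                  max_reject_ratio: float = 0.5) -> FormStatus:
    """Classify one recognized form into the three cases.

    chars: recognized characters, with None standing for a rejected character.
    field_checks: one post-processing result (True = passed) per field.
    max_reject_ratio: assumed threshold above which the whole form is rejected.
    """
    rejected = sum(1 for c in chars if c is None)
    if chars and rejected / len(chars) > max_reject_ratio:
        return FormStatus.REJECTED
    if rejected == 0 and all(field_checks):
        return FormStatus.ENTIRELY_CORRECT
    # Some characters were rejected, or a field failed its check.
    return FormStatus.MANUAL_CORRECTION
```

A form with one rejected character out of four would fall into the manual correction case under these assumptions, while a form with three rejected characters out of four would be rejected as a whole.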
Several optical character recognition systems developed abroad have proposed different solutions to the above problems. For example, U.S. Patent No. 5,251,273 (Betts et al.) proposes a data processing system and method for sequentially correcting the errors produced when scanned forms are recognized. The apparatus in that reference includes three recognition-data correction processors: (1) an artificial intelligence processor, (2) a database error detection processor, and (3) a manual verification and correction processor. A data structure records the machine-generated recognition results and the correction history and is passed to each processor in turn. After the artificial intelligence and database error correction processors have finished, field images are displayed on a workstation screen for manual correction.
U.S. Patent No. 5,305,396 (Betts) proposes a data processing system and method in which the character recognition flow and the recognition-data correction flow can be selected for different customers' forms. That reference proposes entering a form template before recognition; the template contains system operating parameters set according to customer requirements, and it must be read by the system before bulk recognition begins.
U.S. Patent No. 5,235,654 (Anderson et al.) proposes an advanced data capture and data processing system for handling scanned form images. It describes a system that can generate new forms for automatic processing.
U.S. Patent No. 5,153,927 (Yamanari) proposes a character reading system and method that lets the user prepare a user-specific processing program whose specification need not be known to the system. The patent proposes two processing sections: a standard processing section and a user-defined processing section. The user-defined section lets the user freely specify the fields to be checked without affecting the standard processing section.
U.S. Patent No. 5,233,627 (Yamanari et al.) proposes a character recognizer with a specialized correction function. The character reading device disclosed in that patent avoids hiding the original form image when the screen displays an image containing rejected characters.
Fig. 3 shows a preferred embodiment of the optical character recognition apparatus of the present invention. The system 50 includes a form transport system 51, which carries the form past an optical scanner ("OCR scanner") 52 in the direction of the arrow. A preferred embodiment of the scanner 52 illuminates the form with laser light and uses an element such as a charge-coupled device (CCD) to produce a binary image of the form, in which each pixel is either logic "1" or logic "0". One suitable model of OCR scanner 52 is the TDC2610W (manufactured by Terminal Data Corp.).
Scanner 52 can be connected to a processor 54 (for example, a general-purpose computer or a special-purpose hardware processing unit). The hardware unit of the processor can be an optical processing unit or an electronic processing unit, for example a "Resistor Summing Network" and digital logic circuits. The processor can include a microprocessor 56 and other components, a screen or monitor 58, and a keyboard or other input device 60. The processor 54 can also include a storage device 62 that stores the scanned document images. The storage device can be a hard disk, RAM, or another storage device.
The recognition process is as follows:
The form to be recognized is scanned by way of feeder 51 and scanner 52, and the resulting binary image data is processed by processor 54 and stored in storage device 62. Application programs, character features, databases, the form-checking knowledge base, and so on are all stored in storage device 62. When recognition is executed, the recognition program, databases, and so on are loaded into dynamic RAM under the control of microprocessor 56 and executed step by step until every image in the batch has been processed, and the resulting batch file is stored on the hard disk. Microprocessor 56 also controls the correction operation, in which images are displayed on screen 58 and the operator works the keyboard for input and output. On receiving input from keyboard 60, microprocessor 56 passes the input value to the correction program in main memory so that the program continues to run until the correction workflow is complete. During scanning, the form image and the character recognition data are displayed on screen 58. After the character recognition procedure described below is finished, a preferred embodiment of the present invention displays the characters that could not be recognized on the screen, and the user can use keyboard 60 to replace rejected and misrecognized characters with the correct ones. As discussed below, fields and forms that cannot be recognized are likewise displayed on the screen for manual correction.
For the OCR system of the present invention to "read" the text on a form, the system preferably first "queries" the form to learn which regions contain text to be read, in what style the text appears (for example, printed or handwritten), and what kind of text it is. Because the field positions and character attributes are known to the OCR device before form recognition, data extraction is faster and more accurate, and character extraction is more efficient. By comparing the expected positions of the registration marks with their actual positions on the form, the skew of the form and the borders of the different fields can be found accurately.
In this way the OCR device only needs to extract the important fields that contain characters to be recognized, independently of the rest of the form. As described below, the recognition and post-processing parameters are also preset, which improves processing efficiency. In other words, the character attributes (printed/handwritten and Chinese/alphanumeric) are preset for recognition processing, and the field descriptions (name, sex, address, etc.) are preset for word post-processing.
Form query procedure:
Fig. 4 is a flowchart 70 of the form query procedure. First, a blank form is scanned (step 72) and the form image is displayed on the computer screen. The operator chooses one field to define (for example, "insured's name") and, using a peripheral device such as a mouse, drags out a rectangular region containing the field to be recognized. The OCR software finds the field borders in the X and Y directions (step 74), so that the positions of the character cells to be filled in can be marked automatically.
Next, a label (or field description) is defined (step 76); it indicates the category of data in the field. For example, the first field is designated as containing the "insured's name" and the second as containing the "patient's name" (see Figs. 1 and 2). After the field is defined, the attributes of the characters it contains are defined (step 78), that is, whether the characters in the defined field should be printed or handwritten English characters, or printed or handwritten Chinese characters. For example, the "patient's name" field should be filled in with handwritten English characters.
After the field border, label, and attributes have all been defined, each hidden-line "cell" 34 (see Fig. 2) is defined as a place where a character is filled in (step 80). The apparatus can thus learn the expected position of each handwritten character.
Then the operator defines the positions of the registration marks 36 (step 82). In the preferred embodiment of the present invention, the registration marks 36 must be located at the four corners of the form, and the data should be written horizontally. The attributes of the registration marks 36 are then defined (step 84).
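The information gathered in steps 74 to 84 could be held in a small data structure such as the one sketched below. This is only an illustration: the class names `FieldDef` and `FormTemplate`, the helper `split_into_cells`, the assumption that character cells have equal width, and all pixel coordinates are hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class FieldDef:
    label: str    # field description, e.g. "patient name" (step 76)
    border: Box   # field border found in step 74
    script: str   # "chinese" or "alphanumeric" (step 78)
    style: str    # "printed" or "handwritten" (step 78)
    cells: List[Box] = field(default_factory=list)  # character cells (step 80)


@dataclass
class FormTemplate:
    marks: List[Tuple[int, int]]  # registration mark positions (step 82)
    fields: List[FieldDef] = field(default_factory=list)


def split_into_cells(border: Box, n_cells: int) -> List[Box]:
    """Divide a field border into n_cells cells, assuming equal cell width."""
    x0, y0, x1, y1 = border
    width = (x1 - x0) / n_cells
    return [(round(x0 + i * width), y0, round(x0 + (i + 1) * width), y1)
            for i in range(n_cells)]


# Illustrative values only: four corner marks and one ten-cell field.
template = FormTemplate(marks=[(50, 50), (1550, 50), (50, 2150), (1550, 2150)])
border = (200, 300, 600, 340)
template.fields.append(
    FieldDef("patient name", border, "alphanumeric", "handwritten",
             split_into_cells(border, 10)))
```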
This query procedure enables the OCR system 50 to extract data automatically from filled-in forms. It speeds up the subsequent character extraction and increases the tolerance for skew.
After all the data in the blank form has been queried, the system can accurately read forms that have been filled in. This requires two steps: character extraction and character recognition. Data extraction comprises three parts: field extraction, line extraction, and character extraction. Character extraction is further divided into printed character extraction (Chinese and alphanumeric) and handwritten character extraction (Chinese and alphanumeric).
Data extraction
Fig. 5 is a recognition workflow diagram of a preferred embodiment of the present invention, in which system 100 is divided into three parts: a scanning part 102, a character recognition part 104, and a recognition post-processing part 106.
The workflow is as follows:
First, the filled-in form is placed in the form transport system 51 and passes through OCR scanner 52 (see Fig. 3) to complete scanning 110. The scanned image is then compared with the blank-form data that was queried and stored in memory 112.
Data extraction can be divided into three steps. First, the positions of the fields containing the data to be extracted are found, taking any possible offset into account. Second, the positions of the text lines within each field are determined; this is text line extraction. Finally, the positions of the characters within each text line are extracted; this is character extraction. Character extraction in turn has two variants: printed character extraction and handwritten character extraction.
1. Field extraction
Extraction module 114 extracts the fields to be recognized and corrects the field coordinates, as follows. First, the offset and skew of the form are determined; the module tolerates skew (up to 5 degrees) and offset (movement of the form during scanning). Both kinds of variation are limited by the mechanics of the paper feed system 51. The positions of the registration marks 36 determine the borders of form 20 (in this embodiment, the registration marks 36 indicate the four corners of the form). The skew and offset of the input form are found by comparing the positions of the registration marks on the input form with the registration mark positions obtained by "querying" the blank form.
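One way to compare the two sets of mark positions is sketched below. It is a simplified illustration that uses only the top-left and top-right marks and models the distortion as a rotation about the image origin followed by a shift; the function names and this particular model are assumptions, and only the 5 degree limit comes from the description.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
MAX_SKEW_DEGREES = 5.0  # skew tolerance stated for this embodiment


def estimate_skew_and_offset(ref_left: Point, ref_right: Point,
                             obs_left: Point, obs_right: Point):
    """Return (angle_degrees, (dx, dy)) taking blank-form marks to scanned marks.

    ref_*: top-left and top-right mark positions on the queried blank form.
    obs_*: positions of the same two marks found on the scanned input form.
    """
    ref_angle = math.atan2(ref_right[1] - ref_left[1], ref_right[0] - ref_left[0])
    obs_angle = math.atan2(obs_right[1] - obs_left[1], obs_right[0] - obs_left[0])
    angle = obs_angle - ref_angle
    # Rotate the reference top-left mark about the origin, then measure the shift.
    c, s = math.cos(angle), math.sin(angle)
    rx = c * ref_left[0] - s * ref_left[1]
    ry = s * ref_left[0] + c * ref_left[1]
    return math.degrees(angle), (obs_left[0] - rx, obs_left[1] - ry)


def within_tolerance(angle_degrees: float) -> bool:
    """True when the measured skew does not exceed the stated limit."""
    return abs(angle_degrees) <= MAX_SKEW_DEGREES
```

Using all four marks and a least-squares fit would be more robust; two marks are used here only to keep the arithmetic visible.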
Next, the module refers to the text attributes recorded in field database 112 to determine the expected position of each field, and extracts the field. Because the skew and offset of the form are known, the position of each field to be recognized can be calculated relative to the blank form.
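The calculation can be sketched as a rotation plus a translation applied to the corners of a field border defined on the blank form. The helper names `map_point` and `map_field`, and the choice of returning an axis-aligned bounding box, are illustrative assumptions rather than the method of the described module.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def map_point(point: Tuple[float, float], angle_degrees: float,
              offset: Tuple[float, float]) -> Tuple[float, float]:
    """Rotate a blank-form point about the origin, then shift it."""
    a = math.radians(angle_degrees)
    x, y = point
    return (math.cos(a) * x - math.sin(a) * y + offset[0],
            math.sin(a) * x + math.cos(a) * y + offset[1])


def map_field(border: Box, angle_degrees: float,
              offset: Tuple[float, float]) -> Box:
    """Return the bounding box of a field border on the scanned form."""
    x0, y0, x1, y1 = border
    corners = [map_point(p, angle_degrees, offset)
               for p in ((x0, y0), (x1, y0), (x0, y1), (x1, y1))]
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    return (min(xs), min(ys), max(xs), max(ys))
```

With no skew and an offset of (10, 5), a field at (200, 300, 600, 340) on the blank form maps to (210, 305, 610, 345) on the scanned form.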
2. Text line extraction
Text line extraction and line coordinate correction are then carried out as follows. Module 114 consults the text attribute database 112 to determine the positions of the text lines in the field and extracts them. If the field contains text lines, a horizontal projection is performed: horizontal scan lines are used to find the black pixels of the characters in each line of the field, the counts along these horizontal lines are combined to form the accumulated projection, and the borders of a text line are determined by the positions of the black pixels in the horizontal lines. The field start position obtained by the query is then used to correct the text line positions; that is, the start address obtained by the "query" is used to find the best horizontal cut line for separating two overlapping lines of input text. When a character string in a text line extends beyond the upper or lower border of the queried text line, the field can still be divided safely into several lines, and the correct text line coordinates are obtained.
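The horizontal projection step can be sketched as follows. The sketch assumes the field image is a list of rows of 0/1 pixels with 1 meaning black, and it treats any run of rows containing at least `min_black` black pixels as one text line; the function names and the threshold are illustrative, and the correction against the queried line positions is not shown.

```python
from typing import List, Tuple


def horizontal_projection(image: List[List[int]]) -> List[int]:
    """Count the black pixels (value 1) on each horizontal scan line."""
    return [sum(row) for row in image]


def find_text_lines(image: List[List[int]], min_black: int = 1) -> List[Tuple[int, int]]:
    """Return (top_row, bottom_row) for each run of rows that contain ink."""
    profile = horizontal_projection(image)
    lines, start = [], None
    for y, count in enumerate(profile):
        if count >= min_black and start is None:
            start = y                      # a text line begins here
        elif count < min_black and start is not None:
            lines.append((start, y - 1))   # the line ended on the previous row
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines


field_image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
text_lines = find_text_lines(field_image)  # two lines: rows 1-2 and row 4
```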
3. Character extraction
Character extraction and coordinate correction then proceed as follows. The characters in a line are extracted using the vertical projection of the character image in that line, that is, vertical scan lines through the characters form the vertical projection. The locations where the projection reaches a minimum are the character boundaries. The text line database 112 can be used to determine whether a character is printed or handwritten. The expected character positions obtained when the blank form was queried can be used to adjust the extraction coordinates of the characters in the field to be recognized, which makes character extraction more effective. The characters in a text line are ordered by their horizontal coordinate, that is, by X coordinate.
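A minimal sketch of the vertical projection follows. It approximates the projection minima by columns whose black-pixel count is at or below `gap_threshold`, which is an assumption made for brevity; the function names are hypothetical, and the adjustment against the queried cell positions is not shown. Because the columns are visited from left to right, the segments come out ordered by X coordinate.

```python
from typing import List, Tuple


def vertical_projection(line_image: List[List[int]]) -> List[int]:
    """Count the black pixels (value 1) on each vertical scan line."""
    return [sum(column) for column in zip(*line_image)]


def segment_characters(line_image: List[List[int]],
                       gap_threshold: int = 0) -> List[Tuple[int, int]]:
    """Return (left_col, right_col) per character, ordered by X coordinate."""
    profile = vertical_projection(line_image)
    segments, start = [], None
    for x, count in enumerate(profile):
        if count > gap_threshold and start is None:
            start = x                          # a character begins here
        elif count <= gap_threshold and start is not None:
            segments.append((start, x - 1))    # projection minimum: boundary
            start = None
    if start is not None:
        segments.append((start, len(profile) - 1))
    return segments


line_image = [
    [1, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 1, 0],
]
characters = segment_characters(line_image)  # columns 0-1 and columns 4-5
```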
(i) Printed character extraction:
The printed character extraction module 116 extracts the field data that the text attribute database 112 indicates as containing printed data, and refers to database 112 to determine in advance whether the characters are Chinese or English. Printed Chinese data is sent to the printed Chinese recognition module 118, and printed alphanumeric data is sent to the printed alphanumeric recognition module 120.
Printed character recognition is then performed. Many known optical recognition devices can serve as the modules 118, 120 shown in Fig. 5 (see, for example, McGraw-Hill Encyclopedia of Electronics and Computers, pp. 109-111 (McGraw-Hill 1984)). Optical recognizers for printed characters usually recognize characters by template matching. The printed character recognition devices 118, 120, however, extract different features: the printed alphanumeric recognition device consults the alphanumeric recognition expert database 122, and the printed Chinese recognition device 118 consults the printed Chinese recognition expert database 124.
(ii) Handwritten character extraction:
The handwritten character extraction module 130 extracts the field data that the text line attribute database 112 indicates as containing handwritten data, and refers to database 112 to determine in advance whether the handwritten field contains Chinese or English alphanumeric data. Handwritten Chinese data is sent to the handwritten Chinese recognition module 132, and handwritten alphanumeric data is sent to the handwritten alphanumeric recognition module 134.
Handwritten character recognition is then performed. The extracted handwritten Chinese characters are compared by at least one handwritten Chinese character recognition expert 136, and handwritten alphanumeric characters by at least one handwritten alphanumeric character recognition expert 138. There are two preferred ways of performing the recognition. The first uses a statistical recognition expert: features are extracted from the extracted character and compared with the features stored in a database, and the closest match is selected as the recognition result.
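The first method amounts to a nearest-match search over stored feature vectors, which can be sketched as below. The choice of Euclidean distance, the `reject_distance` cutoff, and the name `classify_nearest` are assumptions made for illustration; the description does not specify the features or the distance measure.

```python
import math
from typing import Dict, Optional, Sequence


def classify_nearest(features: Sequence[float],
                     reference: Dict[str, Sequence[float]],
                     reject_distance: float) -> Optional[str]:
    """Return the label whose stored features are closest, or None to reject.

    reference: mapping from character label to its stored feature vector.
    reject_distance: assumed cutoff beyond which the character is rejected.
    """
    best_label, best_distance = None, math.inf
    for label, stored in reference.items():
        distance = math.dist(features, stored)  # Euclidean distance
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label if best_distance <= reject_distance else None
```

Returning `None` when nothing is close enough corresponds to a rejected character, which is later shown to the operator for correction.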
The second method uses several recognition experts that "vote" to select the correct recognition result. The preferred embodiment of the present invention uses four recognition experts: the first is the statistical expert described above; the second is a structural relaxation-matching recognition expert; the third is a structural contour-matching recognition expert; and the fourth is a software-simulated neural network. The relaxation-matching expert thins the character image to its skeleton and extracts key structural features, including the number of stroke segments, stroke segment shape (convex or concave, direction, etc.), stroke segment length and position, and turning points. A relaxation-matching classifier is then used to classify the unknown character.
The contour-matching expert extracts the contour of the character image and its structural features, including the position, number, and type of feature points. These features include layout information such as the number and position of holes in the character. A dynamic-matching and layout classifier is used to classify the unknown character.
The neural network recognition expert extracts general statistical features and uses a back-propagation neural network to classify the unknown character.
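The voting among experts can be sketched as a plurality vote. The description does not state how votes are counted, so the rules used here (a minimum number of agreeing experts, rejection on a tie) and the name `vote` are assumptions.

```python
from collections import Counter
from typing import List, Optional


def vote(results: List[Optional[str]], min_votes: int = 2) -> Optional[str]:
    """Combine the experts' answers; None means the character is rejected.

    results: one label per expert, or None when that expert rejected.
    min_votes: assumed minimum number of experts that must agree.
    """
    counts = Counter(r for r in results if r is not None)
    if not counts:
        return None
    ranked = counts.most_common(2)
    label, top = ranked[0]
    if top < min_votes:
        return None                      # not enough agreement
    if len(ranked) > 1 and ranked[1][1] == top:
        return None                      # tie between two candidates
    return label
```

With four experts, three votes for one label against one vote for another would select the majority label, while a two-against-two split would be treated as a rejection under these assumed rules.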
Other methods can also be used to recognize handwritten characters.
4. Recognition post-processing:
Recognition post-processing comprises two steps: word post-processing and screen correction. The word post-processing module 140 performs address post-processing and field checking.
1. Word post-processing:
Word post-processing uses a dictionary to cross-check the correctness of character recognition. For example, the dictionary can contain the names of the cities, townships, roads, and road sections in a given geographic area. The words produced by recognition are compared with the dictionary to determine whether the recognition is correct. Postal codes can also be used for cross-checking.
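A minimal sketch of such a cross-check is given below. The sample place names, the postal code table, and both function names are illustrative placeholders, not data or interfaces of the described system.

```python
from typing import Dict, Set

# Illustrative sample data only.
KNOWN_CITIES: Set[str] = {"Taipei", "Hsinchu", "Tainan"}
CITY_BY_POSTCODE: Dict[str, str] = {"100": "Taipei", "300": "Hsinchu", "700": "Tainan"}


def passes_dictionary_check(word: str, dictionary: Set[str]) -> bool:
    """True when the recognized word matches an entry in the dictionary."""
    return word in dictionary


def passes_postcode_check(city: str, postcode: str,
                          city_by_postcode: Dict[str, str]) -> bool:
    """True when the recognized postal code belongs to the recognized city."""
    return city_by_postcode.get(postcode) == city
```

A recognized city that is missing from the dictionary, or a city and postal code that disagree, would mark the field for manual correction.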
Field checking verifies the range of values allowed for each character and whether the characters in the field satisfy a specified algebraic relation.
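The two kinds of field check can be illustrated as follows. The specific rules shown (digits only, a month between 1 and 12, a total equal to the sum of its items) are hypothetical examples chosen for illustration; the description does not list the rules actually used.

```python
import string
from typing import Iterable, List


def chars_in_range(text: str, allowed: Iterable[str]) -> bool:
    """Range check: every character must belong to the allowed set."""
    allowed_set = set(allowed)
    return all(ch in allowed_set for ch in text)


def is_valid_month(text: str) -> bool:
    """Range check on a value: a month field must be a number from 1 to 12."""
    return chars_in_range(text, string.digits) and text != "" and 1 <= int(text) <= 12


def total_matches(items: List[int], total: int) -> bool:
    """Algebraic relation: the total field must equal the sum of the item fields."""
    return sum(items) == total
```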
2. Screen correction:
Fig. 6 is a flowchart of a preferred screen correction method 200. The scanned form image is fed into the form recognition system (step 202), and the form is classified as "entirely correct", "manual correction", or "rejected" (step 204). Entirely correct form images are stored directly in the database (step 222).
When a form requiring manual correction is processed, it is first determined whether it contains rejected characters (step 206); rejected characters must be corrected manually (step 208).
During character (or field) correction, the screen correction device 144 displays the rejected characters (or fields) on screen 58 (see Fig. 3), as shown in Fig. 7. The images of the rejected characters are displayed on the screen for correction; these characters belong to the same batch but may come from different forms. Many forms can therefore be handled at once, which makes correction more efficient.
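Gathering the rejected characters of a whole batch, regardless of which form they come from, can be sketched as follows. The batch layout (form id, then field name, then recognized string), the "?" reject marker, and the names `Reject`, `collect_rejects`, and `apply_correction` are assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple

Batch = Dict[str, Dict[str, str]]  # form id -> field name -> recognized text


class Reject(NamedTuple):
    form_id: str
    field_name: str
    position: int  # index of the rejected character within the field text


def collect_rejects(batch: Batch, marker: str = "?") -> List[Reject]:
    """List every rejected character in the batch, across all forms."""
    found = []
    for form_id, fields in batch.items():
        for field_name, text in fields.items():
            for position, ch in enumerate(text):
                if ch == marker:
                    found.append(Reject(form_id, field_name, position))
    return found


def apply_correction(batch: Batch, item: Reject, char: str) -> None:
    """Write the operator's keyed character back into the right form."""
    text = batch[item.form_id][item.field_name]
    batch[item.form_id][item.field_name] = (
        text[:item.position] + char + text[item.position + 1:])


batch = {"form1": {"name": "J?HN"},
         "form2": {"name": "MARY", "city": "TA?PE?"}}
rejects = collect_rejects(batch)           # three rejects from two forms
apply_correction(batch, rejects[0], "O")   # form1 name becomes "JOHN"
```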
When a form requires manual correction but contains no rejected characters, the character string in some field has failed the field post-processing check (step 210), and field correction must then be performed (step 214).
Fig. 8 shows an example of the screen display during field correction. As shown in Fig. 8, in the preferred embodiment of the present invention monitor 58 uses a split screen: the field image is displayed on one side (the upper half of the screen in this example) and the recognition result on the other (the lower half of the screen in this example). The user can refer to the field image on the screen to check and correct recognition errors or rejected characters; the operator can, for example, use an input device such as a keyboard to enter the correct characters.
If the form passes the field check (step 216), it is also stored in the database (step 222). If the form does not pass the field check, the whole form is rejected and manual entry of the whole form is carried out (step 218); that is, all the data in the form is retyped by hand. If the corrected form is acceptable (that is, all errors have been corrected manually), the form data is stored in the database (step 222); otherwise the whole form is rejected (step 224).
Finally, the data produced by recognition is sent to the format conversion module 146 and converted into a common database format. The converted data and the form images can then be stored, queried, sorted, or put to other uses.
When rejected characters are corrected, the step requiring the least work is carried out first; that is, characters are inspected and corrected before fields or whole forms. In addition, the character correction step increases the likelihood that a form passes the field check and the whole-form check, so many forms can be handled efficiently at the same time.
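The least-work-first ordering can be sketched as a stable sort of pending correction tasks by level. The task representation (a level name paired with an identifier) and the name `order_tasks` are hypothetical.

```python
from typing import List, Tuple

# Lower number = less work for the operator, so it is shown first.
LEVEL_ORDER = {"character": 0, "field": 1, "form": 2}

Task = Tuple[str, str]  # (level, identifier of the item to correct)


def order_tasks(tasks: List[Task]) -> List[Task]:
    """Sort tasks so characters come before fields, and fields before forms.

    sorted() is stable, so tasks of the same level keep their batch order.
    """
    return sorted(tasks, key=lambda task: LEVEL_ORDER[task[0]])


pending = [("form", "form7"), ("character", "form1/name/1"),
           ("field", "form2/date"), ("character", "form2/city/2")]
ordered = order_tasks(pending)
```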
Description of the correction workflow:
In the correction operation, the image of a character, a field, or a whole form is displayed on the screen. After visually judging the questionable part, the operator uses the keyboard to enter the text data for that character, field, or whole form, so that the computer assists manual entry. The computer basically provides the following functions:
1. Selecting suspect data. This includes characters that character recognition could not recognize (so-called rejects). It also includes fields in which every character was recognized but whose recognition result, when checked against the post-processing knowledge obtained for that field during form checking, does not conform to that knowledge; the image of such a field is selected. In addition, if the form is skewed or the handwriting is too careless, so that more than a certain proportion of the characters or fields in the form cannot be recognized (the form-checking result indicates how many characters and fields such a form should contain), the whole form image is selected. Selecting these targets (that is, deciding which data is selected) is done by the CPU of the computer: it examines and processes, character by character and field by field, the files of mixed image and text data that were stored on the hard disk after recognition of a batch of forms, and stores the related data (sequence numbers, image border coordinates, etc.) of the suspect characters, fields, or even whole form images in dynamic RAM for use in subsequent image display;
2. Displaying suspect data. After data selection is complete, the CPU reads the form files stored on the hard disk and, according to the related data stored in dynamic RAM, displays the objects (characters, fields, or whole images) on the screen. For efficiency, the display order is characters first, then fields, then whole form images;
3. Manual correction. In addition to displaying the image on the screen, the display step also shows a text input area below the image, so that the operator can enter the correct answer corresponding to the displayed image into the computer via the keyboard. After receiving the input data, the CPU performs the field post-processing check to determine whether the data is correct. For example, after all the rejected characters in a batch of forms have been entered, the CPU performs the post-processing check and records the sequence numbers of those that fail in dynamic RAM again, for use in subsequent image display and correction.
With the three basic functions above, following the flow of Fig. 1, highly efficient screen correction is obtained; at the same time, a plain text file of the content of each form can be produced on the hard disk.