CN1153358A - Chinese and English table recognition system and method - Google Patents

Publication number
CN1153358A
Authority
CN
China
Prior art keywords
character
field
list
corrigendum
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 96106616
Other languages
Chinese (zh)
Other versions
CN1107280C (en
Inventor
徐英士
陈谋琰
林文雯
屠乐梃
周开祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Transpacific IP Pte Ltd.
Original Assignee
Industrial Technology Research Institute ITRI
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial Technology Research Institute ITRI
Priority to CN 96106616 (CN1107280C)
Publication of CN1153358A
Application granted
Publication of CN1107280C
Anticipated expiration
Expired - Fee Related (current legal status)

Abstract

A system for recognizing Chinese and English forms comprises a printed alphanumeric recognition module, a handwritten alphanumeric recognition module, a printed Chinese character recognition module, and a handwritten Chinese character recognition module. The recognition method includes form query, cell extraction to verify the optically scanned data, and manual correction. The form query includes defining the boundary of each field that contains data, the nature of the data in the field, the attributes of the cells in the field, and the positions where characters are filled in. Verifying the optically scanned data comprises extracting a field, correcting the field coordinates, extracting at least one line of text, correcting the text line coordinates, extracting a cell, and correcting the cell coordinates.

Description

Recognition system and recognition method for Chinese and English forms
The present invention relates to a recognition system and a recognition method for Chinese and English forms. More precisely, the invention is a recognition system and method for Chinese and English forms that can recognize printed and handwritten Chinese characters and alphanumeric characters.
Many businesses and government agencies must process printed forms that are filled in by hand, and there are many ways to extract, process, and store such data. For instance, an image scanning device and optical character recognition technology can be used to extract the printed or handwritten data on a form. The form image itself can be photographed to produce microfiche or microfilm, or optically scanned and stored as an image on a computer hard disk or other electronic storage medium. Well-known companies such as Toshiba, Sanyo, Hitachi, and Panasonic have all released form reading systems that combine image scanning with an optical character recognition (OCR) device to process Japanese and alphanumeric data.
A form commonly used with an OCR device is of A8 or A4 size and carries a grid of faint lines. Fig. 1 shows an example of such a form. The captions on the form are preprinted at prescribed field positions, and the text to be filled in must be written in the fields marked by the faint grid, with space between characters. The captions need not be separated by faint grid lines.
The faint-grid form 20 shown in Fig. 1 has fields 22, 24, 26, and 28 in which text can be filled in. In the medical insurance form used as an example, these are the insured's name 22, the patient's name 24, the employer's name 26, and the patient's and insured's name 28. The text is entered in a grid 30 made of faint lines 32, and each cell 34 bounded by the faint lines may hold only a Chinese character or an English alphanumeric character. Registration marks 36 are printed on the form 20. In a preferred embodiment the marks 36 are located at the four corners of the form and are used to correct the skew and offset of the form during scanning.
Fig. 2 shows an enlarged portion of form 20. The printed captions, for example "insured's name" 38 and "patient's name" 39, are not separated by faint lines, whereas the handwritten characters in fields 40 and 42 (the dashed regions in Fig. 2) are written inside cells 34. The cells 34 are formed by the faint lines 32 located in fields 44 and 46.
A character recognition system usually cannot guarantee error-free recognition; when handwritten characters are recognized in particular, recognition errors are unavoidable. Manual correction (performed by an operator) is therefore essential. A typical character recognition system often rejects sloppy or illegal characters. When the number of rejected and misrecognized characters grows large, the speed of correction matters more for an automatic system than for an ordinary manual data entry system. An OCR system should therefore provide an efficient method of correcting characters that cannot be recognized.
The result of recognizing a form falls into one of three cases:
1. Fully correct: every character in the form is recognized and every field passes the post-processing checks, for example a dictionary check (whether the recognized field matches a word in the dictionary) or a syntax check (whether the recognized field matches a preset syntax). No manual correction is needed after recognition; any error that remains is a hidden error of the system and cannot be corrected (for example, a misrecognized character that still passes the post-processing checks). The hidden error rate of a practical system must be lower than that of a manual entry system.
2. Manual correction: the form must be corrected manually on screen after recognition. When some characters are rejected, or all characters in a field are recognized but the field fails the integrity check, the form must be partially corrected by hand.
3. Whole form rejected: too many characters in the form cannot be recognized (for example, because the scan quality is too poor, the wrong form was used, or the handwriting is too sloppy). The form is rejected, and all characters on it must be entered manually.
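The three cases above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `FieldResult` type, the function name, and the 50% rejection threshold are hypothetical, since the text does not say how many unrecognized characters count as "too many".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldResult:
    chars: List[Optional[str]]  # None marks a rejected (unrecognized) character
    passed_check: bool          # result of the dictionary/syntax post-processing check


def classify_form(fields: List[FieldResult], max_reject_ratio: float = 0.5) -> str:
    """Sort a recognized form into one of the three cases.

    max_reject_ratio is an assumed threshold, not a value from the text.
    """
    total = sum(len(f.chars) for f in fields)
    rejected = sum(c is None for f in fields for c in f.chars)
    if total and rejected / total > max_reject_ratio:
        return "whole_form_rejected"
    if rejected == 0 and all(f.passed_check for f in fields):
        return "fully_correct"
    return "manual_correction"
```

A form with a single rejected character, or with a field that fails its check, lands in the manual correction case, which is the case the progressive correction procedure is designed for.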
Several foreign OCR systems propose different solutions to these problems. For example, U.S. Patent No. 5,251,273 (Betts et al.) proposes a data processing system and method that corrects, in sequence, the errors produced when scanned forms are recognized. The device of this reference includes three recognition-data correction processors: (1) an artificial intelligence processor, (2) a database error detection processor, and (3) a manual verification and correction processor. A machine-generated data structure records the recognition results and the correction history and is passed to each processor in turn. After the artificial intelligence and database error correction processors have finished, the field image can be displayed on a workstation screen for manual correction.
U.S. Patent No. 5,305,396 (Betts) proposes a data processing system and method that can select character recognition procedures and recognition-data correction procedures for different customers' forms. This reference inputs a form template before recognition; the template contains system operating parameters set according to customer requirements and must be read by the system before large-scale recognition.
U.S. Patent No. 5,235,654 (Anderson et al.) proposes an advanced data capture and data processing system for processing scanned form images. It describes a system that can generate new forms for automatic processing.
U.S. Patent No. 5,153,927 (Yamanari) proposes a character reading system and method. The system lets the user prepare a user-specific processing program whose specification the system need not know. The patent proposes two processing sections: a standard processing section and a user-defined processing section. The user-defined section lets the user freely specify the fields to be checked without affecting the standard processing section.
U.S. Patent No. 5,233,627 (Yamanari et al.) proposes a character recognizer with a special correction function. The patent discloses a character reading device that avoids hiding the original form image when the screen displays an image containing rejected characters.
An object of the present invention is to provide a character recognition device with a Chinese and English form query function.
Another object of the present invention is to provide an optical character recognition device that can recognize characters in printed and handwritten forms.
A further object of the present invention is to provide a device with high recognition efficiency and a manual correction method.
The above objects are achieved by a recognition device having a printed alphanumeric recognition module, a handwritten alphanumeric recognition module, a printed Chinese recognition module, and a handwritten Chinese recognition module. After the extracted data have been recognized, they can be displayed on the screen for review and correction if necessary.
A preferred embodiment of the present invention includes a form query module that "queries" where the data are located on a form, so that the optical character recognition (OCR) device can go directly to the fields containing the text to be processed. The module can query character positions and, during batch processing, compare the positions of the registration marks on the scanned form image with the positions recorded during the query, which improves the tolerance for the skew and offset produced during scanning.
The present invention also proposes an optical character recognition (OCR) device that extracts printed and handwritten data and stores the data in an image file containing printed and handwritten Chinese and alphanumeric characters.
A preferred embodiment also provides a progressive correction procedure in which manual correction is performed only when necessary. The correction steps are ordered by workload from simple to complex, that is, the less costly (less time-consuming) steps are carried out first. In this embodiment, character correction is performed first, then field correction, and finally correction of the whole form.
The steps of the form query method, the character extraction method, the method for verifying the correctness of optically scanned data, and the manual form correction method of the present invention are described below.
The features of the present invention are described in detail below with reference to the accompanying drawings and embodiments.
Description of drawings:
Fig. 1 shows an example of a faint-grid form;
Fig. 2 is an enlarged view of part of the form of Fig. 1;
Fig. 3 is a block diagram of the Chinese and English form recognition device;
Fig. 4 is a flowchart of the form query of the present invention;
Fig. 5 is a workflow diagram of the device of the present invention;
Fig. 6 is a flowchart of the screen correction procedure of the present invention;
Fig. 7 illustrates a screen display during character correction;
Fig. 8 illustrates a screen display during field correction;
Fig. 9 is a flowchart of the character correction screen.
Fig. 3 shows a preferred embodiment of the optical character recognition device of the present invention. The system 50 includes a sheet transport system 51, which moves the form in the direction of the arrow past an optical scanner ("OCR scanner") 52. A preferred embodiment of the scanner 52 illuminates the form with laser light and uses sensing elements, for example charge-coupled devices (CCDs), to produce a binary image of the form. In the binary image produced by the scanner, every pixel is either logical "1" or logical "0". One model of OCR scanner 52 is the TDC2610W (manufactured by Terminal Data Corp.).
The scanner 52 can be connected to a processor 54 (for example, a general-purpose computer or a special-purpose hardware processing unit). The hardware unit of the processor may be an optical processing unit or an electronic processing unit, for example a "Resister Summing Network" and digital logic circuits. The processor can include a microprocessor 56 and other elements, a screen or monitor 58, and a keyboard or other input device 60. The processor 54 can include a storage device 62 that stores the scanned document images. The storage device can be a hard disk, RAM, or another storage device.
The recognition process is as follows:
The form to be recognized is fed by the feeder 51 and scanned by the scanner 52 to produce binary image data, which the processor 54 processes and stores in the storage device 62. The application programs, character features, databases, form-checking knowledge base, and so on are all stored in the storage device 62. When images are recognized, the recognition program, databases, and so on are loaded into dynamic RAM under control of the microprocessor 56 and executed step by step until the whole batch of images has been processed; the resulting batch file is stored on the hard disk. The microprocessor 56 also controls the correction operation: images are displayed on the screen 58 and the operator enters data on the keyboard. After receiving input from the keyboard 60, the microprocessor 56 passes the input value to the correction program in main memory so that the program continues until the correction workflow is complete. During scanning, the form image and the character recognition data can be shown on the screen 58. After the character recognition procedure described below has finished, a preferred embodiment of the invention displays the characters that could not be recognized on the screen, and the user can use the keyboard 60 to replace rejected and misrecognized characters with the correct ones. As discussed below, fields and forms that cannot be recognized are then displayed on the screen for manual correction.
For the OCR system of the present invention to "read" the text on a form, the system preferably first "queries" the form to learn which regions contain text to be read, in which style the text appears (for example, printed or handwritten), and what kind of text it is. Because the field positions and character attributes are known to the OCR device before form recognition, data extraction is faster and more accurate, and character extraction is more efficient. By comparing the expected and actual positions of the registration marks on the form, the skew of the form and the boundaries of the individual fields can be found accurately.
In this way the OCR device only needs to extract and recognize the important fields of the form rather than the whole form. As described below, recognition and post-processing parameters are also preset, which improves processing efficiency. In other words, the character attributes (such as printed/handwritten and Chinese/alphanumeric) are preset for recognition, and the field description (name, sex, address, and so on) is preset for word post-processing.
Form query procedure:
Fig. 4 is a flowchart 70 of the form query. First, a blank form is scanned (step 72) and the form image is displayed on the computer screen. The operator chooses a field to define (for example, "insured's name") and, using a peripheral device such as a mouse, drags out a rectangular region that contains the field to be recognized. The OCR software finds the boundary of the field in the X and Y directions (step 74), so that the positions of the character cells can be marked automatically.
Next, the field description is defined (step 76); it indicates the category of data in the field. For example, the first field is designated as containing the "insured's name" and the second as containing the "patient's name" (see Figs. 1 and 2). After the field description, the attributes of the characters in the field are defined (step 78), that is, whether the characters in the field should be printed or handwritten English characters, or printed or handwritten Chinese characters. For example, the "patient's name" field should be filled in with handwritten English characters.
After the field boundary, description, and attributes have been defined, each faint-line cell 34 (see Fig. 2) is defined as a place where a character is filled in (step 80). The device thereby learns the expected position of each handwritten character.
The operator then defines the positions of the registration marks 36 (step 82). In a preferred embodiment of the invention the registration marks 36 must be located at the four corners of the form, and the data should be filled in horizontally. The attributes of the registration marks 36 are then defined (step 84).
This query procedure allows the OCR system 50 to extract data automatically from filled-in forms. It speeds up the subsequent character extraction and increases the tolerance for skew.
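The information collected by the form query (steps 74 to 84) could be held in a structure like the following Python sketch. The class names, attribute names, and the equal-width cell split are assumptions made for illustration; the text only says that the OCR software finds the field boundary and marks the cell positions automatically.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class FieldDef:
    name: str      # field description, e.g. "patient name" (step 76)
    boundary: Box  # field boundary found on the blank form (step 74)
    script: str    # "chinese" or "alphanumeric" (step 78)
    style: str     # "printed" or "handwritten" (step 78)
    cells: List[Box] = field(default_factory=list)  # character cells (step 80)


@dataclass
class FormTemplate:
    marks: List[Tuple[int, int]]  # registration mark positions (step 82)
    fields: List[FieldDef] = field(default_factory=list)


def split_into_cells(boundary: Box, count: int) -> List[Box]:
    """Divide a field boundary into `count` equal-width character cells.

    Equal widths are an assumption; a real form may need measured cell edges.
    """
    left, top, right, bottom = boundary
    width = (right - left) / count
    return [
        (round(left + i * width), top, round(left + (i + 1) * width), bottom)
        for i in range(count)
    ]
```

Storing the script and style with each field is what lets later stages route a field straight to the matching recognition module without examining the rest of the form.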
After all the data on the blank form have been queried, the system can accurately read forms that have been filled in. This requires two steps: character extraction and character recognition. Data extraction comprises three parts: field extraction, line extraction, and character extraction. Character extraction is further divided into printed character extraction (Chinese and alphanumeric) and handwritten character extraction (Chinese and alphanumeric).
Data extract
Fig. 5 is a recognition workflow diagram of a preferred embodiment of the present invention, in which the system 100 is divided into three parts: a scanning part 102, a character recognition part 104, and a recognition post-processing part 106.
The workflow is as follows:
First, the filled-in form is placed in the sheet transport system 51 and scanned (110) by the OCR scanner 52 (see Fig. 3). The scanned image is then compared with the blank-form data that were obtained by the query and stored in the database 112.
Data extraction has three steps. First, the positions of the fields containing the data to be extracted are found, taking any offset into account. Second, the positions of the text lines within each field are determined; this is text line extraction. Finally, the positions of the characters within each text line are extracted; this is character extraction. Character extraction is in turn divided into printed character extraction and handwritten character extraction.
1. Field extraction
The extraction module 114 extracts the fields to be recognized and corrects the field coordinates, as follows. First, the offset and skew of the form are determined. The module tolerates skew (up to 5 degrees) and offset (movement of the form during scanning); both are limited by the mechanics of the paper feed system 51. The positions of the registration marks 36 determine the boundary of the form 20 (in this embodiment, the marks 36 indicate the four corners of the form). The skew and offset of the input form are obtained by comparing the positions of its registration marks with the mark positions obtained by "querying" the blank form.
The module then consults the field attributes recorded in the field database 112 to determine each field's expected position and extracts the field. Because the skew and offset of the form are known, the position of each field to be recognized can be calculated relative to the blank form.
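One way to carry out this calculation is sketched below in Python. It assumes a rigid transform (rotation plus translation, no scaling) estimated from two registration marks; the function names and the two-mark simplification are assumptions, while the 5 degree limit comes from the text.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def rotate(p: Point, angle: float) -> Point:
    """Rotate a point about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (p[0] * c - p[1] * s, p[0] * s + p[1] * c)


def estimate_skew_and_offset(tpl_a: Point, tpl_b: Point,
                             scan_a: Point, scan_b: Point):
    """Return (angle, (dx, dy)) mapping blank-form coordinates to scan coordinates.

    tpl_a/tpl_b are two registration marks on the blank form; scan_a/scan_b
    are the same two marks located on the scanned, filled-in form.
    """
    tpl_angle = math.atan2(tpl_b[1] - tpl_a[1], tpl_b[0] - tpl_a[0])
    scan_angle = math.atan2(scan_b[1] - scan_a[1], scan_b[0] - scan_a[0])
    angle = scan_angle - tpl_angle
    rx, ry = rotate(tpl_a, angle)
    return angle, (scan_a[0] - rx, scan_a[1] - ry)


def to_scan(p: Point, angle: float, offset: Point) -> Point:
    """Map a blank-form coordinate (e.g. a field corner) onto the scanned image."""
    rx, ry = rotate(p, angle)
    return (rx + offset[0], ry + offset[1])


def skew_within_tolerance(angle: float, max_degrees: float = 5.0) -> bool:
    """Check the skew against the 5 degree limit mentioned in the text."""
    return abs(math.degrees(angle)) <= max_degrees
```

With the angle and offset in hand, each field corner stored during the form query can be passed through `to_scan` to find where that field lies on the scanned image.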
2. Text line extraction
Next, text line extraction and line coordinate correction are performed as follows. The module 114 queries the field database 112 to determine the expected positions of the text lines in the field and extracts the text lines. If the field contains text lines, a horizontal projection is performed: horizontal scan lines are used to find the black pixels that belong to characters of the same line in the field; the counts along these scan lines form the accumulated projection profile, and the boundaries of each text line are determined by the positions of the black pixels in the scan lines. The field starting position obtained from the query is then used to correct the text line positions, that is, the original positions obtained by the query are used to find the best horizontal cut line for separating two overlapping lines of input text. When the character strings in a text line extend beyond the upper and lower boundaries of the queried text line, the field can still be safely divided into several lines, and the correct text line coordinates are obtained.
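The horizontal projection step can be illustrated with a short Python sketch. It assumes the field image is a list of pixel rows in which 1 is a black pixel and 0 is white, and it treats any run of rows containing black pixels as one text line; the correction against the queried line positions is not shown.

```python
from typing import List, Tuple


def horizontal_projection(image: List[List[int]]) -> List[int]:
    """Count the black pixels (value 1) on each horizontal scan line."""
    return [sum(row) for row in image]


def find_text_lines(image: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, of the text lines."""
    profile = horizontal_projection(image)
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                 # first row of a new text line
        elif count == 0 and start is not None:
            lines.append((start, y))  # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines
```

When two handwritten lines touch, no blank row separates them; that is the situation in which the expected line positions from the form query are needed to choose the cut line.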
3. Character extraction
Next, character extraction and coordinate correction are performed as follows. The characters in a line are extracted using the vertical projection of the character image in the line, that is, vertical scan lines through the characters form the vertical projection profile. The minima of the projection profile are the character boundaries. The text line entry in the field database 112 indicates whether the characters are printed or handwritten. The expected character positions obtained when the blank form was queried can be used to adjust the extraction coordinates of the characters in the field to be recognized, making character extraction more effective. The characters in a text line are ordered by their horizontal coordinate, that is, by their X coordinate.
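A minimal Python sketch of segmentation by vertical projection follows. It assumes a text line image given as a list of pixel rows (1 is black), and it treats columns whose projection value is at or below `gap_threshold` as the minima that separate characters; that parameter is an assumption, since the text does not say how a minimum is detected.

```python
from typing import List, Tuple


def vertical_projection(image: List[List[int]]) -> List[int]:
    """Count the black pixels (value 1) in each pixel column."""
    return [sum(column) for column in zip(*image)]


def segment_characters(image: List[List[int]],
                       gap_threshold: int = 0) -> List[Tuple[int, int]]:
    """Return (left, right) column ranges, right exclusive, ordered by X."""
    profile = vertical_projection(image)
    spans, start = [], None
    for x, count in enumerate(profile):
        if count > gap_threshold and start is None:
            start = x                 # left edge of a character
        elif count <= gap_threshold and start is not None:
            spans.append((start, x))  # projection minimum ends the character
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans
```

Because the columns are visited from left to right, the spans come out already ordered by X coordinate, matching the ordering rule stated above.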
(i) Printed character extraction:
The printed character extraction module 116 extracts the field data that the field database 112 indicates as containing printed data, and consults the database 112 to determine in advance whether the characters are Chinese or English. Printed Chinese data are sent to the printed Chinese recognition module 118, and printed alphanumeric data are sent to the printed alphanumeric recognition module 120.
Printed character recognition is then performed. Many optical recognition devices are known that could serve as the modules 118 and 120 shown in Fig. 5 (see, for example, McGraw-Hill Encyclopedia of Electronics and Computers, pp. 109-111 (McGraw-Hill, 1984)). Optical recognizers for printed characters usually identify characters by template matching. The printed character recognition devices 118 and 120, however, extract different features: the printed alphanumeric recognition device 120 makes its decision using the alphanumeric recognition expert database 122, and the printed Chinese recognition device 118 consults the printed Chinese recognition expert database 124.
(ii) Handwritten character extraction:
The handwritten character extraction module 130 extracts the field data that the field database 112 indicates as containing handwritten data, and consults the database 112 to determine in advance whether the handwritten field contains Chinese or alphanumeric data. Handwritten Chinese data are sent to the handwritten Chinese recognition module 132, and handwritten alphanumeric data are sent to the handwritten alphanumeric recognition module 134.
Handwritten character recognition is then performed. Each extracted handwritten Chinese character is compared by at least one handwritten Chinese character recognition expert 136, and each handwritten alphanumeric character by at least one handwritten alphanumeric recognition expert 138. There are two preferred ways of performing recognition. The first uses a statistical recognition expert: features are extracted from the extracted character and compared with the features stored in a database, and the closest match is selected as the recognition result.
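The first approach, choosing the stored feature vector closest to that of the unknown character, can be sketched in Python as a nearest-neighbour search. Euclidean distance and the toy two-dimensional features are assumptions for illustration; the text does not name the features or the distance measure used by the statistical expert.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def nearest_class(features: Sequence[float],
                  references: Dict[str, Sequence[float]]) -> Tuple[Optional[str], float]:
    """Return the label of the closest reference vector and its distance.

    `references` maps each character label to one stored feature vector.
    Euclidean distance is an assumed choice of similarity measure.
    """
    best_label, best_distance = None, math.inf
    for label, reference in references.items():
        distance = math.dist(features, reference)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label, best_distance
```

Returning the distance as well as the label lets a caller reject the character when even the best match is too far away, which is how rejected characters for later manual correction could arise.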
The second method lets several recognition experts "vote" to select the correct recognition result. A preferred embodiment of the invention uses four recognition experts: the first is the statistical expert described above; the second is a structural relaxation-matching expert; the third is a structural contour-matching expert; the fourth is a software-simulated neural network. The relaxation-matching expert thins the character image to its skeleton and extracts key structural features, including the number of strokes, stroke shape (convex or concave, direction, and so on), stroke length and position, and turning points. A relaxation-matching classifier is then used to classify the unknown character.
The contour expert extracts the contour of the character image and extracts structural features, including the position, number, and type of feature points. These features include layout information such as the number and position of holes in the character. A dynamic matching and layout classifier is used to classify the unknown character.
The neural network expert extracts general statistical features and uses a back-propagation neural network to classify the unknown character.
Other methods can also be used to recognize handwritten characters.
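The voting among the four experts could look like the Python sketch below. The rule used here (the top candidate wins only with at least two votes and no tie, otherwise the character is rejected) is an assumption; the text says only that the experts "vote".

```python
from collections import Counter
from typing import List, Optional


def vote(candidates: List[Optional[str]], min_votes: int = 2) -> Optional[str]:
    """Combine the experts' answers; None means the character is rejected.

    Each entry of `candidates` is one expert's answer, or None if that
    expert itself rejected the character. min_votes is an assumed rule.
    """
    counts = Counter(c for c in candidates if c is not None)
    if not counts:
        return None
    ranked = counts.most_common(2)
    top_label, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if top_count < min_votes or tied:
        return None
    return top_label
```

A rejected result is not a failure of the pipeline: it simply routes the character image to the screen correction stage instead of risking a hidden error.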
4. Recognition post-processing:
Recognition post-processing includes two steps: word post-processing and screen correction. The word post-processing module 140 includes address post-processing and field checking.
1. Word post-processing:
Word post-processing uses a dictionary to cross-check the correctness of character recognition. For example, the dictionary can contain the names of the cities, townships, roads, and road sections in a given geographic area. Words produced by recognition can be compared against the dictionary to determine whether recognition is correct. Postal codes can also be used for cross-checking.
Field checking examines the value range of each character and whether the characters in a field satisfy a preset algebraic relation.
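The dictionary check and the two field checks can be sketched in Python as follows. The function names, the sample place names, and the sample relation (quantity times price equals total) are hypothetical; they only illustrate the kinds of check described.

```python
from typing import Callable, Dict, Iterable


def in_dictionary(text: str, dictionary: Iterable[str]) -> bool:
    """Dictionary check: the recognized word must match a dictionary entry."""
    return text in set(dictionary)


def in_range(text: str, low: int, high: int) -> bool:
    """Value range check for a numeric field; non-numeric text fails."""
    try:
        return low <= int(text) <= high
    except ValueError:
        return False


def satisfies_relation(fields: Dict[str, str],
                       relation: Callable[[Dict[str, int]], bool]) -> bool:
    """Algebraic relation check across several recognized numeric fields."""
    try:
        values = {name: int(text) for name, text in fields.items()}
    except ValueError:
        return False
    return relation(values)
```

A misread such as the letter "O" in place of the digit "0" fails these checks even though every character was recognized, which is exactly the case that sends a field to field correction.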
2. Screen correction:
Fig. 6 is a flowchart of a preferred screen correction method 200. The scanned form image is sent to the form recognition system (step 202), and the form is classified as "fully correct", "manual correction", or "rejected" (step 204). A fully correct form is stored directly in the database (step 222).
When a form that requires manual correction is processed, it is first determined whether it contains rejected characters (step 206); rejected characters must be corrected manually (step 208).
When character (or field) correction is performed, the screen correction device 144 displays the rejected characters (or fields) on the screen 58 (see Fig. 3), as shown in Fig. 7. The images of the rejected characters are displayed on the screen for correction; these characters belong to the same batch but may come from different forms. Many forms can thus be handled at once, which makes correction more efficient.
When a form requires manual correction but contains no rejected characters, the character strings in its fields are checked by field post-processing (step 210); a field that fails the check requires field correction (step 214).
Fig. 8 shows an example of the screen display during field correction. As shown in Fig. 8, in a preferred embodiment of the invention the monitor 58 uses a split screen: the field image is displayed on one side (the upper half of the screen in this example) and the recognition result on the other (the lower half in this example). The user can refer to the field image on the screen to check and correct misrecognized or rejected characters; the operator can enter the correct characters with an input device such as a keyboard.
If the form passes the field check, it is also stored in the database (step 222). If the form does not pass the field check, the whole form is rejected (step 216) and entered manually in its entirety (step 218), that is, all the data on the form are retyped by hand. If the corrected form is acceptable (that is, all errors have been corrected manually), the form data are stored in the database (step 222); otherwise the whole form is rejected (step 224).
Finally, the data produced by recognition are sent to the format conversion module 146, which converts them to a common database format. The converted data and the form images can be stored, queried, sorted, or used for other purposes.
When rejected characters are corrected, the step with the least workload is performed first, that is, characters are inspected and corrected before fields or whole forms. In addition, character correction raises the likelihood that a form will pass the field check and the whole-form check, so many forms can be processed together efficiently.
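The least-work-first ordering can be illustrated by building one correction queue for a whole batch, as in the Python sketch below. The `Form` type and the tuple layout of the queue entries are assumptions; the sketch also simplifies the flow of Fig. 6 by listing all work up front rather than rechecking fields after each character correction.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Form:
    form_id: int
    status: str  # "fully_correct", "manual_correction" or "whole_form_rejected"
    rejected_chars: List[Tuple[str, int]] = field(default_factory=list)  # (field name, index)
    failed_fields: List[str] = field(default_factory=list)


def build_correction_queue(batch: List[Form]) -> List[tuple]:
    """Order the correction work for a batch: characters, then fields, then forms."""
    chars, fields, forms = [], [], []
    for form in batch:
        if form.status == "whole_form_rejected":
            forms.append(("form", form.form_id))
        elif form.status == "manual_correction":
            chars.extend(("char", form.form_id, name, index)
                         for name, index in form.rejected_chars)
            fields.extend(("field", form.form_id, name)
                          for name in form.failed_fields)
    return chars + fields + forms
```

Rejected characters from different forms end up next to each other at the head of the queue, which mirrors the batch display of rejected characters from several forms on one screen.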
The explanation of corrigendum work flow:
The corrigendum operation is that the image with part character, field or whole list shows on screen, is had a question after the part with visual judgement by operating personnel, utilizes keyboard to import lteral data in this character, field or whole the list, imports with the indirect labor.Computer provides following function basically:
1, selects suspicious data, the character (being that what is called refuses to recognize) that can't recognize certainly comprising the character identification; Though or whole field can recognize, the aftertreatment knowledge of this field of gained is checked the identification result of this field when utilizing list to check, does not but meet this aftertreatment knowledge, and this moment, this field image was promptly chosen; In addition, if because list tilts or writer's writing is raised very much grass, in making list, surpass a certain proportion of character or field can't identification the time (according to the list check result as can be known such list how many characters and how many fields should be arranged) then this whole form image can be chosen.The action of the above-mentioned target of choosing (or judging the action which data is chosen) be by be stored in after the CPU of the computer identification in a collection of list that the image in the hard disk mixes with lteral data bay word for word first, pursue that field monitors and computing after, with doubt character or field, even the related data of whole form image (sequence number, image boundary coordinate etc.) is stored in the dynamic RAM, shows for successive image and utilizes;
2. Displaying suspicious data. After the data have been selected, the CPU examines the form files stored on the hard disk and, according to the related data stored in dynamic RAM, displays each object (a character, a field, or a whole image) on the screen. For efficiency, the display order runs from characters to fields to whole form images;
3. Manual correction. Besides showing the image on the screen, the display step also shows a text input area below the image, so that the operator can key in the correct answer corresponding to the displayed image. After receiving the input, the CPU performs the field post-processing check to determine whether the data are correct. For example, once all rejected characters in a batch of forms have been entered, the CPU performs the post-processing check and again records in dynamic RAM the sequence numbers of the items that fail, for display in the subsequent correction pass.
With these three basic functions, and following the flow of Fig. 1, highly efficient on-screen correction is achieved; at the same time, a plain-text file of each form's content is produced on the hard disk.
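Function 1 above (selecting suspicious data) can be sketched as follows. This is a minimal illustration under stated assumptions: the record types, the `reject_ratio` threshold of 0.5, and the `field_ok` callback standing in for the field post-processing knowledge are all hypothetical. The text says only that a whole form is selected when more than "a certain proportion" of its characters or fields cannot be recognized; for brevity the sketch counts characters only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # image boundary coordinates (left, top, right, bottom)


@dataclass
class Char:
    text: Optional[str]  # None marks a rejected (unrecognized) character
    box: Box


@dataclass
class Field:
    name: str
    chars: List[Char]
    box: Box


@dataclass
class Form:
    seq_no: int
    fields: List[Field]
    box: Box


@dataclass
class Suspect:
    seq_no: int  # sequence number of the form the item belongs to
    level: str   # "character", "field", or "form"
    box: Box     # region of the image to display to the operator


def select_suspects(form: Form,
                    field_ok: Callable[[str, str], bool],
                    reject_ratio: float = 0.5) -> List[Suspect]:
    """Collect the items of one form that need manual correction."""
    suspects: List[Suspect] = []
    total = rejected = 0
    for fld in form.fields:
        bad = [c for c in fld.chars if c.text is None]
        total += len(fld.chars)
        rejected += len(bad)
        if bad:
            # Rejected characters are selected individually.
            suspects.extend(Suspect(form.seq_no, "character", c.box) for c in bad)
        elif not field_ok(fld.name, "".join(c.text for c in fld.chars)):
            # Every character was recognized, but the field check fails.
            suspects.append(Suspect(form.seq_no, "field", fld.box))
    if total and rejected / total > reject_ratio:
        # Too many rejections: select the whole form image instead.
        return [Suspect(form.seq_no, "form", form.box)]
    return suspects


def digits_only(name: str, text: str) -> bool:
    """Hypothetical field check: the field must contain digits only."""
    return text.isdigit()


date = Field("date", [Char("1", (0, 0, 8, 10)), Char(None, (10, 0, 18, 10)),
                      Char("9", (20, 0, 28, 10))], (0, 0, 30, 10))
amount = Field("amount", [Char("1", (0, 20, 8, 30)), Char("2", (10, 20, 18, 30)),
                          Char("a", (20, 20, 28, 30))], (0, 20, 30, 30))
found = select_suspects(Form(7, [date, amount], (0, 0, 100, 100)), digits_only)
for suspect in found:
    print(suspect)
```

The returned records carry only the sequence number and boundary coordinates, mirroring the related data that the text describes as being kept in memory for the later display step.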
The effects of the present invention include convenient operation and reduced character extraction time. Character recognition speed is increased, and the step-by-step manual correction procedure in particular is a very effective way to correct scanned forms after recognition. In addition, recognition results can be corrected on screen, and data can be extracted and stored effectively. The ability to input, read, and store large volumes of printed and handwritten form data is thereby improved.

Claims (24)

1. A method for recognizing Chinese and English forms, characterized in that it comprises a form query step, an optical character recognition step, and a character post-processing step,
wherein the form query comprises the following steps:
(a) defining the boundary of a field containing data;
(b) defining the nature of the data in the field;
(c) defining the attributes of the characters in the field; and
(d) defining the positions at which characters are expected to be filled in within the field.
2. The method according to claim 1, characterized in that it further comprises defining the positions of several registration marks.
3. The method according to claim 1, characterized in that, before the field boundary is defined, it further comprises scanning a blank form with an optical scanner.
4. The method according to claim 1, characterized in that steps (a)-(d) are repeated for several fields containing data.
5. The method according to claim 1, characterized in that the step of defining the data attributes further comprises defining the format of the data.
6. The method according to claim 1, characterized in that the step of defining the attributes further comprises defining whether the field contains printed or handwritten characters.
7. The method as claimed above, characterized in that said optical character recognition step comprises a step of extracting characters from an electronic image of the form, which comprises:
(a) determining whether the electronic image is skewed or shifted;
(b) extracting a field from the electronic image;
(c) correcting the coordinates of the extracted field;
(d) extracting at least one text line from the corrected field;
(e) correcting the coordinates of the text line;
(f) extracting at least one character from the corrected text line; and
(g) correcting the coordinates of the extracted character.
8. The method according to claim 7, characterized in that the field is defined before said field is extracted.
9. The method according to claim 8, characterized in that said step of defining the field comprises the following steps:
(a) determining the boundary of said field;
(b) determining the positions at which characters are expected to appear in the field;
(c) selecting the nature of the field; and
(d) selecting a mark for the field.
10. The method according to claim 7, characterized in that said step of determining whether the form is skewed or shifted comprises the following steps:
(a) determining the boundary of the electronic image; and
(b) determining, from the boundary of the electronic image, the position of the field to be extracted.
11. The method according to claim 7, characterized in that said step of correcting the coordinates of the extracted field comprises a step of applying skew and offset projection to the extracted field.
12. The method according to claim 7, characterized in that said step of extracting at least one text line further comprises the following steps:
(a) consulting a database to determine the position of the text line; and
(b) using the horizontal projection of the characters in the extracted field and their line positions to adjust the position of the text line in the field.
13. The method according to claim 7, characterized in that said step of correcting the coordinates of the extracted text line further comprises the following steps:
(a) projecting the horizontal projection of the characters onto the extracted field and the line positions, to adjust the text line in the field;
(b) determining whether any character in the text line extends beyond the bottom or top of said extracted field; and
(c) if a character in the text line is found to extend beyond the bottom or top of said extracted field, regenerating said text line.
14. The method according to claim 7, characterized in that said character extraction step further comprises the following steps:
(a) consulting a database to determine whether the character is printed or handwritten;
(b) extracting the character;
(c) sending an extracted handwritten character to a handwritten character recognition module; and
(d) sending an extracted printed character to a printed character recognition module.
15. The method according to claim 14, characterized in that said step of extracting the character further comprises:
(a) determining the vertical projection of a row of characters; and
(b) separating the individual characters.
16. The method according to claim 14, characterized in that said step of sending the extracted handwritten character further comprises:
(a) querying a database to determine whether the handwritten character is expected to be alphanumeric or Chinese;
(b) sending handwritten alphanumeric characters to a handwritten alphanumeric character recognition module; and
(c) sending handwritten Chinese characters to a handwritten Chinese character recognition module.
17. The method according to claim 7, characterized in that said step of correcting the coordinates of the extracted character further comprises a step of arranging the characters according to their horizontal coordinates.
18. The method according to claim 7, characterized in that it further comprises the following steps:
(a) performing a recognition procedure on the extracted characters; and
(b) performing a recognition post-processing procedure on the recognized characters.
19. The method according to claim 1, characterized in that said character post-processing step comprises:
(a) displaying the recognized characters on a monitor; and
(b) if necessary, correcting any character that could not be recognized or was misrecognized.
20. A method for verifying the correctness of optically scanned data, characterized in that it comprises the following steps:
(a) classifying the recognized form information as:
(i) completely correct;
(ii) requiring manual correction; or
(iii) rejected as a whole form;
(b) storing the completely correct form information;
(c) determining whether the form information requiring manual correction contains rejected characters;
(d) if there are rejected characters, manually correcting said rejected characters;
(e) performing a first field post-processing check;
(f) if the field to which a corrected character belongs passes the post-processing check, storing the character information;
(g) performing field correction on fields that did not pass the first field post-processing check and that contain rejected characters;
(h) performing a second field post-processing check on the corrected field information;
(i) if the corrected field information passes the second field post-processing check, storing the field information;
(j) performing whole-form correction on forms that did not pass the second field post-processing check and on forms classified as rejected as a whole;
(k) performing a system post-processing check on the form information corrected as a whole form;
(l) storing the form information that passes this third, system post-processing check; and
(m) rejecting as a whole the form information that fails the third, system post-processing check.
21. The method according to claim 20, characterized in that said scanned data comprise a plurality of forms, and in the step of manually correcting rejected characters, characters from the plurality of forms are corrected.
22. The method according to claim 20, characterized in that said step of manually correcting rejected characters further comprises the following steps:
(a) displaying an image of the rejected character in a first part of a monitor; and
(b) providing, in a second part of the monitor, a position at which the correct character can be entered.
23. A method for manually correcting optically scanned forms, characterized in that it comprises a step of ordering the manual correction procedures according to work complexity, wherein in this step simpler correction procedures are performed before correction procedures of higher complexity.
24. The method according to claim 23, characterized in that a plurality of scanned forms, each having a plurality of fields, are manually corrected by the following steps:
(a) manually correcting characters in form fields that did not pass the first field post-processing check;
(b) manually correcting form field data that did not pass the second field post-processing check; and
(c) correcting as a whole form the form data that did not pass the third field post-processing check.
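The staged verification recited in claim 20 can be read as a cascade in which each level of manual correction is followed by a check, and only failures move on to the next, more expensive level. The sketch below is one hedged reading of that flow, not the claimed implementation: the dictionary representation of a form, the `?` marker for a rejected character, the status strings, and all callbacks (`fix_chars`, `fix_field`, `fix_form`, `field_check`, `system_check`) are assumptions introduced for illustration.

```python
def verify_form(form, status, fix_chars, fix_field, fix_form,
                field_check, system_check):
    """Run one form through the staged checks.

    form   -- dict mapping field name to recognized text ("?" = rejected char)
    status -- "correct", "manual", or "reject_whole" (steps (a)(i)-(iii))
    Returns ("stored", form) or ("rejected", form).
    """
    if status == "correct":
        return "stored", form                              # step (b)
    if status == "manual":
        # Steps (c)-(d): correct only the rejected characters.
        form = {name: fix_chars(name, text) if "?" in text else text
                for name, text in form.items()}
        # Steps (e)-(f): first field check; passing fields stay unchanged.
        failed = [name for name, text in form.items()
                  if not field_check(name, text)]
        # Step (g): field-level correction of the fields that failed.
        for name in failed:
            form[name] = fix_field(name, form[name])
        # Steps (h)-(i): second field check on the corrected fields.
        if all(field_check(name, form[name]) for name in failed):
            return "stored", form
    # Steps (j)-(k): whole-form correction for forms that failed the second
    # check or were rejected as a whole, then the system-level check.
    form = fix_form(form)
    # Steps (l)-(m): store on success, otherwise reject the whole form.
    return ("stored" if system_check(form) else "rejected"), form


def is_digits(name, text):
    """Hypothetical field check: the field must contain digits only."""
    return text.isdigit()


def all_digits(form):
    """Hypothetical system check: every field must contain digits only."""
    return all(text.isdigit() for text in form.values())


# Simulated operator input stands in for the keyboard corrections.
result = verify_form(
    {"year": "19?6", "amount": "12a"}, "manual",
    fix_chars=lambda name, text: text.replace("?", "7"),
    fix_field=lambda name, text: "120",
    fix_form=lambda form: form,
    field_check=is_digits,
    system_check=all_digits,
)
print(result)
```

In this example the rejected character in `year` is fixed at the character level, `amount` fails the first check and is fixed at the field level, and the form is stored without ever reaching whole-form correction.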
CN 96106616 1995-06-13 1996-06-07 Chinese and English table recognition system and method Expired - Fee Related CN1107280C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 96106616 CN1107280C (en) 1995-06-13 1996-06-07 Chinese and English table recognition system and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US08/489945 1995-06-13
CN 96106616 CN1107280C (en) 1995-06-13 1996-06-07 Chinese and English table recognition system and method

Publications (2)

Publication Number Publication Date
CN1153358A true CN1153358A (en) 1997-07-02
CN1107280C CN1107280C (en) 2003-04-30

Family

ID=5119308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 96106616 Expired - Fee Related CN1107280C (en) 1995-06-13 1996-06-07 Chinese and English table recognition system and method

Country Status (1)

Country Link
CN (1) CN1107280C (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100405274C (en) * 1999-06-30 2008-07-23 西尔弗布鲁克研究股份有限公司 Method and system for searching information
CN101661512B (en) * 2009-09-25 2012-01-11 万斌 System and method for identifying traditional form information and establishing corresponding Web form
CN103995904A (en) * 2014-06-13 2014-08-20 上海珉智信息科技有限公司 Recognition system for image file electronic data
CN104021495A (en) * 2014-06-16 2014-09-03 王美金 Banking service application form generation device based on character recognition

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI787651B (en) * 2020-09-16 2022-12-21 洽吧智能股份有限公司 Method and system for labeling text segment

Also Published As

Publication number Publication date
CN1107280C (en) 2003-04-30

Similar Documents

Publication Publication Date Title
CN1103087C (en) Optical scanning list recognition and correction method
US6151423A (en) Character recognition with document orientation determination
JP5073022B2 (en) Low resolution OCR for documents acquired with a camera
Hochberg et al. Automatic script identification from document images using cluster-based templates
CN1162803C (en) Bill distinguishing device and method and recording medium for recording the method
CN1151464C (en) Method of reading characters and method of reading postal addresses
CN1258894A (en) Apparatus and method for identifying character
US20070168382A1 (en) Document analysis system for integration of paper records into a searchable electronic database
US6327388B1 (en) Identification of logos from document images
CN1542656A (en) Information processing apparatus, method, storage medium and program
JP2011166768A (en) Method for generating microfine intrinsic features and document image processing system
CN101048783A (en) Photographic document imaging system
CN1492377A (en) Form processing system and method
CN112818785B (en) Rapid digitization method and system for meteorological paper form document
CN1141666C (en) Online character recognition system for recognizing input characters using standard strokes
CN106778717A (en) A kind of test and appraisal table recognition methods based on image recognition and k nearest neighbor
CN1955981A (en) Character recognition device, character recognition method and character data
CN1367460A (en) Character string identification device, character string identification method and storage medium thereof
CN1107280C (en) Chinese and English table recognition system and method
Rodrigues et al. Cursive character recognition–a character segmentation method using projection profile-based technique
KR100655916B1 (en) Document image processing and verification system for digitalizing a large volume of data and method thereof
CN1228733C (en) Finding objects in image
CN1429450A (en) Method and system for form recognition and digitized image processing
JPH09319824A (en) Document recognizing method
JP2005250786A (en) Image recognition method

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: YUDONG TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: INDUSTRIAL TECHNOLOGY RESEARCH INSTITUTE

Effective date: 20070126

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20070126

Address after: Taiwan, China

Patentee after: Transpacific IP Pte Ltd.

Address before: Taiwan, China

Patentee before: Industrial Technology Research Institute

C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20030430

Termination date: 20130607