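Printed by the Employees' Consumer Cooperative of the Central Bureau of Standards, Ministry of Economic Affairs. Paper size: Chinese National Standard (CNS) A4 (210 × 297 mm).

V. Description of the Invention

Field of the Invention:

The present invention is an optical character recognition device. More specifically, the invention is an optical character recognition device that can recognize printed and handwritten Chinese and alphanumeric characters.

Background of the Invention:

Many businesses and government agencies must process printed forms that have been filled in by hand, and there are many ways to extract, process, and store that data. For example, an image scanning device and optical character recognition (OCR) technology can be used to extract the printed or handwritten data on a form. The form image itself can be photographed to produce microfiche or microfilm, or optically scanned and stored on a hard disk or other electronic storage medium. Well-known companies such as Toshiba, Sanyo, Hitachi, and Panasonic have introduced form reading systems that combine image scanning with an OCR device to process Japanese and alphanumeric data.

A form commonly used with OCR devices is an A4-size or similarly sized form printed with faint-line boxes. Figure 1 shows an example of such a form. The explanatory text on the form must be preprinted at prescribed field positions, while the text to be filled in goes into fields marked by faint-line boxes, with spacing between characters. The explanatory text need not be separated by faint lines (boxes).

The boxed form 20 shown in Figure 1 has fields 22, 24, 26, 28, and so on to be filled in. In the illustrated medical insurance form, for example, these include the insured's name 22, the patient's name 24, the employer's name 26, and a field 28 concerning the patient and the insured. The information is written into boxes 30 bounded by faint lines 32; each box defined by the faint lines may hold only one Chinese or alphanumeric character. Registration marks 36 are printed on the form. In a preferred embodiment these marks are located at the four corners of the form and are used to correct for skew and offset of the form during scanning.

Figure 2 shows an enlarged partial sample 20' of form 20. The printed portions, for example "Insured's name" 38 and "Patient's name" 39, are not separated by faint lines, whereas the handwritten characters in the fields drawn with dashed lines in Figure 2 (such as field 40) are written in boxes 34. The boxes 34 are formed by the faint lines 32 located within fields 44 and 46.

A character recognition system usually cannot guarantee error-free recognition. When handwritten characters are recognized in particular, recognition errors are unavoidable, so manual correction by an operator is indispensable. Typical character recognition systems tend to reject sloppy or illegal characters. Because of rejected and misrecognized characters, data correction matters even more for such an automated system than for an ordinary manual data-entry system. An optical character recognition system should therefore preferably provide an efficient way to correct unrecognizable characters.

After recognition, forms fall into three categories:

1. Completely correct: every character on the form is recognized and every field passes the post-processing checks, such as a dictionary check (whether the recognized field matches a word in the dictionary) and a grammar check (whether the recognized field conforms to a preset grammar). No manual correction is needed after recognition. Any remaining errors are hidden errors of the system and cannot be corrected (for example, a character is misrecognized yet still passes the post-processing checks). A practical system must have a hidden error rate lower than that of a manual data-entry system.

2. Manual correction: the form requires on-screen manual correction after recognition. A form must be manually corrected when some characters are rejected, or when all characters in a field are recognized but the field fails the integrity check.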
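3. Rejected: when too many characters on the form cannot be recognized (for example, because of poor scan quality, a wrong form, or sloppy handwriting), the whole form is rejected, and all characters on the form must then be entered manually.

Several foreign optical character recognition systems have proposed different solutions to the problems above. For example, U.S. Patent No. 5,251,273 (Betts et al.) discloses a data processing system and method for sequentially correcting the errors produced after a scanned form is recognized. The device in this reference includes three recognition-data correction processors: (1) an artificial intelligence processor, (2) a database error-detection processor, and (3) a manual verification and correction processor. The data generated along the way records the recognition results and correction history and is passed to each processor in turn. After the artificial intelligence and database error-correction processors have finished, the workstation screen displays the field image for manual correction.

U.S. Patent No. 5,305,396 (Betts et al.) discloses a data processing system and method that can select a character recognition process and a recognition-data correction process for different customer forms. This reference proposes entering a form template before recognition. The template contains system operating parameters set according to customer requirements, and the system must read the form template before batch recognition.

U.S. Patent No. 5,235,654 (Anderson et al.) discloses an advanced data capture data processing system for processing scanned form images. It describes a system that can create new forms for automatic processing.

U.S. Patent No. 5,153,927 (Yamanari) discloses a character reading system and method. The system allows the user to prepare a user-specific processing program without the system needing to know the specifications of that program. The patent describes two processing sections, a standard processing section and a user-defined processing section. The user-defined section lets the user freely set the fields to be checked without affecting the standard processing section.

U.S. Patent No. …3,627 (Yamanari et al.) discloses a character recognizer with a special correction function. The patent describes a character reading device that avoids covering the original form image when an image containing rejected characters is displayed on the screen.

One object of the present invention is to provide a character recognition device with a form-learning function. Another object is to provide an optical character recognition device that can recognize both printed and handwritten characters on a form. A further object is to provide an efficient method for manually correcting the recognition errors of the device.

Summary of the Invention:

The above objects are achieved by an OCR device that has a printed alphanumeric recognition module, a handwritten alphanumeric recognition module, a printed Chinese recognition module, and a handwritten Chinese recognition module. Once the extracted data has been recognized, it is displayed on the screen for review and correction if necessary.

A preferred embodiment of the invention includes a form-learning module that can "learn" where the information is located on a form, so that the optical character recognition (OCR) device can go directly to the fields that hold the information to be processed. The module also learns character positions and, during batch processing, compares the positions of the registration marks on the scanned form image with the learned registration marks, which improves tolerance of the skew and offset introduced by scanning.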
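A preferred embodiment also provides a waterfall correction process in which manual correction is performed only when necessary. The correction steps are ordered from simple to complex according to workload: the lower-cost (less time-consuming) steps are performed first and act as filters for the more expensive correction methods. In this embodiment, character correction is performed first, then field correction, and finally correction of the whole form.

Brief Description of the Drawings:

The above and other features of the invention will become more apparent from the detailed description of the following figures:

Figure 1 shows an example of a faint-line boxed form.
Figure 2 is an enlarged partial view of the form of Figure 1.
Figure 3 is a block diagram of the components of an OCR device.
Figure 4 is a flowchart of the form-learning process of the invention.
Figure 5 is a block diagram of the system of the invention.
Figure 6 is a flowchart of the waterfall correction procedure of the invention.
Figure 7 is an illustrative screen for character correction.
Figure 8 is an illustrative screen for field correction.

Detailed Description of the Invention:

A. Overview of the optical character recognition system

Figure 3 is a block diagram of an OCR system 50 that can be used with the invention. The system 50 includes a paper transport system 51, which moves the form in the direction of the arrow past an optical scanner ("OCR scanner") 52. A preferred embodiment of the OCR scanner 52 illuminates the form with light and uses a sensing element such as a CCD to produce a binary image of the form. This type of scanner produces a binary image in which every pixel is either logical "1" or logical "0". One model of OCR scanner 52 is the TDC 2610W, manufactured by Terminal Data Corp.

The scanner 52 can be connected to a processor 54 (for example, a general-purpose computer or a special-purpose hardware processing unit). The hardware unit of the processor may be an optical or an electronic processing unit, for example a resistor summing network and digital logic circuits. The processor may include a microprocessor 56 and other components, a screen or monitor 58, and a keyboard or other input device 60. The processor 54 may include a memory device 62 that stores the scanned document images. The memory device may be a hard disk or another memory device.

A form to be recognized is first scanned by the OCR scanner 52, and the image data is processed by the processor 54. During scanning, the form image and character recognition information are displayed on the screen 58. When the character recognition procedure described below has finished, a preferred embodiment of the invention lists the unrecognizable characters on the screen, and the user can use the keyboard 60 to replace rejected and misrecognized characters with the correct ones. As discussed below, unrecognizable fields and forms are also displayed on the screen for manual correction.

For the OCR system of the invention to "read" the data on a form, the system preferably first "learns" which areas of the form hold information to be read, in what form that information appears (for example, printed or handwritten), and what the information is. Because the field positions and character properties have been learned by the OCR device before form recognition, data extraction is faster and more accurate, and character extraction is more efficient. By comparing the expected and actual positions of the form's registration marks, the skew of the form and the boundaries of the fields can be determined more precisely.

This allows the OCR device to isolate from the whole form the fields that contain the information to be extracted and recognized. As described below, the recognition and post-processing parameters are also preset, which improves processing efficiency. In other words, the character properties (such as printed or handwritten, Chinese or alphanumeric) are preset for recognition, and the field descriptions (name, sex, address, and so on) are preset for lexical post-processing.

B. Form-learning procedure

Figure 4 is a flowchart 70 of form learning. First, a blank form is scanned (step 72) and the form image appears on the computer screen. The operator chooses an information field to define (for example, "Insured's name") and, using a peripheral device such as a mouse, drags out a rectangular region that contains the field. The OCR software detects the field boundaries in the X and Y directions (step 74), so that the positions of the character boxes can be marked automatically.

Next a property (or field description) is defined (step 76). This property indicates the kind of information in the field. For example, the first field is marked as containing "Insured's name" and the second as containing "Patient's name" (see Figures 1 and 2). Once the field has been defined, the attributes of the characters it contains are defined (step 78), that is, whether the characters in the field are printed or handwritten alphanumerics, or printed or handwritten Chinese characters. For example, the insured's name field would be defined as containing handwritten characters.

Once the field boundary, property, and attributes have been defined, the location where each character is to be written is defined (step 80). This gives the expected position of each handwritten character.

The operator then defines the positions of the registration marks 36 (step 82). In the preferred embodiment of the invention, the registration marks 36 must be located at the four corners of the form, and the data should be written horizontally. The properties of the registration marks 36 are then defined (step 84).

This learning process enables the OCR system 50 to extract data automatically from filled-in forms. It speeds up the subsequent character extraction and increases tolerance of skew.

Once all the information on the blank form has been learned, the system is ready to read filled-in forms, which requires character extraction and character recognition. Character extraction has three parts: field extraction, text line extraction, and character extraction. Character extraction is further divided into printed character extraction (Chinese and alphanumeric) and handwritten character extraction (Chinese and alphanumeric).

C. Data extraction

Figure 5 is a block diagram of an optical character recognition system 100 according to a preferred embodiment of the invention. The system 100 is divided into three parts: scanning 102, OCR 104, and post-recognition processing 106.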
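The form-learning steps (steps 72 to 84) amount to building a template that records, for each field, its boundary, description, character attributes, and per-character boxes, together with the registration marks. The following is a minimal sketch in Python, not the patent's implementation. The names `FieldTemplate`, `FormTemplate`, `split_into_char_boxes`, and `learn_field` are hypothetical, and the sketch assumes that a field divides into equal-width character boxes, which the text does not state.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class FieldTemplate:
    description: str  # step 76: kind of information, e.g. "insured_name"
    bounds: Box       # step 74: field boundary in X and Y
    style: str        # step 78: "printed" or "handwritten"
    script: str       # step 78: "chinese" or "alphanumeric"
    char_boxes: List[Box] = field(default_factory=list)  # step 80


@dataclass
class FormTemplate:
    fields: List[FieldTemplate] = field(default_factory=list)
    # steps 82 to 84: learned (x, y) positions of the registration marks
    registration_marks: List[Tuple[int, int]] = field(default_factory=list)


def split_into_char_boxes(bounds: Box, count: int) -> List[Box]:
    """Divide a field into `count` equal-width boxes (an assumed layout)."""
    left, top, right, bottom = bounds
    width = (right - left) / count
    return [
        (round(left + i * width), top, round(left + (i + 1) * width), bottom)
        for i in range(count)
    ]


def learn_field(template, description, bounds, style, script, char_count):
    """Record one operator-defined field in the template (steps 74 to 80)."""
    learned = FieldTemplate(
        description=description,
        bounds=bounds,
        style=style,
        script=script,
        char_boxes=split_into_char_boxes(bounds, char_count),
    )
    template.fields.append(learned)
    return learned
```

A call such as `learn_field(template, "insured_name", (100, 50, 300, 90), "handwritten", "chinese", 4)` would record a four-box handwritten field. The coordinates are illustrative only.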
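Comparing the learned and observed registration marks yields the skew and offset of a scanned form. The sketch below illustrates one way to do this and is not the method stated in the patent. It assumes a rigid rotation plus translation estimated from two marks (for example, the top-left and top-right corners), and all function names are hypothetical. The 5 degree limit is the skew tolerance quoted in the text for field extraction.

```python
import math


def estimate_skew_and_offset(learned, observed):
    """Return (angle_in_radians, (dx, dy)) mapping learned marks onto observed ones.

    `learned` and `observed` each hold two (x, y) registration marks in the
    same order, for example top-left then top-right.
    """
    (lx0, ly0), (lx1, ly1) = learned
    (ox0, oy0), (ox1, oy1) = observed
    # Skew is the change in direction of the line joining the two marks.
    angle = math.atan2(oy1 - oy0, ox1 - ox0) - math.atan2(ly1 - ly0, lx1 - lx0)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Offset carries the rotated first learned mark onto the first observed mark.
    dx = ox0 - (cos_a * lx0 - sin_a * ly0)
    dy = oy0 - (sin_a * lx0 + cos_a * ly0)
    return angle, (dx, dy)


def to_scan_coords(point, angle, offset):
    """Map a point learned on the blank form into scanned-image coordinates."""
    x, y = point
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (cos_a * x - sin_a * y + offset[0],
            sin_a * x + cos_a * y + offset[1])


def within_tolerance(angle, max_degrees=5.0):
    """Check the estimated skew against the tolerated maximum."""
    return abs(math.degrees(angle)) <= max_degrees
```

With the angle and offset in hand, every learned field boundary can be mapped into the scanned image with `to_scan_coords` before extraction.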
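First, the filled-in form is placed on the paper transport system 51 and passes the OCR scanner 52 to complete scanning 110. The scanned image is then compared with the blank-form information that has been "learned" and stored in memory 112.

Data extraction is divided into three steps. First, the positions of the fields that contain the data to be extracted are found, compensating for any skew or offset. Second, the positions of the text lines within each field are determined; this is text line extraction. Finally, the positions of the characters within each text line are extracted; this is character extraction. Character extraction is in turn divided into printed character extraction and handwritten character extraction.

1. Field extraction

The extraction module 114 extracts the fields to be recognized and corrects the field coordinates. It works as follows. First the offset and skew of the form are determined. The system tolerates skew (up to 5 degrees) and offset relative to the scan axes. Both variations originate in the paper transport. The registration marks 36 define the boundary of form 20 (in this embodiment, for example, the marks 36 are at the four corners of the form), and the skew and offset of the input form are found by comparing the positions of its registration marks with the mark positions "learned" from the blank form.

Next, the module consults the text line properties and expected positions recorded in the field database 112 and extracts the fields. Because the skew and offset of the form are known, the position of every field to be recognized can be computed relative to the blank form.

2. Text line extraction

Text line extraction and line coordinate correction are then performed as follows. Module 114 queries the text property database 112 to determine the positions of the text lines within a field and extracts them. If the field contains text lines, a horizontal projection is performed: horizontal scan lines count the black pixels of the characters that fall on the same row within the field, and these counts together form the cumulative projection profile. The boundaries of a text line are determined by the positions of the black pixels in the horizontal lines. The learned original position of the field is then used to correct the text line positions, that is, the "learned" original positions are used to find the best horizontal cut that separates two overlapping input text lines. When the character string in a text line extends beyond the upper or lower boundary of the learned text line, the field can still be divided safely into lines, and the correct text line coordinates are obtained.

3. Character extraction

Character extraction and coordinate correction are performed next. The characters in a line are extracted using the vertical projection of the character images in that line, that is, vertical scan lines sweep the characters to form a vertical projection profile. The minima of the profile mark the character boundaries. The text line database field 112 indicates whether the characters are printed or handwritten. The expected character positions learned from the blank form can be used to adjust the extraction coordinates of the characters in the field, which makes character extraction more effective. Characters within a text line are ordered by their horizontal coordinate, that is, by X coordinate.

(i) Printed character extraction:

The printed character extraction module 116 extracts the data of the fields that the text property database 112 marks as containing printed data, and consults database 112 to know in advance whether the characters are Chinese or alphanumeric. Printed Chinese data is sent to the printed Chinese recognition module 118, and printed alphanumeric data to the printed alphanumeric recognition module 120.

Printed character recognition is then performed. Many optical recognition devices are known that could serve as the modules 118 and 120 shown in Figure 5 (see, for example, McGraw-Hill Encyclopedia of Electronics and Computers, pp. 109-111 (McGraw-Hill 1984)). Optical recognizers for printed characters usually use template matching. The printed character recognition devices 118 and 120, however, extract a variety of features and use decision logic to recognize characters. The printed alphanumeric recognition device 120 consults the printed alphanumeric recognition expert database 122, and the printed Chinese recognition device 118 consults the printed Chinese recognition expert database 124.

(ii) Handwritten character extraction:

The handwritten character extraction module 130 extracts the data of the fields that the text line property database 112 marks as containing handwritten data, and consults database 112 to know in advance whether a handwritten field contains Chinese or alphanumeric data. Handwritten Chinese data is sent to the handwritten Chinese recognition module 132, and handwritten alphanumeric data to the handwritten alphanumeric recognition module 134.

Handwritten character recognition is then performed. Extracted handwritten Chinese characters are compared against at least one handwritten Chinese character recognition expert 136, and handwritten alphanumeric characters against at least one handwritten alphanumeric recognition expert. There are two preferred ways to perform recognition. The first uses a statistical recognition expert: features are extracted from the character and compared with stored features, and the closest match is chosen as the recognition result.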
The second uses several recognition experts to improve accuracy. The preferred embodiment of the invention uses four recognition experts: the statistical expert described above, a structural skeleton-matching expert, a structural contour-matching expert, and a simulated neural network expert.

The skeleton-matching expert thins the character image to its skeleton and extracts structural skeleton features, including the number of stroke segments, the type of each stroke segment (such as its shape and direction), the length and position of the stroke segments, and turning points. A relaxation-matching classifier is then used to classify the unknown character.

The contour-matching expert extracts the contour of the character image and extracts structural features, including the positions, number, and types of feature points. These features include information such as the number and positions of holes in the character. A classifier is then used to classify the unknown character.

The neural network expert extracts general statistical features and uses a backpropagation neural network to classify the unknown character.

Other methods can also be used to recognize handwritten characters.

4. Post-recognition processing:

Post-recognition processing has two steps: lexical post-processing and screen correction. The lexical post-processing module 140 includes address post-processing and field checking.

1. Lexical post-processing:

Lexical post-processing uses a lexicon to cross-check the correctness of character recognition. For example, the lexicon may contain the names of the cities, townships, roads, and road sections within a geographic area. The words produced by recognition are compared with the lexicon to decide whether they were recognized correctly. Postal codes can also be used for cross-checking.

Field checking verifies the value range of each character and whether the characters in a field satisfy the prescribed algebraic relations.

2. Screen correction:

Figure 6 is a flowchart of a preferred waterfall screen correction method 200. The scanned form image is sent to the form recognition system (step 202), and the form is classified as "completely correct", "manual correction", or "rejected" (step 204). Completely correct form images are stored in the database directly (step 222).

When a form that needs manual correction is processed, the system first determines whether it contains rejected characters (step 206). Rejected characters must be corrected manually (step 208).

During character correction, the screen corrector 144 displays the rejected characters on the screen 58 (see Figure 3), as shown in Figure 7. The images of the rejected characters are displayed on the screen for correction. These characters belong to the same batch but may come from different forms, so many forms can be handled in one correction pass, which is more efficient.

When a form needs manual correction but contains no rejected characters, the character string in some field has failed the field post-processing check (step 210), and field correction must be performed. Figure 8 shows an example of the screen during field correction. As shown in Figure 8, in the preferred embodiment of the invention the monitor 58 uses a split screen, with the field image on one side (here the upper half of the screen) and the recognition result on the other (here the lower half). The operator checks the recognition result against the field image on the screen and corrects misrecognized or rejected characters, entering the correct characters with an input device such as a keyboard.

If the form passes the field check, it is also stored in the database (step 222). If the form does not pass the field check, the whole form is rejected and re-keyed manually (step 218), that is, all the data on the form is typed in again by hand. If the corrected form is acceptable (that is, all errors have been corrected manually), the form data is stored in the database (step 222); otherwise the whole form is rejected (step 224).

Finally, the data produced by recognition is sent to the format conversion module 146, which converts it into a common database format. The converted data and the optical image of the form can then be stored, queried, sorted, or put to other uses.

Rejected characters are corrected on the principle that the step with the least workload comes first, that is, characters are reviewed and corrected before fields or whole forms. Character correction also raises the likelihood that a form passes the field check and the whole-form check, so that many forms can be processed efficiently at the same time.

The advantages of the invention include ease of operation and reduced character extraction time. Character recognition speed increases, and the waterfall manual correction process is a highly effective way to correct scanned and recognized forms. In addition, the ability to correct recognition results on screen, together with effective extraction and storage ...
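The horizontal and vertical projections described for text line and character extraction can be sketched as follows. This is a simplified illustration under stated assumptions: the image is a list of rows of 0/1 pixels (1 is black), boundaries are taken where the projection falls to zero rather than at general minima, and the correction against learned positions is omitted. All function names are hypothetical.

```python
def projection(image, axis):
    """Count black pixels per row ("rows") or per column ("cols")."""
    if axis == "rows":
        return [sum(row) for row in image]
    return [sum(column) for column in zip(*image)]


def runs_above(profile, threshold=0):
    """Return half-open (start, end) index ranges where profile > threshold."""
    runs, start = [], None
    for index, value in enumerate(profile):
        if value > threshold and start is None:
            start = index
        elif value <= threshold and start is not None:
            runs.append((start, index))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def extract_characters(image):
    """Return character boxes (left, top, right, bottom), line by line.

    Text lines come from the horizontal projection; within each line,
    characters come from the vertical projection and are therefore
    already ordered by X coordinate.
    """
    boxes = []
    for top, bottom in runs_above(projection(image, "rows")):
        band = image[top:bottom]
        for left, right in runs_above(projection(band, "cols")):
            boxes.append((left, top, right, bottom))
    return boxes
```

A real scan would need a small positive `threshold` to ignore noise, and touching handwritten characters would call for cuts at projection minima rather than at empty columns.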
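The text does not spell out how the outputs of several recognition experts are combined. One common choice, shown here purely as an assumption, is a majority vote that rejects the character when the experts disagree too much. The names `combine_experts`, `min_votes`, and `REJECT` are hypothetical.

```python
from collections import Counter

REJECT = None  # marker for "this expert (or the ensemble) rejects the character"


def combine_experts(candidates, min_votes=2):
    """Majority vote over expert outputs; return REJECT if there is no clear winner.

    `candidates` holds one label per expert, or REJECT where an expert
    could not classify the character.
    """
    votes = Counter(label for label in candidates if label is not REJECT)
    if not votes:
        return REJECT
    ranked = votes.most_common(2)
    best, count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == count
    if count < min_votes or tied:
        return REJECT
    return best
```

A rejected result from the vote would then be passed on to manual character correction instead of being stored.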
Other methods can also be used to recognize handwritten characters.

4. Post-recognition processing: There are two post-recognition processing steps: word post-processing and screen correction. The word post-processing module 140 performs address post-processing and field checking.

1. Word post-processing: Word post-processing uses a lexicon to check the correctness of character recognition. For example, the lexicon can include the names of the cities, towns, roads, and sections within a geographic area. The words produced by recognition are compared with the lexicon to determine whether the recognition is correct. The postal code can also be used for checking. The field check verifies the range of each character and whether the characters in the field satisfy a preset algebraic relationship.

2. Screen correction: Figure 6 is a flowchart of a preferred waterfall-style screen correction method 200. The scanned form image is sent to the form recognition system (step 202), and the form is classified into one of three categories: "completely correct", "manual correction", or "rejected" (step 204). A completely correct form is stored directly in the database (step 222). When a form requiring manual correction is processed, it is first determined whether the form contains rejected characters (step 206); rejected characters must be corrected manually (step 20[illegible]). During character correction, the screen corrector 144 displays the images of the rejected characters on the monitor 58 (see Figure 3) for correction, as shown in Figure 7.
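The lexicon comparison and field check described under word post-processing above can be sketched as follows. The text does not specify the lexicon contents or the algebraic relationship, so the sample place names, the "itemized amounts add up to the total" relation, and all function and key names here are assumptions chosen for illustration.

```python
import string

# Hypothetical sample lexicon; a real one would list the cities, towns,
# roads, and sections of the geographic area being processed.
LEXICON = {"Taipei", "Hsinchu", "Tainan"}


def in_lexicon(word, lexicon=LEXICON):
    """Lexicon check: the recognized word must match an entry exactly."""
    return word in lexicon


def chars_in_range(text, allowed):
    """Range check: every recognized character must be in the allowed set."""
    return all(ch in allowed for ch in text)


def amounts_consistent(items, total):
    """Assumed algebraic relation: itemized amounts add up to the total."""
    return sum(items) == total


def field_passes(record):
    """A record passes only if every individual check passes."""
    return (in_lexicon(record["city"])
            and chars_in_range(record["postal_code"], string.digits)
            and amounts_consistent(record["items"], record["total"]))


good = {"city": "Taipei", "postal_code": "100", "items": [120, 80], "total": 200}
# 'l' misread for 'i', letter 'O' misread for digit '0', inconsistent total.
bad = {"city": "Taipel", "postal_code": "1O0", "items": [120, 80], "total": 210}
results = (field_passes(good), field_passes(bad))
```

A record that fails any of these checks would be a candidate for the manual screen correction described next, even though every character in it was recognized.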
The rejected characters displayed together belong to the same batch but can come from different forms. Many forms can therefore be handled in a single correction pass, which is more efficient.

When a form needs manual correction but contains no rejected characters, the character string in some field has failed the field post-processing check (step 210), and field correction must be performed (step 2[illegible]). Figure 8 is an example of the screen display during field correction. As shown in Figure 8, in a preferred embodiment of the present invention the monitor 58 uses a split-screen display: the image of the field is displayed on one side (in this example, the upper half of the screen) and the recognition result on the other (here, the lower half of the screen). The operator refers to the field image to check and correct characters that were misrecognized or rejected, and can enter the correct characters with an input device such as a keyboard.

If the form passes the field check, it is stored in the database (step 222). If the form does not pass the field check (step [illegible]), the entire form is rejected and entered manually (step 218), that is, all the data in the form are retyped by hand. If the corrected form is acceptable (that is, all errors have been corrected manually), the form data are stored in the database (step 222); otherwise the form is rejected (step 224).

Finally, the recognition data are sent to the format conversion module 146 to be converted into a commonly used database format.
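The correction flow of Figure 6 can be summarized in code as below. This is a sketch under stated assumptions rather than the patented procedure: forms are plain dictionaries, the operator's actions are passed in as callbacks, the acceptance test after whole-form entry is assumed to be the same field check, and every name (`waterfall`, `fix_chars`, `fix_fields`, `reenter`, the `city` field) is hypothetical. Step numbers appear in comments only where the text states them legibly.

```python
from enum import Enum


class Category(Enum):
    CORRECT = "completely correct"
    MANUAL = "manual correction"
    REJECTED = "rejected"


def waterfall(form, field_check, fix_chars, fix_fields, reenter):
    """Return (outcome, stages): outcome is "stored" or "rejected"."""
    stages = []
    if form["category"] is Category.CORRECT:
        return "stored", stages                 # step 222
    if form["category"] is Category.MANUAL:
        if form["rejected"] > 0:                # step 206: rejected characters?
            stages.append("character correction")
            fix_chars(form)
        if not field_check(form):               # step 210: field check failed
            stages.append("field correction")
            fix_fields(form)
        if field_check(form):
            return "stored", stages             # step 222
    # Rejected forms, and forms that still fail the field check, are retyped.
    stages.append("whole-form entry")           # step 218
    reenter(form)
    if field_check(form):                       # assumed acceptance test
        return "stored", stages                 # step 222
    return "rejected", stages                   # step 224


# Stand-ins for the field check and for the operator's actions.
CITIES = {"Taipei", "Hsinchu"}

def field_check(form):
    return form["city"] in CITIES

def fix_chars(form):
    form["city"] = form["city"].replace("?", "i")
    form["rejected"] = 0

def fix_fields(form):
    form["city"] = "Taipei"

def reenter(form):
    form["city"] = "Taipei"
    form["rejected"] = 0

form_a = {"category": Category.MANUAL, "rejected": 1, "city": "Taipe?"}
form_b = {"category": Category.MANUAL, "rejected": 0, "city": "Taipel"}
outcome_a = waterfall(form_a, field_check, fix_chars, fix_fields, reenter)
outcome_b = waterfall(form_b, field_check, fix_chars, fix_fields, reenter)
```

In this simulation the first form needs only character correction and the second needs only field correction, so neither reaches the more expensive whole-form entry stage.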
The format-converted data and the optical form images can then be stored, queried, sorted, or otherwise processed.

Rejected characters are corrected on a least-work-first principle: individual characters are reviewed and corrected first, before fields or entire forms. In addition, the character correction step raises the likelihood that a form passes the field check and the whole-form check, and many forms can be handled effectively at the same time.

The advantages of the present invention include ease of operation and shorter character extraction time. Character recognition is improved, and the manual correction process is one of the most effective ways to correct scanned and recognized forms. The ability to correct recognition results on the screen and to extract and store data effectively are also advantages. Together these improve the ability to enter, read, and store large volumes of printed and handwritten form data.

The present invention is not limited to the embodiments disclosed above; other modifications, substitutions, and structures that do not depart from the scope of the invention are also included.