TW200409046A

TW200409046A - Optical character recognition device, document searching system, and document searching program

Info

Publication number: TW200409046A
Application number: TW92100430A
Authority: TW
Inventors: Takeshi Eisaki; Katsumi Marukawa; Shigeyuki Fujiwara
Original assignee: Hitachi Ltd
Priority date: 2002-11-21
Filing date: 2003-01-09
Publication date: 2004-06-01
Also published as: CN1503193A; TWI285849B; CN100351847C; JP2004171316A

Abstract

The object of the present invention is to provide a document search method for paper documents and document images that applies character recognition technology to retrieve the documents containing specific keywords. To this end, the optical character recognition (OCR) device is separated from the search device, and the OCR output takes the form of a file (the OCR reading hypothesis file) that permanently stores multiple hypotheses for text line extraction, character segmentation, and character recognition. The keyword search function is built on this OCR reading hypothesis file, and a system is provided that performs the required document search and document classification.

Description

200409046 (1) 玖、發明說明 [發明所屬之技術領域] 本發明是關於應用文字辨識技術從紙文書群或文書影像群中檢索含有特定鍵檢索之文書群並取得必要之資訊的文書檢索·處理方法、其裝置、及文書檢索處理程式。 [先前技術] 雖然現在之數位資訊技術已因爲電腦而普及，紙文書仍然是被廣泛採用的資訊傳送媒體。然而’對於想要從大量的文書中以某關鍵字來檢索必要資訊、或想要對含有特定關鍵字群之文書進行檢索並自動分類等之要求，很明顯的，和數位資料相比，紙文書有難以對應的問題。爲了解決此問題，出現以紙文書檢索及自動處理爲目的之各種方法。從紙文書或文書影像等檢索必要之關鍵字的手段，有每次需要檢索時以OCR(光學讀取裝置）辨識紙文書再進行檢索之線上處理、以及一開始即以〇C R讚取並將讀取結果永遠保留然後進行檢索之非線上處理。例如，郵件區分機等裝置即屬於線上處理。此種線上處理時，因指定想要檢索之關鍵字，可利用關鍵字含有之文字特性（全形、半形、漢字、英數字等）來變更文字切割的參數，或者，可執行限定文字辨識時之字種的處理等來提高檢索精度。相對的，因爲每次檢索都必須執行影像解析及文字辨識，故在重複檢索之運用形態時，以計算時間之觀點而言，並不切 -6- (2) (2)200409046 實際。本發明所提供之方法則以非線上處理爲基礎。紙文書之非線上的關鍵字檢索之最基本方法，就是利用〇C R將紙文書轉換成正文檔案，再對正文檔案進行檢索。然而，因爲一般以〇C R轉換之正文碼會含有錯誤，而產生有時無法以單純正文檢索來處理的情形。當然，可以利用人工在OCR轉換後再修正正文碼，然後再對修正結果進檢索。然而，從處理速度及成本方面而言，利用人工修正實在很不切實際。提高OCR之讀取精度的手段，以對〇CR之辨識結果執行形態分析之方法爲大家熟知（參照專利文獻1。）。的確 ’形態分析等之知識處理可以訂正誤讀，然而，卻無法實現100%之訂正。此外，一般之形態分析所使用之辭典以新聞等一般文章爲對象，爲了要以良好精度來校正特殊業務用途之文書，則必須追加定義適合該分野之特殊辭典。因此’維修性及計算量方面都仍存在問題。胃7避免文字誤讀對檢索產生不良影響，有人提出利 OCR容易誤讀之類似文字的資訊來執行單語檢索的方法 (參照專利文獻2。）。又，也有人提出容許OCR之讀取結果具：有複數之文字辨識候補，然後從其中選擇文字碼來檢出單語之方法（參照專利文獻3)。的確，使用這些技術可避免1文字單位之誤讀對單語檢索產生不良影響。然而’利用前述方法時，會因爲文字分離或文字互相接觸等而無法明確界定文字圖案之境界，故文字圖案之誤切割時就無法對應。例如，〇CR將「y、少」之文字讀成「 (3) (3)200409046 」時，前述專利方法可對應，然而，讀成r y〆」時則無法對應。又，對於含有圖或表之文書、或單據形式等含有許多格線之文書等，往往很難在讀取文字以前實施文字行之檢出·鑑別。然而，前述方法無法處理此問題。 [專利文獻1] 日本特開平〇 5 -1 〇 8 8 9 1號公報 [專利文獻2] 日本特開平1 0-74250號公報 [專利文獻3 ] 曰本特開平9-1343 69號公報 [發明內容] 本發明之目的，在提供以文字辨識結果爲基礎從紙文書群檢索必要關鍵子之卓語檢索方法、利用其結果執f了文書檢索·文書分類等之處理的文書檢索處理系統及其裝置、以及記錄檢索處理程式之記錄媒體。傳統方法之對紙文書群的文書檢索，是對OCR讀取結果之正文進行檢索，然而，很難處理文字變形或變淡等造成OCR之文字識別錯誤、文字圖案境界之模糊導致OCR之文字切割錯誤、或文書-圖面-格線之混合存在造成O CR之文字行析出錯誤等問題。本發明之第1目的，就是提供可避免因OCR讀取導致文字識別、文字切割、文字行析出等之錯誤而對單語檢索產生不良影響的方法。又，使用關鍵字群之文書檢索·文書分類處理時，一 -8- (4) (4)200409046 般會使用特定關鍵字及其集合規則（AND或〇R:和或或）來執行處理。例如，檢索同時（A N D)具有「〇c 
R」及「檢索」之單語的文書之實例。對傳統正文文書執行檢索時，因會以1或0之2數値來規定有無關鍵字，只要單純處理即可適用集合規則’然而，本方法因和文字辨識相關，故關鍵字之有無會以〇至1之連續値的槪度來表示。因此，若對於槪度較低之關鍵字一律採用交集規則來執行文書檢索，則有無法進行充份檢索之問題，而若一律忽略槪度較低之關鍵字來執行文書檢索，則有無法檢索到必要文書之問題。本發明之第2目的，是提供利用文字識別之槪度來導出單語檢索之槪度及交集規則之槪度，且利用自動學習來管理文書檢索之精度的方法。爲了達成前述第1目的，本發明將OCR及檢索裝置分離，OCR之輸出形態採用可永久保存文字行析出、文字切割、及速字識別之多重假設的檔案（OCR讀取假設檔案），以此OCR讀取假設檔案爲基礎來構成檢索關鍵字之機能，進而提供可執行必要文書之檢索及文書之分類的系統。爲了達成前述第2目的，提供一種機構，使OCR讀取假設檔案含有文字識別之類似度、文字圖案之位置資訊等，並將其當做計算檢索到之關鍵字的槪度、及關鍵字規則集合時之槪度的資訊，並依據這些槪度來決定文書檢索結果之受理·廢棄。 [實施方式] -9- (5) (5)200409046 以第1圖爲例來槪說傳統方法及本發明方法之不同。第1圖爲傳統單語檢索方法及文書檢索方法、以及本發明方法之差異的模式圖。首先，傳統方法之流程中，有以1 〇 1表示之紙文書群，而利用以1 0 2表示之〇C R來執行讀取。將讀取結果當做以103表示之正文檔案輸出。其次，將正文檔案輸入以1〇4 表示之裝置，執行單語檢索。此流程中，檢索對象之單語是參照單語DB( 113)。然而，本來爲「血液化學檢查」之文字，OCR之讀取結果卻將其讀成「皿液 < 匕學檢查」時，以正文檔案爲基礎無法檢索到「血液化學檢查」之單語，此時，一般會視爲檢索失敗。因此，即使利用以1 〇 5表示之裝置對檢索之單語採用文書檢索規則（1 1 4)執行處理，亦因爲必須適用規則之單語不存在，而檢索失敗。因而無法對文書執行檢索·篩選。相對於此，本發明之處理流程中，首先，有以107表示之紙文書群，利用以108所示之 OCR讀取。將讀取結果當做以109所示之OCR讀取假設檔案輸出。其次，將OCR讀取假設檔案輸入至以1 10所示之裝置，執行單語檢索。必須檢索之單語定義於以11 3表示之單語DB。因OCR讀取假設檔案含有各種文字行析出候補、文字分割候補、及文字識別候補，除了「皿液4 t:學檢查」之結果以外，尙可獲得正確識別結果--「血」、「化」的結果，使單語檢索更爲容易。其次，利用以1 i 1表示之裝置，依據記述檢出之單語及單語間的關係之文書檢索規則進行文書之檢索·篩選。文書檢索規則記載於以1 1 4表 -10- (6) (6)200409046 示之規則D B。文書檢索規則之實例如「「〇CR」及「檢索」之單語共同存在之文書」，爲以〇R或AND連結複數單語之構造等。使用OCR讀取假設檔案可提高單語檢索之精度，結果則是可適用文書檢索規則並可執行以1 1 2表示之文書檢索·篩選。 OCR讀取假設檔案含有可完全鑑別相對應之紙文書或文書影像的文書ID碼，且可永久儲存於磁性儲存裝置。使用〇C R讀取假設檔案之檢索系統，在出現文書檢索要求時，會從預先儲存之OCR讀取假設檔案檢索必要之關鍵字，對照文書檢索規則，儲存適合之文書的文書ID碼。檢索結果會同時顯示利用文書ID碼鑑別之紙文書或文書影像等。利用此方式，即使OCR裝置及檢索裝置爲分離形態，亦可構成統一處理文書影像及讀取資料之文書處理系統。針對第2圖進行說明。本發明實施例之單據辨識裝置時，首先，〇C R裝置會實施紙文書攝影，並將其轉換成電子影像資料（2 0 1)。若文書本來就是電子影像資料時，可省略本處理。其次，以電子影像資料爲基礎，執行格線析出、框構造解析、讀取對象框之位置推算等文書構造解析 (202)。此時’使用之辨識處理爲公知技術（日本特開平〇9_ 319824、日本特開2000-251012等）。其次，接收文書構造解析之結果，析出讀取對象之文字行候補（2 0 3 )。其次，再從文字行影切割文字圖案候補（204)，再識別各文字圖案候補（205)。從對象文書析出複數之文字行候補、文字圖案候補、及文字識別候補，構成多重假設。最後，將文 -11 - (7) (7)200409046 字行候補、文字切割圖案候補、及其識別結果輸出至檔案 (20 6)。此輸出之檔案稱爲〇CR讀取假設檔案。後面會對 OCR讀取假設檔案進行詳細說明。前述處理201至206是利用光學讀取裝置等專用裝置將紙文書轉換成OCR讀取假設檔案的過程。相對於此，若爲電子影像資料時，則以影像讀取（ 
207)來取代處理201，將其轉換成OCR讀取假設檔案。此時，若有轉換程式及以驅動程式爲目的之汎用演算裝置，則可執行處埋。前面所述之各資訊，儲存於第10圖所示之OCR裝置的下述位置。由紙文書轉換而成之影像資料、或預先準備之處理對象的影像資料，會儲存於外部儲存裝置1 004或記憶體1 00 5。OCR程式儲存於外部儲存裝置1 004或記憶體1005 ，利用中央演算裝置1 006來執行處理。解析影像資料結果所所得之框資訊、行資訊、候補網狀結構、候補文字網狀結構則以記憶體1〇〇5爲主實施展開。本處理之輸出的OCR 讀取假設檔案，會透過外部儲存裝置1004、記憶體1 005、或通信裝置1007儲存於外部裝置。針對第3圖進行說明。第3圖爲使用OCR讀取假設檔案之文書檢索引擎的處理流程圖。首先’讀取對應檢索對象之紙文書群（或文書影像群）的0CR讀取假設檔案’針對各 OCR讀取假設作成候補文字網狀結構（3〇1)。其次’將候補文字網狀結構及檢索對象之單語群視爲輸入’執行單語檢索（302)。因OCR讀取假設檔案含有各種文字行候補、文字切割候補、及文字識別候補i ’而必須執行已檢索之單語是 -12- (8) (8)200409046 否正確的判定處理。其後，針對檢索之結果，依文字識別之槪度或順位、及圖案之排列等資訊，計算已檢索之單語的槪度，決定是否受理或廢棄單語檢索結果（ 303 )。文字識別之槪度或順位、及圖案之排列等相關資訊包含於OCR 讀取假設檔案內。後面會對OCR讀取假設檔案進行詳細說明（和第1 2圖〜第1 6圖相關）。其次，再針對含有已檢索之單語群的文書，應用文書檢索規則執行文書檢索（304)。最後，再針對已檢索之文書，依據經過規則篩選之檢出單語的槪度、或採用之規則的重要性，決定受理或廢棄文書檢索結果（3 0 5 )。針對第4圖進行說明。第4圖是詳細說明前述處理303 。此處理中，針對已檢索之單語，使用文字識別之槪度、文字圖案之配置資訊、及相對於單語之文書影像的配置資訊等，計算檢出單語之槪度。檢出單語之槪度計算上，首先會考量文字列路徑（已檢索之單語以文字碼列及文字圖案列之組合來表示。將其稱爲路徑。詳細說明如第5圖所示）上之文字圖案的識別槪度來計算單語之識別槪度（401) 。其次，計算和文字圖案之配置相關的損失（402)。例如，相對於統計學上之平均値，將相對於路徑整體之高度的文字高度比、相對於路徑整體之中心線的文字中心線偏離、平均文字寬度、及和相鄰之文字圖案的間隔等的偏離程度視爲損失的方法。在考量已檢出之單語整體的位置下，計算其損失（403)例如，會使用檢出單語是否位於文書影像中之特定區域內的資訊等。然而，儲存於OCR讀取假設 -13- 200409046 Ο) 檔案之資訊會有數階段之層級（後述），可對應其層級而省略處理402及處理403。後面會詳細說明OCR讀取假設檔案〇針對第5圖及第6圖進行說明。第5圖爲單語檢索之過程的槪念圖。第6圖爲候補文字網狀結構之槪念圖及資料之詳細圖。以第5圖爲基礎說明單語檢索之流程。對讀取對象文字行（a)執行認爲是文字圖案的各種切割，作成候補文字圖案，再對各候補文字圖案執行文字識別作成候補文字網狀結構（b)。候補文字網狀結構具有最低限之文字圖案、具有依文字識別結果所得之順位的識別碼群、及候補文字網狀結構中之文字圖案間的相連關係資訊。OCR讀取假設檔案含有部份此種資訊。其形態則爲二進位形態、或使用XML等之標記的正文形態。因本發明之方法使用 OCR讀取假設檔案，候補文字網狀結構會依據從檔案讀取之資訊來作成。其次，使用文字列表示知識（c)，從候補文字網狀結構計算文字列路徑（d)。實例中，文字列表示知識採用以OR記號（I)來區隔單語之方式。亦即，代表夾於記號I之間的單語群被指定爲檢索對象。文字列表示除了此表示以外，尙可使用嘗試法、或上下文無關文法等（如曰本特開20(Π-0143 11等之記載）。第6圖爲文字候補網狀結構之詳細說明。文字候補網狀結構之表現上，爲以架構 (6 0 1)來表現文字圖案之候補、及以節點（6 〇 2)來表現文字圖案之境界的有向圖。各文字圖案含有代表左右（直書時爲上下）之節點（圖案境界）的境界ID編號、文字識別候補 -14- (10) (10)200409046 (60 3)、及識別類似度（604)之資訊。單語檢索處理則爲’ 將此候補文字網狀結構及文字列表示知識視爲輸入搜尋候補文字網狀結構含有之單語及其圖案列的處理。例如，文字列表示知識上之「血液化學檢查」單語，在第6圖之候補文字網狀結構中進行搜尋而找到如以605圈出之文字碼及文字圖案。搜尋文字碼及文字圖案之演算法爲公知技術 (日本特願平1 0-28077、日本特願平1 1 - 1 8 7 5 
3等）。確定單語檢索之結果、文字列路徑。文字列路徑爲由文字碼歹[]( 亦即文字列）、及對應各文字碼之文字圖案所構成的資訊〇前述之各資訊，會儲存於第10圖所示之檢索裝置的下述位置。OCR讀取假設檔案會儲存於外部儲存裝置101 2或記憶體1013。單語檢索程式亦儲存於外部儲存裝置1012或記憶體1013，利用中央演算裝置1014來執行處理。依據讀取假設檔案作成之候補文字網狀結構會在記憶體1 〇 1 3上展開。對其執行單語檢索，並經由外部儲存裝置1 01 2、記憶體1 0 1 3、或通信裝置1 0 15，將檢索結果儲存於外部裝置。針對第7圖進行說明。第7圖利用本發明方法之文書檢索系統的一畫面構成例。此處，以處方文書之檢索系統爲例。首先，在輸入欄7 0 1指定欲檢索之關鍵子’在輸入欄 702指定以何種規則處理檢索關鍵字。在此圖中，選取之規則代表尋找指定之全部關鍵字的其中之一。將前述2項目視爲輸入，對儲存著OCR讀取假設檔案之資料庫執行處方文書檢索。顯示欄7 0 3會顯示檢索結果所得到的處方名 -15- (11) 200409046 稱。顯示欄704會顯示檢索到之文書當中目前顯示之的相關資料。顯示欄705會以視覺方式顯示檢索結 OCR讀取假設檔案因爲具有和原紙文書或文書影像完應之文書ID碼，故可同時顯示文書影像及檢索結果。檢索到之單語會顯示於有706之底線的位置。顯示文索結果時，可依可利用OCR讀取假設檔案計算之檢出槪度及檢索文書槪度來設定優先順序。針對第8圖進行說明。第8圖爲使用OCR讀取假設之檢索系統的文字切割及文字識別之多重假設化的效。圖（a)爲讀取對象之文書（的部份影像），以粗線框住份相當於1個行假設。圖（b)中，以無特別知識之一般讀取此部份時，原本應爲「少卩7 K錠」會被讀成「 u y卜''症」。因爲，「少」爲2個文字圖案的合成，被以分離方式讀取，而「V」因顏色較淡而被誤讀成第1位的結果，又，「錠」因爲部份變形而被誤讀成第1位的結果。相對於此，OCR讀取假設上，會具有 (c)所示之候補文字網狀結構。亦即，雖然會存在將」讀成「/」及「1/」之假設，亦會存在讀成「少」設，又，「7」及「錠」等之1位文字識別結果雖然誤讀成「V」、「症」，但在進一步之識別候補中則正確識別結果之「7」及「錠」。對OCR之正文讀取執行單語檢索時，必須從「/ 1/ y V K症」檢索「少卜''錠」之單語，此時，若以編輯距離測量2文字列之，則爲成爲1文字插入2文字不讀取，以單語而言，無文書果。全對又，書檢單語檔案果圖之部 OCR J v 故會識別識別如圖「少之假會被含有結果距離法將 -16-· (12) (12)200409046 其視爲類似。另一方面，使用OCR讀取假設檔案之檢時，不會有文字插入及不讀取的情形，而使單語檢索更爲容易。結果，可以檢索到如圖（d)所示之正確單語。針對第9圖進行說明。第9圖爲使用OCR讀取假設檔案之檢索系統的文字行之多重假設化的效果圖。圖U)爲讀取對象之文書（的部份影像）。圖（b)則爲從其中利用單一假設析出文字行時的結果。此圖中，會以將圖（a)中之中間3 行視爲1行執行析出。因爲，將文字行朝橫向影射切割時，行爲被夾於印刷行之間，因爲有手寫行及蓋章行，故影射時無法形成明確之分隔，而將其判斷成1行。相對於此，因爲不但允許前述單一假設，亦容許複數之行假設，故會將圖（b)所示之較粗的文字行進一步切割成較細的文字行，並將其視爲假設，構成如圖（c)所示之文字行假設群。針對前述複數之行假設展開OCR讀取假設檔案，並對其執行單語檢索，結果，可檢索到如圖（d)所示之正確單語。OCR讀取假設檔案不但會儲存文字切割、文字識別之資訊，亦會儲存文字行假設檔案。OCR讀取假設檔案含有之資訊會在後面進行詳細說明（和第12圖〜第16圖相關）。針對第1 0圖進行說明。第1 〇圖爲利用本發明之方法，以OCR裝置及檢索裝置分離之形態構成文書檢索系統時之一構成實例。第1〇圖之上段爲OCR裝置之一構成實例’而第1 0圖之下段則爲檢索裝置之一構成實例。首先，上段之OCR裝置會利用影像輸入裝置（1001)將文書轉換成電子資料，並將其儲存於外部儲存裝置（1 0 0 4) -17- (13) (13)200409046 及記憶體（ 1 005)，然後利用中央演算裝置（ 1006)讀取。文書格式之定義等，儲存於外部儲存裝置（1004) ’文書構造解析時，會參照儲存於此之定義。這些處理可經由操作終端裝置（ 1 002)由人執行操作，處理結果等則可利用顯示終端裝置（ 1003)來顯示，資料則會儲存於外部儲存裝置、或透過通信裝置（ 1 
007)傳送至外部裝置。OCR之讀取結果，亦會如傳統裝置所示，將其視爲正文檔案執行輸出，亦可將其視爲OCR讀取假設檔案執行輸出。OCR讀取假設檔案會被儲存於外部儲存裝置、或經由通信裝置傳送至外部之裝置。此時，OCR讀取假設檔案含有對應OCR讀取之文書（或影像）的文書ID碼。利用此文書ID碼，可執行紙文書或文書影像、及OCR讀取假設檔案之對應。利用其和OCR讀取假設檔案之對應，可實現下述文書檢索機能，例如，提供將檢索到之單語顯示於原來之文書影像上之人類較易理解的GUI機能、以及選取含有目的單語之文書影像等。第7 圖即是單語檢索之GUI的一構成實例，然而，此時採同時顯示文書影像（705)及檢索到之單語（70.6)之方式。此顯示機能可利用在OCR讀取假設檔案上檢索到之單語的位置資訊、以及對應OCR讀取假設檔案之ID的影像檔案來實現。第10圖下段之檢索裝置，會利用前述OCR機能裝置輸出之OCR讀取假設檔案來執行檢索，具有針對一旦產生 OCR讀取假設檔案之文書重複執行（只要存在假設檔案）無限次數之檢索的機能。此檢索裝置會從通信裝置（1 0 1 5)讀取OCR讀取假設檔案並將其下載至記憶體（1013)，再利用 -18- (14) (14)200409046 中央演算裝置（1 〇 1 4)執行檢索處理。欲檢索之單語及文書檢索規則可儲存於外部儲存裝置、或利用操作終端裝置 (1 〇 11)輸入。單語之檢索結果則會透過顯示終端裝置 (1011)顯示，又，會透過通信裝置將資料傳送給外部機器、或將檢索結果儲存於外部儲存裝置。這些裝置會利用內部匯流排（ 1 008、1 009、101 6)進行連結。針對第1 1圖進行說明。第1 1圖爲將文書檢索系統應用於實際業務上之自動學習機構的模式圖。首先，對文書檢索系統輸入大量紙文書·文書影群（110 1 )，作成對應各文書之OCR讀取假設檔案（1102)。其次，利用OCR讀取假設檔案執行單語檢索（1 1 〇 3)。此時，檢索對象之單語儲存於資料庫（1 1 1 〇)，各單語會附有代表該單語之重要度、及檢索時之槪度臨界値的可學習參數（11 1 1)。其次，對檢索到之單語（1104)應用文書檢索規則（1105)。此時，文書檢索規則儲存於資料庫（1 1 1 2)，各規則會附有代表該規則之重要度、及應用時之槪度臨界値的可學習參數（1 1 1 3)。其次 ’依據對象文書群中之檢索槪度等決定檢索之受理·廢棄 ’確定檢索文書群（或未符合補集合之檢索條件的文書群：：：非檢索文書群），結果則會通過顯示器等之顯示裝置對使用者進行顯不（11 0 6)。使用者將顯不之結果當做判斷材料 ’直接利用檢索結果當中之必要文書（1 1 〇 7 )，並將檢索結果中之垃圾（無意義之檢索結果）、或未出現於檢索結果之文書相關資料回饋至系統（1108)。學習機構（1109)會針對文書檢索結果，以降低被判斷爲檢索垃圾者之檢索槪度的 -19- (15) (15)200409046 方式來調整其參數（1 1 1 1、1 1 1 3)，而以提高未出現於檢索候補之文書的檢索槪度之方式來調整其參數。針對學習進行更詳細之說明。本發明之方法可針對檢出之單語，從識別槪度及文字配置之槪度來計算檢出單語之槪度。使用此檢出單語槪度，即使其和檢索規則相關，亦能計算其槪度（符合度）。例如，將文書檢索規則訂爲檢索對象之單語及if-then規則。此時，if-then規則之真假値會將檢出之單語的槪度當做乏晰邏輯値來表示。一般而言，if-then規則則以分解成下述邏輯演算。200409046 (1) 发明 Description of the invention [Technical field to which the invention belongs] The present invention relates to document retrieval and processing for retrieving a document group containing a specific key search from a paper document group or a document image group using text recognition technology and obtaining necessary information Method, its device, and document retrieval processing program. 
[Prior Art] Although computers have made digital information technology commonplace, paper documents remain a widely used medium for conveying information. When one needs to search a large volume of documents for necessary information by keyword, or to retrieve and automatically classify documents containing a specific group of keywords, paper documents are clearly harder to handle than digital data. Various methods for searching and automatically processing paper documents have been devised to address this. There are two ways to search paper documents or document images for keywords: online processing, in which the paper document is recognized by OCR (an optical character reader) each time a search is needed, and offline processing, in which the document is read by OCR once at the outset, the reading result is retained permanently, and searches are run against it. A mail sorting machine is an example of online processing. In online processing the keyword to be searched is known in advance, so retrieval accuracy can be raised by adjusting the character segmentation parameters to the character properties of the keyword (full-width, half-width, kanji, alphanumeric, and so on) or by restricting the character classes considered during recognition. On the other hand, image analysis and character recognition must be executed for every search, so in terms of computation time it is impractical for use cases in which searches are repeated. The method provided by the present invention is based on offline processing. The most basic offline method for keyword search of paper documents is to convert the paper documents into text files with OCR and then search the text files.
However, text produced by OCR generally contains errors, so simple text search sometimes fails. The text could of course be corrected by hand after OCR conversion and the corrected result searched, but manual correction is impractical in terms of processing speed and cost. A well-known way to improve OCR reading accuracy is to apply morphological analysis to the OCR recognition result (see Patent Document 1). Knowledge processing such as morphological analysis can indeed correct misreadings, but it cannot correct 100% of them. In addition, the dictionaries used in general morphological analysis target general text such as news articles; to correct documents for a specialized business purpose with good accuracy, a special dictionary suited to that field must be additionally defined. Problems therefore remain in both maintainability and computational cost. To keep character misreadings from harming retrieval, a method has been proposed that performs word search using information about similar characters that OCR easily confuses (see Patent Document 2). A method has also been proposed that lets the OCR reading result carry multiple character recognition candidates and detects words by selecting character codes from among them (see Patent Document 3). These techniques do prevent misreadings of single characters from harming word search. However, when characters are broken apart or touch one another, the boundaries of the character patterns cannot be clearly determined, and these methods cannot cope with the resulting segmentation errors. For example, the above methods can cope with one kind of misreading of the characters 「y、少」, but not when they are read as 「r y〆」.
In addition, for documents containing figures or tables, or form-style documents with many ruled lines, it is often difficult to detect and identify text lines before the characters are read. The above methods cannot deal with this problem either. [Patent Document 1] Japanese Patent Laid-Open No. H05-108891 [Patent Document 2] Japanese Patent Laid-Open No. H10-74250 [Patent Document 3] Japanese Patent Laid-Open No. H09-134369 [Summary of the Invention] An object of the present invention is to provide a word search method that retrieves necessary keywords from a group of paper documents on the basis of character recognition results, a document search processing system and device that use the results to perform document search, document classification, and similar processing, and a recording medium on which the search processing program is recorded. Conventional document search over paper documents searches the text of the OCR reading result, but it has difficulty with character recognition errors caused by deformed or faint characters, character segmentation errors caused by ambiguous character pattern boundaries, and text line extraction errors caused by text, figures, and ruled lines being mixed together. The first object of the present invention is to provide a method that keeps errors in character recognition, character segmentation, and text line extraction during OCR reading from harming word search. Furthermore, document search and document classification using a keyword group generally operate on specific keywords combined by set rules (AND or OR). An example is a search for documents that contain both (AND) the words "OCR" and "search".
When a conventional text document is searched, the presence or absence of a keyword is a binary value of 1 or 0, so set rules can be applied directly. In the present method, however, keyword detection involves character recognition, so the presence of a keyword is expressed as a confidence, a continuous value from 0 to 1. If an intersection rule is applied uniformly to keywords with low confidence, the search is inadequate; if keywords with low confidence are uniformly ignored, necessary documents are missed. The second object of the present invention is to provide a method that derives the confidence of a word search and the confidence of a set rule from the confidence of character recognition, and that manages document search accuracy through automatic learning. To achieve the first object, the present invention separates the OCR device from the search device and has the OCR output a file (the OCR reading hypothesis file) that permanently stores multiple hypotheses for text line extraction, character segmentation, and character recognition; the keyword search function is built on this file, and a system is provided that performs the required document search and document classification. To achieve the second object, a mechanism is provided in which the OCR reading hypothesis file contains the character recognition similarities, the positions of the character patterns, and related information; this information is used to calculate the confidence of each retrieved keyword and the confidence of each keyword rule combination, and document search results are accepted or rejected according to these confidences.
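The contrast between binary keyword presence and keyword presence expressed as a confidence between 0 and 1 can be illustrated with a small sketch. The text above does not state how rule confidences are combined, so the sketch assumes the common fuzzy-logic convention (AND as minimum, OR as maximum); the function names, the nested-tuple rule format, and the threshold value are hypothetical and chosen only for illustration.

```python
def rule_confidence(rule, word_conf):
    """Evaluate a keyword rule against per-word confidences in [0, 1].

    `rule` is either a keyword string or a tuple ("AND" | "OR", operand, ...).
    Assumption: AND is taken as the minimum and OR as the maximum of the
    operand confidences (a common fuzzy-logic convention, not a formula
    taken from the text above).
    """
    if isinstance(rule, str):
        # A keyword that was never detected has confidence 0.
        return word_conf.get(rule, 0.0)
    op, *operands = rule
    values = [rule_confidence(operand, word_conf) for operand in operands]
    if op == "AND":
        return min(values)
    if op == "OR":
        return max(values)
    raise ValueError(f"unknown operator: {op}")


def accept(confidence, threshold=0.5):
    """Accept a search result when its confidence reaches the threshold."""
    return confidence >= threshold


# Hypothetical confidences for two keywords detected in one document.
detected = {"OCR": 0.9, "search": 0.4}
both = rule_confidence(("AND", "OCR", "search"), detected)
either = rule_confidence(("OR", "OCR", "search"), detected)
```

With these illustrative values the AND rule inherits the weaker keyword's confidence and falls below the threshold, while the OR rule passes; a learnable threshold would sit where the fixed `0.5` is used here.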
[Embodiment] The difference between the conventional method and the method of the present invention is outlined using Fig. 1. Fig. 1 is a schematic diagram of the differences between the conventional word search and document search methods and the method of the present invention. In the conventional flow, a group of paper documents (101) is read by an OCR device (102), and the reading result is output as a text file (103). The text file is then input to a device (104) that performs word search; the words to be searched are taken from the word DB (113). However, if text that is actually 「血液化學檢查」 (blood chemistry test) is read by OCR as 「皿液 < 匕學檢查」, the word cannot be found in the text file, and the search is normally treated as a failure. Consequently, even if a device (105) applies the document search rules (114) to the retrieved words, the word to which a rule must apply is absent, the search fails, and the documents cannot be searched or screened. In the processing flow of the present invention, by contrast, a group of paper documents (107) is read by an OCR device (108), and the reading result is output as an OCR reading hypothesis file (109). The OCR reading hypothesis file is then input to a device (110) that performs word search. The words to be searched are defined in the word DB (113).
Because the OCR reading hypothesis file contains multiple text line extraction candidates, character segmentation candidates, and character recognition candidates, it yields not only the result 「皿液4 t:學檢查」 but also the correct recognition results 「血」 and 「化」, which makes word search easier. Next, a device (111) searches and screens documents according to document search rules that describe the detected words and the relationships among them. The document search rules are recorded in the rule DB (114). A document search rule, for example "documents in which the words 'OCR' and 'search' both appear", is a structure that joins multiple words with OR or AND. Using the OCR reading hypothesis file improves word search accuracy, so the document search rules can be applied and document search and screening (112) can be performed. Each OCR reading hypothesis file contains a document ID code that uniquely identifies the corresponding paper document or document image, and can be stored permanently on a magnetic storage device. When a document search request arrives, a search system using OCR reading hypothesis files searches the previously stored files for the required keywords, checks them against the document search rules, and stores the document ID codes of the matching documents. The search results are displayed together with the paper document or document image identified by the document ID code. In this way, even when the OCR device and the search device are separate, a document processing system can be built that handles document images and reading data in a unified manner. Fig. 2 is described next.
In the form recognition device of this embodiment, the OCR device first captures the paper document and converts it into electronic image data (201). If the document is already electronic image data, this step can be omitted. Next, document structure analysis, such as ruled line extraction, frame structure analysis, and estimation of the positions of the frames to be read, is performed on the electronic image data (202). The recognition processing used here is known art (Japanese Patent Laid-Open No. H09-319824, Japanese Patent Laid-Open No. 2000-251012, and others). Next, using the result of the document structure analysis, candidate text lines to be read are extracted (203). Candidate character patterns are then cut from the text line images (204), and each candidate character pattern is recognized (205). Multiple text line candidates, character pattern candidates, and character recognition candidates are extracted from the target document to form multiple hypotheses. Finally, the text line candidates, character segmentation pattern candidates, and their recognition results are output to a file (206). This output file is called the OCR reading hypothesis file; it is described in detail later. Steps 201 to 206 are the process by which a dedicated device such as an optical reader converts a paper document into an OCR reading hypothesis file. When the input is electronic image data, image loading (207) replaces step 201 and the data is converted into an OCR reading hypothesis file. In that case, the processing can be executed given a conversion program and a general-purpose computing device to run it. The information described above is stored in the following locations of the OCR device shown in Fig. 10.
The image data converted from paper documents, or image data prepared in advance for processing, is stored in the external storage device 1004 or the memory 1005. The OCR program is stored in the external storage device 1004 or the memory 1005 and is executed by the central processing unit 1006. The frame information, line information, candidate lattice, and candidate character lattice obtained by analyzing the image data are expanded mainly in the memory 1005. The OCR reading hypothesis file output by this processing is stored in an external device via the external storage device 1004, the memory 1005, or the communication device 1007. Fig. 3 is described next. Fig. 3 is a processing flowchart of a document search engine that uses OCR reading hypothesis files. First, the OCR reading hypothesis files corresponding to the paper documents (or document images) to be searched are read, and a candidate character lattice is created for each OCR reading hypothesis (301). Next, word search is performed with the candidate character lattice and the group of words to be searched as input (302). Because the OCR reading hypothesis file contains multiple text line candidates, character segmentation candidates, and character recognition candidates, each retrieved word must be checked for correctness. To that end, the confidence of each retrieved word is calculated from information such as the confidence or rank of character recognition and the arrangement of the patterns, and the word search result is accepted or rejected (303). The information on recognition confidence or rank and on pattern arrangement is contained in the OCR reading hypothesis file.
The OCR reading hypothesis file is described in detail later (in connection with Figs. 12 to 16). Next, for documents containing the retrieved words, document search is performed by applying the document search rules (304). Finally, for each retrieved document, the document search result is accepted or rejected according to the confidence of the detected words that passed the rules and the importance of the rules applied (305). Fig. 4 is described next. Fig. 4 details step 303 above. In this step, the confidence of each detected word is calculated from the character recognition confidence, the arrangement of the character patterns, the position of the word within the document image, and similar information. First, the recognition confidence of the word is calculated from the recognition confidence of the character patterns on the character string path (401). (A retrieved word is represented by a combination of a character code sequence and a character pattern sequence, called a path; see Fig. 5 for details.) Second, a penalty related to the arrangement of the character patterns is calculated (402). One approach is to treat as the penalty the deviation, from statistical averages, of quantities such as the ratio of each character's height to the height of the whole path, the offset of each character's center line from the center line of the whole path, the average character width, and the spacing between adjacent character patterns. Third, a penalty is calculated from the position of the detected word as a whole (403), for example using whether the detected word lies within a specific region of the document image.
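The three-part calculation of steps 401 to 403 can be sketched as follows. The description names the ingredients but not the formulas, so every formula here is an assumption made for illustration: the geometric mean for the recognition confidence, a height-ratio deviation as the only layout feature, a binary region test for the position penalty, and multiplicative weights. All function and parameter names are hypothetical.

```python
import math


def recognition_confidence(similarities):
    """Step 401 (assumed form): geometric mean of per-character similarities."""
    if not similarities or min(similarities) <= 0.0:
        return 0.0
    return math.exp(sum(math.log(s) for s in similarities) / len(similarities))


def layout_penalty(heights, tolerance=0.5):
    """Step 402 (assumed form): mean shortfall of each character's height
    relative to the tallest character on the path, scaled to [0, 1]."""
    path_height = max(heights)
    mean_deviation = sum(1.0 - h / path_height for h in heights) / len(heights)
    return min(mean_deviation / tolerance, 1.0)


def position_penalty(word_box, region):
    """Step 403 (assumed form): 0 if the word's center lies in the expected
    region, otherwise 1. Boxes are (x0, y0, x1, y1)."""
    cx = (word_box[0] + word_box[2]) / 2
    cy = (word_box[1] + word_box[3]) / 2
    inside = region[0] <= cx <= region[2] and region[1] <= cy <= region[3]
    return 0.0 if inside else 1.0


def word_confidence(similarities, heights=None, word_box=None, region=None,
                    w_layout=0.3, w_position=0.2):
    """Combine steps 401 to 403; the optional steps are skipped when the
    hypothesis file does not carry the layout or position information."""
    score = recognition_confidence(similarities)
    if heights is not None:
        score *= 1.0 - w_layout * layout_penalty(heights)
    if word_box is not None and region is not None:
        score *= 1.0 - w_position * position_penalty(word_box, region)
    return score
```

Making the layout and position arguments optional mirrors the idea that the penalty steps apply only when the corresponding information is available.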
Note that the information stored in the OCR reading hypothesis file comes in several levels (described later), and steps 402 and 403 can be omitted depending on the level. The OCR reading hypothesis file is described in detail later. Figs. 5 and 6 are described next. Fig. 5 is a conceptual diagram of the word search process. Fig. 6 is a conceptual diagram of the candidate character lattice together with a detailed view of its data. The word search flow is explained with reference to Fig. 5. The text line to be read (a) is cut in every way that could plausibly yield character patterns, producing candidate character patterns, and character recognition is run on each candidate to create the candidate character lattice (b). At a minimum, the candidate character lattice holds the character patterns, the group of recognition codes ranked by the character recognition result, and the connection relationships between character patterns in the lattice. The OCR reading hypothesis file contains part of this information, in binary form or in a text form using markup such as XML. Because the method of the present invention uses the OCR reading hypothesis file, the candidate character lattice is built from the information read from the file. Next, using the string representation knowledge (c), character string paths (d) are computed from the candidate character lattice. In the example, the string representation knowledge separates words with the OR symbol (|); that is, the words between the | symbols are designated as search targets. Besides this notation, a trie, a context-free grammar, or the like can also be used (as described in Japanese Patent Laid-Open No. 20(Π-0143 11 and others).
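A word search over a candidate character lattice can be sketched as a depth-first walk over a directed graph. This is a minimal illustration, not the search algorithm of the cited applications: the `Arc` type, the `find_word` function, the node numbering, and all similarity values are hypothetical. The sample lattice models the misreading discussed earlier, where 「血」 is ranked below 「皿」 and 「化」 is also cut into two patterns.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    """One candidate character pattern between two boundary nodes."""
    start: int
    end: int
    candidates: tuple  # ((character, similarity), ...) in rank order


def find_word(arcs, word):
    """Return every path of consecutive arcs whose candidates spell `word`.

    Each path is a list of (arc, rank, similarity) triples, where rank is the
    1-based position of the matching character among the arc's candidates.
    """
    by_start = defaultdict(list)
    for arc in arcs:
        by_start[arc.start].append(arc)
    results = []

    def extend(node, index, path):
        if index == len(word):
            results.append(list(path))
            return
        for arc in by_start.get(node, []):
            for rank, (char, similarity) in enumerate(arc.candidates, start=1):
                if char == word[index]:
                    path.append((arc, rank, similarity))
                    extend(arc.end, index + 1, path)
                    path.pop()
                    break

    for start in list(by_start):
        extend(start, 0, [])
    return results


# Hypothetical lattice; similarity values are illustrative only.
lattice = [
    Arc(0, 1, (("皿", 0.80), ("血", 0.75))),  # correct character is rank 2
    Arc(1, 2, (("液", 0.95),)),
    Arc(2, 3, (("<", 0.60),)),               # over-segmented left half
    Arc(3, 4, (("匕", 0.70),)),               # over-segmented right half
    Arc(2, 4, (("化", 0.85), ("比", 0.50))),  # alternative single pattern
    Arc(4, 5, (("學", 0.90),)),
    Arc(5, 6, (("檢", 0.90),)),
    Arc(6, 7, (("查", 0.90),)),
]
paths = find_word(lattice, "血液化學檢查")
```

The word is found even though the top-ranked reading of the line does not contain it, because the walk can take the rank-2 candidate on the first arc and the arc that spans nodes 2 to 4 instead of the two over-segmented arcs.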
Fig. 6 shows the candidate character lattice in detail. The lattice is represented as a directed graph in which arcs (601) represent candidate character patterns and nodes (602) represent character pattern boundaries. Each character pattern holds the boundary ID numbers of its left and right nodes (upper and lower nodes for vertical writing), its character recognition candidates (603), and their recognition similarities (604). Word search takes this candidate character lattice and the string representation knowledge as input and searches for the words contained in the lattice and their pattern sequences. For example, searching the lattice of Fig. 6 for the word 「血液化學檢查」 (blood chemistry test) in the string representation knowledge finds the character codes and character patterns circled at 605. The algorithm that searches for character codes and character patterns is known art (Japanese Patent Application No. H10-28077, Japanese Patent Application No. H11-18753, and others). The result of word search is a character string path: information consisting of a character code sequence (that is, a character string) and the character pattern corresponding to each character code. The information described above is stored in the following locations of the search device shown in Fig. 10. The OCR reading hypothesis file is stored in the external storage device 1012 or the memory 1013. The word search program is likewise stored in the external storage device 1012 or the memory 1013 and is executed by the central processing unit 1014. The candidate character lattice built from the reading hypothesis file is expanded in the memory 1013. Word search is run on it, and the search results are stored in an external device via the external storage device 1012, the memory 1013, or the communication device 1015.
Fig. 7 is described next. Fig. 7 shows an example screen of a document search system using the method of the present invention, here a search system for prescription documents. The keywords to be searched are entered in input field 701, and the rule by which the keywords are processed is selected in input field 702. In this figure, the selected rule means finding any one of the specified keywords. With these two items as input, a prescription document search is run against the database that stores the OCR reading hypothesis files. Display field 703 lists the names of the prescriptions found. Display field 704 shows information about the document currently displayed among those retrieved. Display field 705 shows the search result visually. Because each OCR reading hypothesis file carries a document ID code that corresponds exactly to the original paper document or document image, the document image and the search result can be displayed together. Retrieved words are shown at the positions underlined at 706. When document search results are displayed, they can be prioritized by the detected word confidence and the retrieved document confidence computed from the OCR reading hypothesis file. Fig. 8 is described next. Fig. 8 illustrates the effect of keeping multiple hypotheses for character segmentation and character recognition in a search system using OCR reading hypothesis files. Panel (a) is (part of the image of) the document to be read; the area framed by the thick line corresponds to one line hypothesis. In panel (b), when this area is read by a general OCR with no special knowledge, what should be 「少卩7 K錠」 is read as 「 u y卜''症」.
The reasons are as follows: "less" is a combination of two character patterns, which are read separately; the first-rank result "V" is a misreading caused by faint print; and "ingot" is misread in the first-rank result because the character is partly deformed. In contrast, the OCR reading hypothesis file holds the candidate character lattice shown in panel (c). That is, although there are hypotheses that read the pattern as "/" and "1 /", there is also a hypothesis that reads it as "less". Likewise, the first-rank recognition results for single characters such as "7" and "ingot" are the misreadings "V" and "symptom", but the correct "7" and "ingot" appear among the lower-ranked recognition candidates. A word search on plain OCR text would have to find the word "Shaobu" in "/ 1 / y VK Syndrome". Measured by edit distance, the two strings differ by one inserted character and two misread characters, so they cannot be regarded as similar words and the search returns no result. With the OCR reading hypothesis file, on the other hand, no character insertion or misreading arises, which makes the word search easier, and the correct word is retrieved as shown in panel (d).

Figure 9 is described next. Figure 9 shows the effect of multiple text line hypotheses in a retrieval system using OCR reading hypothesis files. Panel (a) is the document (partial image) to be read. Panel (b) is the result of extracting text lines under a single hypothesis. In this figure the middle three lines of (a) are extracted as one line.
This happens because, when text lines are cut in the horizontal direction, these lines are sandwiched between printed lines, and handwritten lines and stamp lines are also present, so no clear gap appears in the projection and the region is judged to be one line. In contrast, because multiple hypotheses are allowed in addition to this single hypothesis, the thick text line shown in panel (b) is further cut into thinner text lines, each treated as a hypothesis, forming the group of text line hypotheses shown in panel (c). Building the OCR reading hypothesis file from these multiple line hypotheses and running the word search on it yields the result shown in (d). The OCR reading hypothesis file therefore stores not only character segmentation and character recognition hypotheses but also text line hypotheses. The information contained in the OCR reading hypothesis file is described in detail later (with Figures 12 to 16).

Figure 10 is described next. Figure 10 is an example configuration of a retrieval system in which, using the method of the present invention, the OCR device is separated from the retrieval device. The upper part of Figure 10 is an example configuration of the OCR device, and the lower part is an example configuration of the retrieval device. First, the OCR device in the upper part uses the image input device (1001) to convert a document into electronic data, stores it in the external storage device (1004) and the memory (1005), and then reads it with the central processing unit (1006). Document format definitions are stored in the external storage device (1004) and are referred to when the structure of a document is analyzed. These processes can be operated by a person through the operation terminal device (1002), and the results can be shown on the display terminal device (1003); the data is stored in the external storage device or transmitted to an external device through the communication device (0077). The OCR reading result can be output as a text file, as with conventional devices, or as an OCR reading hypothesis file. The OCR reading hypothesis file is stored in the external storage device or transmitted to an external device via the communication device. The file contains a document ID code corresponding to the document (or image) read by the OCR. With this document ID code, paper documents or document images can be matched to their OCR reading hypothesis files. This correspondence enables document retrieval functions such as a human-readable GUI that displays the retrieved words on the original document image, and selection of the document images that contain the target words. Figure 7 shows an example GUI for word search, in which the document image (705) and the retrieved word (706) are displayed together. This display can be realized using the position information of the words retrieved from the OCR reading hypothesis file and the image file corresponding to the ID of that file. The retrieval device in the lower part of Figure 10 performs retrieval using the OCR reading hypothesis file output by the OCR device; once the file has been generated, the documents can be searched any number of times (as long as the file exists).
The retrieval device reads the OCR reading hypothesis file through the communication device (1015), loads it into the memory (1013), and performs the search with the central processing unit (1014). The words to be searched and the document retrieval rules can be stored in the external storage device or entered from the operation terminal device (101). The word search results are displayed on the display terminal device (1011), transmitted to an external device through the communication device, or stored in the external storage device. These devices are connected by internal buses (1008, 1009, 1016).

Figure 11 is described next. Figure 11 is a schematic diagram of an automatic learning mechanism for applying the document retrieval system to actual business. First, a large number of paper documents or document images are input to the document retrieval system (1101), and an OCR reading hypothesis file is created for each document (1102). Next, word search is performed on the OCR reading hypothesis files (1103). The words to be searched are stored in a database (1110), and each word is accompanied by learnable parameters (1111) representing the importance of the word and its detection threshold. Next, the document retrieval rules (1105) are applied to the retrieved words (1104). The document retrieval rules are stored in a database (1112), and each rule is accompanied by learnable parameters (1113) representing the importance of the rule and the threshold used when the rule is applied.
Next, the retrieved document group (or its complement, the non-retrieved document group made up of documents that do not meet the search conditions) is determined by accepting or rejecting each document according to the retrieval likelihood of the target documents, and the result is shown to the user on a display device (1106). The user examines the displayed result, uses the needed documents (1107) among the search results directly, and feeds back to the system (1108) information about garbage (meaningless search results) in the results and about documents that did not appear in the results. The learning mechanism (1109) adjusts the parameters (1111, 1113) so as to lower the retrieval likelihood of the documents judged to be garbage and to raise the retrieval likelihood of the documents that did not appear among the search candidates.

The learning is now explained in more detail. For each detected word, the method of the present invention can compute a word detection likelihood from the recognition likelihood and the likelihood of the character arrangement. With this word detection likelihood, a likelihood (degree of conformance) can also be computed for a retrieval rule. For example, suppose a document retrieval rule is an if-then rule over the words to be searched. The truth value of the if-then rule is then expressed in fuzzy logic using the likelihoods of the detected words. In general, an if-then rule decomposes into the following logical operations.

Logical product A ∩ B, logical sum A ∪ B, negation ~A.

If detected words are substituted for A and B, the recognition likelihood of each word is treated as a fuzzy truth value, and the fuzzy operators for these elements become:

likelihood(A ∩ B) = MIN(likelihood(A), likelihood(B))
likelihood(A ∪ B) = MAX(likelihood(A), likelihood(B))
likelihood(~A) = 1 - likelihood(A)

Here likelihood(X) is the function that computes the likelihood of a word X or of a logical expression X. In this way the character recognition likelihood is reflected in the document retrieval rules as well. For example, an important rule can still be applied, and the document retrieval weighted accordingly, even when the recognition likelihood of a particular word is somewhat low.
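The three fuzzy operators above can be evaluated recursively over a rule expression. The following is a minimal sketch, not the patent's implementation: the class names `Word`, `And`, `Or`, `Not` and the function `likelihood` are hypothetical, and treating a word that was never detected as having likelihood 0.0 is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Word:
    """A search word; its likelihood comes from the word search."""
    text: str


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Not:
    operand: "Expr"


Expr = Union[Word, And, Or, Not]


def likelihood(expr: Expr, word_likelihood: Dict[str, float]) -> float:
    """Fuzzy likelihood of a rule expression: MIN for the logical product,
    MAX for the logical sum, and 1 - x for negation."""
    if isinstance(expr, Word):
        # Assumption: a word that was not detected has likelihood 0.0.
        return word_likelihood.get(expr.text, 0.0)
    if isinstance(expr, And):
        return min(likelihood(expr.left, word_likelihood),
                   likelihood(expr.right, word_likelihood))
    if isinstance(expr, Or):
        return max(likelihood(expr.left, word_likelihood),
                   likelihood(expr.right, word_likelihood))
    if isinstance(expr, Not):
        return 1.0 - likelihood(expr.operand, word_likelihood)
    raise TypeError(f"unsupported expression: {expr!r}")


# Illustrative rule: A AND (B OR NOT C), with made-up word likelihoods.
example_rule = And(Word("A"), Or(Word("B"), Not(Word("C"))))
example_score = likelihood(example_rule, {"A": 0.75, "B": 0.5, "C": 0.25})
```

Because the result stays in the range 0 to 1, a rule likelihood computed this way can itself be compared against a per-rule threshold, which is what makes the later parameter learning possible.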
In addition, when a word search miss (a word discarded because its likelihood is low) or a rule matching miss (a rule discarded because its likelihood is low) prevents information that should have been detected from being extracted, the threshold used in word search and the likelihood parameters used in rule matching are fine-tuned so as to raise the likelihood (detection likelihood, rule conformance likelihood). The system can thus learn to become a retrieval system better suited to actual use.

In document retrieval generally, search performance is measured on two scales, recall and precision. Recall measures what fraction of the documents that should have been retrieved the search engine actually found. Precision measures what fraction of the documents found by the search engine were the ones actually wanted. The purpose of the learning process is to raise recall and precision using feedback from the user. To raise precision, the fed-back information on which documents the user selected is used to adjust the parameters so that the detection likelihood increases for the documents the user selected. To raise recall, the documents missed by the search are found among the non-retrieved document group (1106 in Figure 11) by random sampling or a similar method, and the parameters are adjusted so that their detection likelihood increases.

A concrete learning algorithm is the steepest descent method, among others. Suppose there is a search word list {W1, W2, ..., Wn}. Suppose also that likelihood thresholds {T1, T2, ..., Tn} have been set for searching these words. That is, the pairs {(W1, T1), (W2, T2), ..., (Wn, Tn)} of words and their search likelihood thresholds have been input to the retrieval system.
Suppose that a search over the OCR reading hypothesis file finds a word Wk with recognition likelihood Lk (this likelihood need not reflect character recognition alone; it preferably also takes into account the arrangement of the character patterns). The detection likelihood of the word can then be expressed as a function of the likelihood threshold Tk and the recognition likelihood Lk: Fk = F(Tk, Lk). F may be a discrete function: for example, the detection likelihood is 0 if Lk is below the threshold Tk and 1 if Lk is above it. Alternatively, F may be a sigmoid function of the difference Lk - Tk, or a similar continuous function. As described above, the likelihood of a rule can also be computed as the likelihood of a logical expression, based on the likelihood functions defined for the basic logical operators. The likelihood of a rule containing the word Wk is a function of the likelihood of Wk and is written R(Fk). Since Fk is a function of the parameter Tk, the rule likelihood can also be regarded as a function of Tk and written R(Fk) = R'(Tk). Learning is supervised: the system is told which rule applications must be strengthened and which must be ignored. For example, when a rule must be strengthened, it suffices to adjust the parameters related to the word Wk so that the rule likelihood R = R(Fk) increases.
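The two forms of the detection likelihood F(Tk, Lk) mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the patent's code: the function names, the `gain` and `rate` parameters, and the choice to return 1 when Lk equals Tk are all hypothetical. The last function shows one possible way to adjust Tk so that the likelihood rises, for the simplest case of a rule consisting of the single word Wk (so that R equals Fk), by stepping along the derivative of the sigmoid form.

```python
import math


def detect_step(threshold: float, recognition: float) -> float:
    """Discrete form: 0 below the threshold Tk, 1 above it.
    Assumption: equality counts as detected (the text leaves this open)."""
    return 1.0 if recognition >= threshold else 0.0


def detect_sigmoid(threshold: float, recognition: float,
                   gain: float = 10.0) -> float:
    """Continuous form: a sigmoid of the difference Lk - Tk.
    `gain` is a hypothetical steepness parameter."""
    return 1.0 / (1.0 + math.exp(-gain * (recognition - threshold)))


def strengthen(threshold: float, recognition: float,
               rate: float = 0.1, gain: float = 10.0) -> float:
    """Return an adjusted Tk that raises the sigmoid detection likelihood.

    For F = sigmoid(gain * (Lk - Tk)), dF/dTk = -gain * F * (1 - F).
    Moving Tk by rate * dF/dTk therefore lowers the threshold and
    increases F for the same recognition likelihood Lk.
    """
    f = detect_sigmoid(threshold, recognition, gain)
    d_f_d_t = -gain * f * (1.0 - f)
    return threshold + rate * d_f_d_t


# Illustrative values: a word recognized at 0.5 against a threshold of 0.6.
old_t = 0.6
new_t = strengthen(old_t, 0.5)
```

The step form is simple but has zero slope almost everywhere, so a derivative-based update only makes sense with the continuous form; the discrete form would need a search-based method instead.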
For example, if the likelihood threshold Tk is the parameter to be learned, the original Tk is perturbed by an amount proportional to the partial derivative δR/δTk of the rule likelihood R'(Tk), regarded as a function of Tk, which increases the value of R'(Tk). This is, of course, a learning method for the case where the rule likelihood R is reasonably smooth with respect to Tk. Besides the steepest descent method described here, there are also parameter learning methods that can handle discrete functions, such as GA (genetic algorithms), SA (simulated annealing), and the simplex method. These learning methods adjust the parameters of the discrimination computation so as to optimize, over the whole target data set, some evaluation measure of how well the data is discriminated. In the present invention, because the rule likelihood is computed from the likelihoods of the detected words, the evaluation measure can be defined as a function expressed explicitly in terms of rule likelihoods, and because the parameters can adjust the precision of word detection, learning is possible whether the functions are continuous or discrete.

The structure of the OCR reading hypothesis file is now described in detail. At a minimum, the file contains a document ID code that corresponds exactly to the original paper document or document image, multiple line hypotheses, and, for each text line candidate, multiple character segmentation hypotheses and character recognition hypotheses. The line hypothesis information, the character segmentation hypotheses, and the character recognition hypotheses are as follows. First, the information needed to hold multiple text line hypotheses is described. The multiple text line hypotheses are a collection of the single text line hypothesis records shown in Figure 12.
The information making up a text line hypothesis can be considered in several levels; in this figure it is divided into three. Level 1 is the minimum information needed to hold multiple line hypotheses. It consists of a line ID identifying the text line, the character segmentation and character recognition hypotheses contained in the text line, and the coordinate information of the text line. A delimiter marking off the whole line hypothesis may be used in place of the line ID. The line ID identifies the information belonging to one text line as a unit; words are detected from the line on the basis of its character segmentation and recognition hypotheses; and the line coordinates prevent the over-detection problem (the same keyword being detected in several line hypotheses). Level 2 is the information needed to perform word search across text lines, namely information representing the link structure between text lines. It is unnecessary for documents such as prescriptions or forms, where most text is itemized and contained within single lines, but it is needed when searching documents with long running text, such as academic papers and general documents. Level 3 is not essential for holding multiple line hypotheses, but it is useful when character re-segmentation and re-recognition are performed on the basis of image information.

Next, the information needed to hold multiple character segmentation and character recognition hypotheses within each text line hypothesis is described. The multiple segmentation and recognition hypotheses of each line consist of a number of the single character pattern hypothesis records shown in Figure 13.
The information making up a character segmentation hypothesis can likewise be considered in several levels. Level 1 is the minimum information needed to hold multiple segmentation hypotheses and multiple recognition hypotheses. The multiple segmentation hypotheses are expressed by the boundary ID numbers cn and nn, which represent the connection relationships between character patterns, and the multiple recognition hypotheses consist of a number of recognition codes dt. The connections between character patterns can be understood as the lattice shown in Figure 6. The cut positions of the character patterns are the nodes of the lattice (the white dots in Figure 6a), and the boundary ID numbers cn and nn are the numbers of those nodes. Level 2 is information used to compute the likelihood of a word search result; it is needed, for example, when the word likelihood is weighted by the arrangement of the character patterns and the recognition similarity dk. Level 3 information is needed when more detailed character pattern analysis is to be performed after the search.

The OCR reading hypothesis file contains the information described above. The OCR device outputs this information to the file at the levels required, and the retrieval device restores the candidate character lattice from the file and then performs word search. Dividing the stored information into levels makes it possible to adjust file size and word search precision to suit the system. The OCR reading hypothesis file may be a binary file or a text file. An embodiment in which the file is written as text using XML markup is described here.
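To make the Level 1 information concrete, the following minimal sketch stores each character pattern as a lattice edge with its boundary IDs `cn` and `nn` and its recognition candidates, then searches the lattice for a word. The `Edge` type, the `find_word` function, and the depth-first search are assumptions made for illustration; the patent only refers to known search algorithms and does not specify one. The sample data reuse the "文字" patterns that appear in the markup examples of this description.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Edge(NamedTuple):
    """One character pattern hypothesis (hypothetical layout)."""
    cn: int                       # boundary ID where the pattern starts
    nn: int                       # boundary ID where the pattern ends
    candidates: Dict[str, float]  # recognition candidate -> similarity


# One matched character: (cn, nn, character, similarity).
Step = Tuple[int, int, str, float]


def find_word(edges: List[Edge], word: str) -> List[List[Step]]:
    """Return every path through the lattice whose candidates spell `word`."""
    if not word:
        return []
    by_start = defaultdict(list)
    for edge in edges:
        by_start[edge.cn].append(edge)

    results: List[List[Step]] = []

    def extend(node: int, pos: int, path: List[Step]) -> None:
        if pos == len(word):
            results.append(list(path))
            return
        ch = word[pos]
        for edge in by_start.get(node, []):
            if ch in edge.candidates:
                path.append((edge.cn, edge.nn, ch, edge.candidates[ch]))
                extend(edge.nn, pos + 1, path)
                path.pop()

    # A word may begin at any pattern boundary.
    for start in sorted(by_start):
        extend(start, 0, [])
    return results


# Two single-character patterns plus one pattern spanning both of them.
LATTICE = [
    Edge(1, 2, {"文": 0.80, "交": 0.71, "大": 0.60}),
    Edge(2, 3, {"字": 0.89, "宇": 0.00, "学": 0.00}),
    Edge(1, 3, {"対": 0.60, "効": 0.57}),
]

paths = find_word(LATTICE, "文字")
```

Each returned path keeps the boundary IDs and similarities of the matched patterns, which is the kind of information a Level 2 likelihood calculation would draw on.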
Before the XML markup examples for the OCR reading hypothesis file, the XML specification currently proposed by JEITA for multiple character recognition hypotheses is described. This specification proposes an XML structure using the tag <mc> for multiple character codes and the in-tag attribute v. The mc tag holds the multiple character recognition codes, and the attribute v holds the recognition similarities. The attribute v may be omitted. Examples of the XML markup follow (Figure 14 shows the character patterns).

Example 1) A text line reads "文字" and its character patterns are recognized as follows:
Recognition result for "文": "文交大", similarities 0.80, 0.71, 0.60
Recognition result for "字": "字宇学", similarities 0.89, 0.00, 0.00
Markup example 1: 文<mc>交大</mc>字<mc>宇学</mc>
Markup example 2: 文<mc v="0.80 0.71 0.60">交大</mc>字<mc v="0.89 0.00 0.00">宇学</mc>

The present invention writes the OCR reading hypothesis file within the framework of this specification. First, to hold multiple character segmentation hypotheses, the in-tag attributes cn and nn are added to express the connection relationships between character patterns. Here cn and nn are the boundary ID numbers representing the boundaries of the character patterns shown in Figure 13. Examples of the XML markup follow (Figure 15 shows the character patterns).
Example 2) A text line reads "文字" and its character patterns are recognized as follows:
Recognition result for "文": "文交大", similarities 0.80, 0.71, 0.60
Recognition result for "字": "字宇学", similarities 0.89, 0.00, 0.00
A pattern spanning "文字" also exists, with recognition result "対効", similarities 0.60, 0.57
Markup example 1: 文<mc cn=1 nn=2>交大</mc>字<mc cn=2 nn=3>宇学</mc>対<mc cn=1 nn=3>効</mc>
Markup example 2: 文<mc cn=1 nn=2 v="0.80 0.71 0.60">交大</mc>字<mc cn=2 nn=3 v="0.89 0.00 0.00">宇学</mc>対<mc cn=1 nn=3 v="0.60 0.57">効</mc>

Next, to hold multiple text line hypotheses, a line information tag <ml> is added to express each text line hypothesis. As for the hierarchy between the tags, mc tags may be contained in an ml tag; that is, any number of ranges running from an <mc> tag to its </mc> tag may appear between an <ml> tag and its </ml> tag. Examples of the XML markup follow (Figure 16 shows the character patterns).

Example 3) Line segmentation hypothesis 1 extracts "文字" as a text line containing the following character patterns:
Recognition result for "文": "文交大", similarities 0.80, 0.71, 0.60
Recognition result for "字": "字宇学", similarities 0.89, 0.00, 0.00
A pattern spanning "文字" also exists, with recognition result "対効", similarities 0.60, 0.57
Line segmentation hypothesis 2 extracts "多重" as a text line containing the following character patterns:
Character codes for "多": "多名", similarities 0.80, 0.71
Character codes for "重": "重乘", similarities 0.89, 0.70
Markup example 1: <ml>文<mc cn=1 nn=2>交大</mc>字<mc cn=2 nn=3>宇学</mc>対<mc cn=1 nn=3>効</mc></ml> <ml>多<mc cn=1 nn=2>多名</mc>重<mc cn=2 nn=3>重乘</mc></ml>

As explained for Figure 12, the information making up a text line hypothesis can be considered in several levels. In particular, the minimum information needed to hold multiple line hypotheses is the line ID identifying the text line, the character segmentation and character recognition hypotheses contained in the text line, and the coordinate information of the text line. The line ID can be replaced by a delimiter marking off the whole line hypothesis. In markup example 1 above, the <ml> tag serves as this delimiter: the part enclosed by the <ml> tag and the </ml> tag holds the character segmentation and character recognition hypotheses of one line.

Next, the markup is extended so that the rectangular coordinates of a line can be expressed. Line coordinates are effective for preventing the over-detection problem (the same keyword being detected in several line hypotheses). To express the rectangle of a line, the in-tag attributes l, r, t, and b are used; they denote the left X coordinate, right X coordinate, top Y coordinate, and bottom Y coordinate of the rectangle circumscribing each line. Other coordinate representations are also possible, such as the center coordinates and size of the line, or the coordinates of the four corners of the line rectangle.
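As an illustration of how a retrieval device might read the <ml>/<mc> markup back into lattice edges, the following sketch parses one line hypothesis with Python's standard `xml.etree.ElementTree`. Several things here are assumptions rather than part of the patent: attribute values are quoted so that the sample is well-formed XML (the markup examples in this description show them unquoted), a hypothetical `<doc>` root element wraps the line, the sample combines the <ml> tag of Example 3 with the v attribute of Example 2, and the character written just before each <mc> tag is taken to be the first-rank candidate.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# (cn, nn, candidate characters, similarities); layout is hypothetical.
ParsedEdge = Tuple[int, int, List[str], List[float]]

SAMPLE = (
    '<doc>'
    '<ml>文<mc cn="1" nn="2" v="0.80 0.71 0.60">交大</mc>'
    '字<mc cn="2" nn="3" v="0.89 0.00 0.00">宇学</mc>'
    '対<mc cn="1" nn="3" v="0.60 0.57">効</mc></ml>'
    '</doc>'
)


def parse_lines(xml_text: str) -> List[List[ParsedEdge]]:
    """Return one list of lattice edges per <ml> line hypothesis."""
    root = ET.fromstring(xml_text)
    lines: List[List[ParsedEdge]] = []
    for ml in root.iter("ml"):
        edges: List[ParsedEdge] = []
        # Text before the first <mc> is the first-rank candidate of that
        # pattern; for later patterns it is the tail of the previous <mc>.
        lead = (ml.text or "").strip()
        for mc in ml.findall("mc"):
            chars = [lead] + list((mc.text or "").strip())
            # The v attribute may be omitted; then no similarities are kept.
            sims = [float(s) for s in mc.get("v", "").split()]
            edges.append((int(mc.get("cn")), int(mc.get("nn")), chars, sims))
            lead = (mc.tail or "").strip()
        lines.append(edges)
    return lines


parsed = parse_lines(SAMPLE)
```

The resulting edges carry exactly the boundary IDs and candidate lists needed to rebuild the candidate character lattice; the l, r, t, b attributes could be read from each <ml> element in the same way with `ml.get(...)`.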
An example of the XML markup using circumscribed rectangle coordinates follows (Figure 16 shows the character patterns).

Example 4) Markup example 1: <ml l=1000 r=1200 t=800 b=850>文<mc cn=1 nn=2>交大</mc>字<mc cn=2 nn=3>宇学</mc>対<mc cn=1 nn=3>効</mc></ml> <ml l=1000 r=1200 t=850 b=900>多<mc cn=1 nn=2>多名</mc>重<mc cn=2 nn=3>重乘</mc></ml>

Similarly, the markup can be extended to describe how lines are linked to one another. For this, the in-tag attributes lc and ln are used to specify the links between lines in the same way as for character patterns. An example of the XML markup follows (Figure 16 shows the character patterns).
Example 5) Notation example 1:
[opening <ml> tag of the first line illegible in the source] 文 <mc cn=1 nn=2> 交大 </mc> 字 <mc cn=2 nn=3> 宇学 </mc> 対 <mc cn=1 nn=3> 効 </mc> </ml>
<ml lc=2 ln=2> 多 <mc cn=1 nn=2> 多名 </mc> 重 <mc cn=2 nn=3> 重乘 </mc> </ml>

With the conventional method, document retrieval over a set of paper documents searches the plain text produced by OCR, so it is difficult to deal effectively with character recognition errors caused by deformed or faded characters, character segmentation errors caused by unclear character boundaries, and text line extraction errors caused by documents in which text, figures, and ruled lines are mixed. The present invention performs word retrieval and document retrieval on an OCR reading hypothesis file that holds the alternatives for character recognition, character segmentation, and text line extraction, and therefore avoids these problems.

The present invention also addresses the trade-off between document retrieval performance and word retrieval performance, which the conventional method cannot adjust: if documents are retrieved using only keywords whose character recognition reliability is high, necessary documents are missed, whereas if keywords with low reliability are used as well, irrelevant results appear. Because information contained in the OCR reading hypothesis file, such as character recognition rank, similarity, and pattern layout likelihood, is available, the document retrieval likelihood can be calculated from the likelihood of each word retrieval result and the word retrieval likelihood. User feedback on whether the retrieval results are good is then used for automatic parameter learning aimed at improving retrieval accuracy, so that a document retrieval system matching the user's retrieval intent can be built automatically.

[Brief Description of the Drawings]
Fig. 1 is a conceptual diagram comparing retrieval using the OCR reading hypothesis file with the conventional method.
Fig. 2 is a flowchart of the processing up to output of the OCR reading hypothesis file.
Fig. 3 is a flowchart of retrieval processing using the OCR reading hypothesis file.
Fig. 4 is a flowchart of detecting the path of a retrieved word.
Fig. 5 is a conceptual diagram of word extraction performed on the candidate character network.
Fig. 6 is a conceptual diagram of the candidate character network.
Fig. 7 is an example of a screen layout of the document retrieval system.
Fig. 8 is diagram 1 showing the effect of the OCR reading hypothesis file.
Fig. 9 is diagram 2 showing the effect of the OCR reading hypothesis file.
Fig. 10 is an example of a configuration of the document retrieval system.
Fig. 11 is a conceptual diagram of the learning flow of document retrieval.
Fig. 12 is data structure diagram 1 of the OCR reading hypothesis file.
Fig. 13 is data structure diagram 2 of the OCR reading hypothesis file.
Fig. 14 is conceptual diagram 1 of a character string pattern represented by the OCR reading hypothesis file.
Fig. 15 is conceptual diagram 2 of a character string pattern represented by the OCR reading hypothesis file.
Fig. 16 is conceptual diagram 3 of a character string pattern represented by the OCR reading hypothesis file.

[Explanation of Reference Numerals]
101 Paper document input to the conventional document retrieval system
102 OCR section of the conventional document retrieval system
103 OCR output format of the conventional document retrieval system
104 Word retrieval section of the conventional document retrieval system
105 Document retrieval section of the conventional document retrieval system
106 Document retrieval result of the conventional document retrieval system
107 Paper document input to the document retrieval system of the present invention
108 OCR section of the document retrieval system of the present invention
109 OCR output format of the document retrieval system of the present invention
110 Word retrieval section of the document retrieval system of the present invention
111 Document retrieval section of the document retrieval system of the present invention
112 Document retrieval result of the document retrieval system of the present invention
113 Word database section used for word retrieval
114 Database section of the document retrieval rules used for document retrieval
201 Image input section of the OCR device
202 Document structure analysis section of the OCR device
203 Text line extraction section of the OCR device
204 Character pattern generation section of the OCR device
205 Character recognition section of the OCR device
206 OCR reading hypothesis file output section of the OCR device

The reference numerals of the following elements are not legible in the source; the elements are listed in their original order:
Processing flow of the OCR device when a document image is input
OCR reading hypothesis file input section of the document retrieval device
Word retrieval section of the document retrieval device
Retrieved-word verification section of the document retrieval device
Retrieval rule application section of the document retrieval device
Retrieved-document verification section of the document retrieval device
Path recognition likelihood calculation section of the document retrieval device
Character layout likelihood calculation section of the document retrieval device
Path layout likelihood calculation section of the document retrieval device
Character pattern on the candidate character network
Pattern boundary on the candidate character network
Character recognition result on the candidate character network
Character recognition similarity on the candidate character network
Word retrieved from the candidate character network
Keyword input field of the document retrieval system screen
Retrieval rule specification field of the document retrieval system screen
Retrieved document display field of the document retrieval system screen
Detailed information display field for retrieved documents of the document retrieval system screen
Retrieved image display field of the document retrieval system screen
Word retrieval result on the document retrieval system screen
Image input device of the OCR device
Operation terminal device of the OCR device
Display terminal device of the OCR device
External storage device of the OCR device
Memory of the OCR device
CPU of the OCR device
Communication device of the OCR device
Communication bus of the OCR device
Network section
Operation terminal device of the retrieval device section
Display terminal device of the retrieval device section
External storage device of the retrieval device section
Memory of the retrieval device section
CPU of the retrieval device section
Communication device of the retrieval device section
Communication bus of the retrieval device section
Paper document input to the document retrieval system
OCR reading hypothesis file created by the document retrieval system
Word retrieval section of the document retrieval system
Word retrieval result obtained by the document retrieval system
Document retrieval rule application section of the document retrieval system
Retrieved and non-retrieved documents obtained by the document retrieval system
Use of the retrieved documents
Teacher signal indicating whether a retrieved document is good or bad
Learning section of the document retrieval system
Retrieval target words of the document retrieval system
Retrieval target word parameters of the document retrieval system
Document retrieval rules of the document retrieval system

1113 Document retrieval rule parameters of the document retrieval system
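The trade-off described in the effects paragraph of this description can be made concrete with a small sketch. The text says only that a document retrieval likelihood is calculated from word retrieval likelihoods; it does not give the formula in this section. The Python sketch below therefore assumes one possible combination: keyword rules are nested AND/OR expressions, AND takes the minimum of its operands, OR takes the maximum, and a document is returned when its score reaches a threshold. The function names, the rule encoding, and the sample likelihood values are all hypothetical.

```python
def document_likelihood(rule, word_likelihoods):
    """Score one document against a nested AND/OR keyword rule.

    rule is either a keyword string or a tuple ("AND" | "OR", operand, ...).
    word_likelihoods maps each keyword to a likelihood in [0, 1]; a keyword
    that was not found in the document counts as 0.0.
    Assumption (not from the source): AND = minimum, OR = maximum.
    """
    if isinstance(rule, str):
        return word_likelihoods.get(rule, 0.0)
    op, *operands = rule
    scores = [document_likelihood(r, word_likelihoods) for r in operands]
    if op == "AND":
        return min(scores)
    if op == "OR":
        return max(scores)
    raise ValueError(f"unknown operator: {op}")


def retrieve(rule, documents, threshold):
    """Return IDs of documents scoring at least threshold, best first."""
    scored = [(document_likelihood(rule, words), doc_id)
              for doc_id, words in documents.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)
            if score >= threshold]


if __name__ == "__main__":
    # Hypothetical word retrieval likelihoods for two scanned documents.
    documents = {
        "doc1": {"OCR": 0.9, "retrieval": 0.7},
        "doc2": {"OCR": 0.3, "search": 0.95},
    }
    rule = ("AND", "OCR", ("OR", "search", "retrieval"))
    print(retrieve(rule, documents, threshold=0.5))
    print(retrieve(rule, documents, threshold=0.2))
```

Raising the threshold keeps only documents whose keywords were read with high likelihood and so risks missing documents; lowering it admits documents matched through weak hypotheses. Under these assumptions the threshold is the kind of parameter that the user feedback described above could adjust.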


Claims

Scope of the patent application

1. An OCR device comprising an image input device that receives input of an image in which characters are recorded, a central processing unit, and an external storage device, characterized in that the central processing unit extracts text line candidates and character segmentation candidates from the input image, performs character recognition on the character segmentation candidates, and stores the character recognition results, the text line candidates, and the character segmentation candidates together as a reading hypothesis file in the aforementioned external storage means.

2. The OCR device of claim 1, wherein the central processing unit further extracts the connection relationships among the character segmentation candidates and the similarity of the character recognition results, incorporates the extracted connection relationships and similarity into the reading hypothesis file, and stores the file in the storage means.

3. The OCR device of claim 1 or 2, wherein the central processing unit further extracts the top, bottom, left, and right coordinate values of the character segmentation candidates, incorporates the extracted coordinate values of the character segmentation candidates into the reading hypothesis file, and stores the file in the storage means.

4. The OCR device of any one of claims 1 to 3, wherein the central processing unit further extracts the top, bottom, left, and right vertex coordinate values of the aforementioned rectangle, incorporates the extracted vertex coordinate values into the reading hypothesis file, and stores the file in the storage means.

5. A document retrieval system comprising a retrieval device having an operation terminal device, an external storage device, a central processing unit, a display terminal device, and a communication device, and the OCR device of any one of claims 1 to 4, which has a communication device and is connected to the retrieval device, characterized in that the central processing unit of the OCR device transmits the reading hypothesis file from the communication device on the OCR device side, and the central processing unit of the retrieval device receives, through the communication device on the retrieval device side, the reading hypothesis file transmitted from the OCR device, uses the information in the received reading hypothesis file to retrieve, from the characters recorded in the image, a character string that matches the keyword input from the operation terminal device, and outputs the retrieval result to the external storage device or the display terminal device.

6. The document retrieval system of claim 5, wherein the central processing unit of the retrieval device further sets a weight for the keyword and changes how the input keyword is handled according to that weight.

7. The document retrieval system of claim 6, wherein the weight of the keyword is set using the past recall and precision in the retrieval records for that keyword.

8. The document retrieval system of any one of claims 5 to 7, wherein the image input device of the OCR device further receives input of a plurality of images; the central processing unit of the OCR device, for each of the plurality of input images, further combines a document ID corresponding to the image, treats the result as the reading hypothesis file, and stores it in the storage means; and the central processing unit of the retrieval device uses the document ID to further identify the image in which the character string matching the keyword was retrieved, and outputs that image to the display terminal device.

9. A program for causing a computer to carry out a retrieval method, executed by a computer having an operation terminal device, storage means, and a display terminal device, characterized by comprising the steps of: receiving input of an image in which characters are recorded; extracting text line candidates from the image; extracting character segmentation candidates from the image; performing character recognition on the character segmentation candidates; storing a file containing the character recognition results, the text line candidates, and the character segmentation candidates in the storage means as a reading hypothesis file; receiving a keyword input from the operation terminal device; reading the reading hypothesis file from the storage means; using the character segmentation candidates and the text line candidates in the reading hypothesis file to retrieve, from the characters recorded in the image, a character string that matches the keyword; and outputting the retrieval result to the storage means or the display terminal device.

10. The program of claim 9, wherein in the step of receiving input of the image, input of a plurality of images can be received; in the step of storing the reading hypothesis file, for each of the plurality of input images, a document ID corresponding to the image is combined and the result is treated as the reading hypothesis file and stored in the storage means; and the program further comprises a step of using the document ID to identify the image in which the character string matching the keyword was retrieved and outputting that image to the display terminal device.
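Claim 7 sets the keyword weight from the past recall (rendered elsewhere as "reproduction rate") and precision ("coincidence rate") of retrievals for that keyword, but the claims do not state how the two are combined. The Python sketch below assumes one common choice, the harmonic mean of the two (the F1 measure), purely for illustration. The function name and the counts in the usage example are hypothetical.

```python
def keyword_weight(relevant_retrieved, retrieved, relevant):
    """Weight a keyword by its past retrieval record.

    relevant_retrieved: past hits for this keyword that the user judged correct
    retrieved:          all past hits returned for this keyword
    relevant:           all documents that should have been returned
    Assumption (not from the claims): the weight is the harmonic mean
    of precision and recall; it is 0.0 when there is no usable record.
    """
    precision = relevant_retrieved / retrieved if retrieved else 0.0
    recall = relevant_retrieved / relevant if relevant else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical record: 10 hits returned, 8 of them correct,
    # out of 16 documents that actually contain the keyword.
    print(round(keyword_weight(8, 10, 16), 3))
```

With such a weight, a keyword that has historically produced many false hits or many misses contributes less than one with a clean record, which is one way to read the weighting of claims 6 and 7.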