JPH0887528A

JPH0887528A - Document filing system

Info

Publication number: JPH0887528A
Application number: JP7232285A
Authority: JP
Inventors: Hiromichi Fujisawa; 浩道藤澤; Atsushi Hatakeyama; 敦畠山; Yasuaki Nakano; 康明中野; Junichi Tono; 純一東野; Toshihiro Hananoi; 歳弘花野井
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1995-09-11
Filing date: 1995-09-11
Publication date: 1996-04-02
Anticipated expiration: 2011-12-04
Also published as: JP2560656B2

Abstract

PURPOSE: To store text data for each document structure by automatically analyzing and recognizing the document structure in a document filing system that has a full-text search function. CONSTITUTION: A document recognition means analyzes the pattern components extracted by an image processing means 322 with reference to the layout rules of document structures stored in a document knowledge file 327, segments the character patterns that make up the characters for each document structure, and recognizes them with a character recognition means 331 to obtain character strings. A storage means stores the character strings obtained by the document recognition means in association with each document structure.

Description

Detailed Description of the Invention

【０００１】[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a document filing system that files documents as images, and more particularly to a document filing system capable of full-text search (searching the body text).

【０００２】[0002]

2. Description of the Related Art Conventional information retrieval methods have mainly provided search by keyword and classification code. Literature and patent information have been built into databases using these methods. What has been put into the databases is mainly bibliographic information, up to and including abstracts, which meets only part of the real need in information retrieval. That is, even when a seemingly relevant paper or patent was found, obtaining its full text required searching through a large number of bookshelves.

[0003] In response, optical disks capable of storing large volumes of data have appeared, and storing the full text in the database as well, to provide a so-called original-document information service, has come to the fore as a social need. The Japan Patent Office's paperless plan follows this trend. In these systems, large numbers of documents are stored on optical disks as image data, and conventional keyword-based information retrieval techniques are applied.

[0004] However, the conventional retrieval techniques can narrow the results only to the order of several tens to several hundreds of documents, and a method is needed to narrow the relevant documents further, to about one tenth of that. One method is to call up the original documents (full text) stored as image data on a terminal so that the searcher reads them by eye. This method is reliable in principle, but retrieving up to several hundred documents as image data involves a large amount of data, and reading them visually one by one is inefficient, so it is problematic in practice.

[0005] Meanwhile, the conventional approach based on keywords and classification codes has an inherent problem: the classification scheme itself changes over time and must be updated constantly. For example, when the need to change the classification scheme arises later, reclassifying a large number of already classified documents is practically impossible. Papers and patents, which record the progress of science and technology, are novel and valuable precisely because they present concepts that do not fit the existing classification scheme. In this sense, keywords and classification schemes, which are meant to represent concepts, cannot be defined in advance, and this is an inherent problem for them as an information retrieval method.

[0006] For these reasons, a method of searching by content with direct reference to the body text of a document is desired. With such a method, a search can use vocabulary for a concept that did not seem important when the document was registered in the database but is recognized as a new concept at the time of the search. It also becomes possible to find important documents directly, without passing through the "filter" of an indexer (a specialist who assigns index terms) at registration time.

[0007] To meet this demand, character patterns must be extracted from the document held as image data and the body text converted to character codes, which can be done by applying character recognition technology. However, even when the documents to be filed are printed, differences in print quality and the variety of typefaces (fonts) make it difficult to expect perfect recognition from conventional character recognition technology. In conventional character readers, imperfect results such as misrecognition and rejection (unrecognizable characters) have been checked and corrected by an operator (see, for example, Hashimoto, "Introduction to Character Recognition," Ohmsha, 1982, pp. 153-154). Therefore, even if recognition accuracy were extremely high, having people check the recognized body text is not realistic when the volume of documents is enormous, and an image-based document filing system that supports body-text search has not been realized to date.

【０００８】[0008]

SUMMARY OF THE INVENTION An object of the present invention is to solve the problems described above and thereby provide a document filing system having a full-text search function that searches by directly referring to the body text of documents.

【０００９】[0009]

In order to achieve the above object, the present invention is characterized by comprising: an image file that stores document images; a document knowledge file that stores layout rules of the document structure for each type of document; image processing means for extracting pattern components from a document image; character recognition means for recognizing character patterns cut out of the document image; document recognition means that, with reference to the layout rules of the document structure stored in the document knowledge file, analyzes the pattern components extracted by the image processing means, cuts out the character patterns making up the characters for each document structure, and has the character recognition means recognize the cut-out character patterns to obtain character strings; storage means for storing the character strings obtained by the document recognition means in association with the document structure; search means for searching the storage means in response to a search request and identifying documents that satisfy the search request; and output means for outputting, from the image file, the document images of the documents identified by the search means.

【００１０】[0010]

OPERATION That is, the document filing system according to the present invention keeps the advantages of handling documents as images while remedying the disadvantages of doing so. Filing systems that handle documents as images have conventionally been searched mainly by separately assigned keywords and bibliographic items; according to the present invention, a search can additionally refer to the text written inside the documents.

[0011] For example, when "ホンブンケンサク" (honbun kensaku, "body-text search") is entered from a search terminal, any document in the searched collection whose body contains a passage such as "……文字認識による本文検索……" ("... body-text search by character recognition ...") is identified and extracted, and that document can be displayed on the terminal as an image.

[0012] Displaying the document as an image avoids the information loss caused by character recognition. In general, character recognition discards secondary information such as the position, size, and font of each character during normalization. After recognition it is therefore no longer known whether a character was set in Gothic or Mincho type, or at what size, so printing in Gothic or in a larger font to signal importance loses its purpose. The analogue in speech is that, once speech has been recognized, who spoke and with what emotion can no longer be known. For documents too, this secondary information matters to the human reader, and simply reducing the document to recognized characters is not a good approach.

[0013] As described above, the first principle of the system of the present invention is that a document is stored as an image while its character portions are additionally stored as character codes.

[0014] Extracting the character portions from the image and converting them to character codes requires character segmentation and character recognition. Conventional techniques can be used for this, but a 100% recognition rate cannot be expected.

[0015] The second principle of the system of the present invention is that, for a character that recognition could not decide, the top-ranked remaining character categories are treated as a set and left as they are in the recognition result string.

[0016] For example, when "……文字認識による本文検索……" is recognized, this system represents the result as "……文〔字学〕認〔識織〕による〔本木〕文検索……". The characters enclosed in 〔〕 are the recognition result for a single character pattern; "〔識織〕" means that the character is either "識" or "織". Conventionally, undecidable characters were always replaced with the correct character codes through operator intervention before being output as the character recognition (OCR) result. The symbols "〔" and "〕" are special symbols to which codes that do not normally appear in text are assigned; 〔 and 〕 are used here only to make the notation easy to read.
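The bracketed notation of paragraph [0016] can be read as one set of candidate characters per character position. The following is a minimal Python sketch of that reading, not the patent's implementation; the function name `parse_ambiguous` is hypothetical, and the visible brackets 〔 〕 stand in for the special codes that the text says would be assigned in practice.

```python
def parse_ambiguous(text, open_mark="〔", close_mark="〕"):
    """Turn a bracketed OCR result into a list of candidate sets,
    one set per recognized character pattern."""
    positions = []
    i = 0
    while i < len(text):
        if text[i] == open_mark:
            # Everything up to the closing mark is one undecided character.
            end = text.index(close_mark, i + 1)
            positions.append(frozenset(text[i + 1:end]))
            i = end + 1
        else:
            # An unambiguous character is a set with a single member.
            positions.append(frozenset(text[i]))
            i += 1
    return positions


result = parse_ambiguous("文〔字学〕認〔識織〕による〔本木〕文検索")
```

Each element of `result` corresponds to one character pattern on the page, so the eleven patterns of the example yield eleven sets, three of which hold two candidates.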

[0017] In a system using the present invention, a document 10 is ultimately converted into a symbolic expression such as 20, as shown in FIG. 1. The symbol string follows the notation known as the S-expression, used in LISP and other languages. The process of converting the document (image) 10 into the symbolic expression 20 is called document understanding or document recognition. The expression has roughly the following meaning: the document is #99, its class is "論文" (paper), VOL = 5, NO = 7, the title is "文〔字学〕認〔識織〕……", the author is "山田〔太大〕郎", and the body is "……自動文字読み取〔りリ〕によるフルテキスト〔ト卜〕サ〔ー一−〕チ……". Here 〔りリ〕 is hiragana versus katakana, 〔ト卜〕 is katakana versus kanji, and 〔ー一−〕 is the katakana long-vowel mark, the kanji numeral one, and the minus sign. Among the characters that are ambiguous to a recognizer, many, as in these examples, are patterns that ordinary means can hardly resolve.

[0018] When searching, the user enters "ホンブンケンサク" in romaji or katakana. The system converts it from kana to kanji. Homophones generally exist: here "ホンブン" is either "本文" (body text) or "本分" (duty), and "ケンサク" is either "検索" (search) or "献策" (proposing a plan). The present method handles such ambiguity automatically.

[0019] Similarly, when "モジヨミトリ" is entered, the okurigana (kana suffixes) are ambiguous, with two or more possibilities. The candidates are "文字読取", "文字読取り", and "文字読み取り"; since it is not known which okurigana the unknown body text uses, all possibilities must in principle be handled.

[0020] Furthermore, when "モジニンシキ" is entered, kana-kanji conversion yields "文字認識" (character recognition) uniquely; but because "文字認識" is sometimes expressed as "文字読み取り" (character reading), it is desirable that "文字読み取り" also be selected automatically as a search key, as a synonym. In that case the okurigana variants are enumerated as in the example above. An asymmetry is generally required here: "文字読み取り" is offered as a synonym of "文字認識", but "文字認識" is not offered as a synonym of "文字読み取り". The present method satisfies this as well.
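The three kinds of key expansion described in paragraphs [0018] to [0020] (homophones, okurigana variants, and one-directional synonyms) can be sketched as simple table lookups. This is an illustrative Python sketch only: the tables `KANA_TO_KANJI`, `OKURIGANA_VARIANTS`, and `SYNONYMS` and both function names are hypothetical and hold just the examples from the text, and the input is assumed to be already split into reading segments.

```python
from itertools import product

# Hypothetical dictionaries containing only the examples given in the text.
KANA_TO_KANJI = {
    "ホンブン": ["本文", "本分"],
    "ケンサク": ["検索", "献策"],
}
OKURIGANA_VARIANTS = {
    "文字読み取り": ["文字読取", "文字読取り", "文字読み取り"],
}
# One-directional on purpose: 文字認識 implies 文字読み取り, not the reverse.
SYNONYMS = {
    "文字認識": ["文字読み取り"],
}


def expand_reading(segments):
    """Enumerate every kanji spelling of a segmented kana reading."""
    choices = (KANA_TO_KANJI[s] for s in segments)
    return ["".join(combo) for combo in product(*choices)]


def expand_key(word):
    """Add okurigana variants of the word and of its synonyms."""
    keys = list(OKURIGANA_VARIANTS.get(word, [word]))
    for synonym in SYNONYMS.get(word, []):
        keys.extend(OKURIGANA_VARIANTS.get(synonym, [synonym]))
    return keys
```

Because `SYNONYMS` is looked up only in the forward direction, `expand_key("文字認識")` includes the three spellings of 文字読み取り, while `expand_key("文字読み取り")` never returns 文字認識, which reproduces the asymmetry the text asks for.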

[0021] In the end, the multiple substrings to be found in the text being searched are represented as a finite-state automaton, as shown in FIG. 2. The character string of the searched text from the example of FIG. 1 is likewise represented by the automaton of FIG. 3. The present invention provides a text search function for the case in which both the search key (substring) and the searched text contain ambiguity (multiple possibilities, that is, elements that cannot be uniquely determined); this is the third principle.
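To make the third principle concrete, the sketch below matches an ambiguous key (a list of alternative spellings) against ambiguous text (one candidate set per character position). It is a deliberately naive scan written for illustration, not the automaton-based search of FIG. 2 and FIG. 3; the names `to_positions` and `find_key` are hypothetical.

```python
def to_positions(text, open_mark="〔", close_mark="〕"):
    """Parse bracketed OCR output into one candidate set per character."""
    positions, i = [], 0
    while i < len(text):
        if text[i] == open_mark:
            end = text.index(close_mark, i + 1)
            positions.append(frozenset(text[i + 1:end]))
            i = end + 1
        else:
            positions.append(frozenset(text[i]))
            i += 1
    return positions


def find_key(key_alternatives, text_positions):
    """Return sorted (start, key) pairs where some alternative of the key
    is consistent with the candidate sets of the text."""
    hits = []
    for key in key_alternatives:
        for start in range(len(text_positions) - len(key) + 1):
            # A key matches if each of its characters is among the
            # candidates recognized at the corresponding position.
            if all(ch in text_positions[start + k] for k, ch in enumerate(key)):
                hits.append((start, key))
    return sorted(hits)


text = to_positions("文〔字学〕認〔識織〕による〔本木〕文検索")
hits = find_key(["本文検索", "本分検索", "文字認識"], text)
```

With the example text of paragraph [0016], the alternative 本文検索 matches at position 7 and 文字認識 at position 0, while 本分検索 matches nowhere because 分 is not among the candidates at position 8.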

[0022] A known method for finding multiple substrings in unambiguous text using their finite-state automaton is that of A. V. Aho et al., "Efficient String Matching: An Aid to Bibliographic Search," Communications of the ACM, Vol. 18, No. 6, 1975.
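For reference, the multi-keyword automaton of the cited paper (commonly called the Aho-Corasick algorithm) can be sketched in Python as follows. This is a compact textbook-style version for unambiguous text only; the function names and the list-of-dicts state layout are choices made for this sketch, and it does not include the patent's extension to ambiguous text.

```python
from collections import deque


def build_automaton(keywords):
    """Build goto, failure, and output tables for a set of keywords."""
    goto, fail, output = [{}], [0], [[]]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)
    # Breadth-first pass: compute failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure state.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(text, automaton):
    """Scan text once and return (start_index, keyword) for every match."""
    goto, fail, output = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((i - len(word) + 1, word))
    return hits


automaton = build_automaton(["本文検索", "文字認識"])
matches = search("文字認識による本文検索", automaton)
```

The text is scanned once regardless of the number of keywords, which is what makes the approach attractive for searching large bodies of text with many alternative keys.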

【００２３】[0023]

EXAMPLES The present invention will now be described on the basis of embodiments.

[0024] FIG. 4 is a block diagram of a document filing system that is one embodiment of the present invention. The system consists of a control subsystem 100, which controls the whole system and provides the database function; an input subsystem 200, for entering documents and registering them in files; a document recognition device 300, for recognizing documents; a text search subsystem 400, which performs high-speed text search; and a terminal subsystem 800, for performing searches.

[0025] The configuration of each subsystem and the flow of operation are described in detail below.

[0026] The input subsystem 200 has, as its basic parts, a CPU (central processing unit) 201 that controls the subsystem, a main memory 202, a system file 251, and a terminal 203. The subsystem is controlled by operations from the terminal 203; the image of each page of a document 220 is read optically by a scanner 221, and the digitized image data is first stored in a video memory 224 via a bus 210. The image data is then redundancy-compressed by an image processor (IP) 223 into MH (Modified Huffman) or MR (Modified Read) code and written back to a different area of the video memory 224.

[0027] The input document image is displayed on the terminal 203 for confirmation, and the operator can enter bibliographic items and the like while viewing the displayed image. As described later, the bibliographic items of fixed-format documents can be read automatically by document understanding, but the bibliographic items of free-format documents, and information not written on the page, must be entered by a person. For example, entry of user-defined classification codes for document content, or of keywords that do not appear on the page, naturally has to rely on the operator. The value and standing of each document must also be assigned by its users themselves, and these too can be entered from the terminal 203. The entered bibliographic data is associated with the (compressed) image data in the video memory 224 and stored in the main memory 202.

[0028] Each document is given a unique number (document ID), and the data is stored in memory so that the image data, bibliographic items, and so on can be retrieved with the document unique number as key. The document unique number can be expressed, for example, as the concatenation of a subsystem ID (such as 'INSYS01') and a character string representing the date and time. For example, INSYS01.850501.132437 denotes a document entered from input subsystem INSYS01 at 13:24:37 on May 1, 1985. In some applications of the system the entry time is important, and the number also serves as a time stamp.
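The document unique number format described in paragraph [0028] (subsystem ID, date, and time joined by periods) can be generated and decoded as in the following sketch. The function names are hypothetical, and the handling of the two-digit year is an assumption: Python's `%y` maps 69 to 99 onto 1969 to 1999, which happens to suit the 1985 example.

```python
from datetime import datetime


def make_document_id(subsystem_id, when):
    """Build an ID such as INSYS01.850501.132437 from a subsystem ID
    and the entry time."""
    return f"{subsystem_id}.{when:%y%m%d}.{when:%H%M%S}"


def parse_document_id(doc_id):
    """Split an ID back into (subsystem_id, entry_time)."""
    subsystem_id, date_part, time_part = doc_id.split(".")
    when = datetime.strptime(date_part + time_part, "%y%m%d%H%M%S")
    return subsystem_id, when


doc_id = make_document_id("INSYS01", datetime(1985, 5, 1, 13, 24, 37))
source, entered_at = parse_document_id(doc_id)
```

Because the timestamp has one-second resolution, an ID built this way is unique only if a subsystem registers at most one document per second; that constraint is an observation about this sketch, not a statement from the source.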

[0029] When a predetermined amount of documents has accumulated in the subsystem 200, or when a prescribed command is issued from the terminal 203, an interrupt signal is sent to a bus adapter 171.

[0030] The control subsystem 100 senses the interrupt signal and reads a predetermined address in the memory 202 of the input subsystem 200. From this it can determine what the input subsystem is requesting.

[0031] When the request is to register the entered documents in the database, operation proceeds as follows.

[0032] Following a prescribed program in a main memory 102, a central processing unit (CPU) 101 obtains the unique numbers of the documents temporarily held in the input subsystem, and then the storage addresses of their bibliographic data (bibliographic items) and image data.

[0033] The control subsystem 100 has a database file 151, which stores and manages symbolic data such as bibliographic data, and an image file 152, which stores and manages image data.

[0034] The bibliographic data read from the input subsystem 200 is written as a new record into the tabular database shown in FIG. 5 (held in the file 151). The table is named MAIN-DIR (main directory) and has the following columns (data fields).

[0035]
・DOC#: serial number of the registered document within this system
・ID#: document unique number assigned by the input subsystem
・NP: number of pages making up the document
・TITLE: title (character string)
・AUTHOR: author name (repeating; multiple values allowed)
・CLASS: code representing the classification, type, etc. of the document
・PUBL#: in-system registration number of the publication (managed in detail in the table shown in FIG. 7)
・VOL, NO, PP: volume, issue, pages
・KWD: multiple keywords
・ABS: unique text number of the abstract, which is held as a character code string (text data)
・TXT: unique number of the body text held as a character code string
・IMG: unique numbers of the image data. Because image data is managed page by page, multiple image unique numbers are recorded.
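One way to picture a MAIN-DIR record is as a typed structure with one field per column. The Python dataclass below is only a sketch: the field names are hypothetical renderings of the column names, and the types and defaults are assumptions, since the source names the columns but does not specify their encoding.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MainDirRecord:
    doc_no: int                      # DOC#: serial number within the system
    doc_id: str                      # ID#: unique number from the input subsystem
    num_pages: int                   # NP
    title: str = ""                  # TITLE
    authors: List[str] = field(default_factory=list)   # AUTHOR (repeating)
    doc_class: str = ""              # CLASS
    publ_no: Optional[int] = None    # PUBL#
    vol: Optional[int] = None        # VOL
    no: Optional[int] = None         # NO
    pages: str = ""                  # PP
    keywords: List[str] = field(default_factory=list)  # KWD
    abs_txtid: Optional[str] = None  # ABS: text number of the abstract
    txt_txtid: Optional[str] = None  # TXT: text number of the body
    img_ids: List[str] = field(default_factory=list)   # IMG: one per page


# Values taken from the FIG. 1 example where the text gives them;
# the page count here is made up for illustration.
record = MainDirRecord(doc_no=99, doc_id="INSYS01.850501.132437",
                       num_pages=2, doc_class="論文", vol=5, no=7)
```

Fields left at their defaults correspond to columns that are filled in later, such as the image numbers written when page images are stored and the text numbers written after document recognition.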

[0036] When bibliographic data is registered, only the subset of these columns that relates to bibliographic data is newly written.

[0037] Next, the page images making up each document are read from the prescribed storage area of the input subsystem into the control subsystem 100 and stored sequentially in free areas of the image file 152. At the same time, each image (one per page) is assigned an image unique number (IMGID). The volume serial number (VOLSER) of the file holding the image data, the file device number (UNIT), the physical storage address within the file (PHYSA), the length of the storage area occupied (SLENG), and so on are written into tables such as those shown in FIG. 6(b) and FIG. 8. The newly assigned IMGID is also recorded in the IMG column of the table MAIN-DIR (FIG. 5).

[0038] The table IMG-LOC shown in FIG. 6(b) is particularly useful when the image file 152 consists of multiple drives or multiple volumes, and manages the location of each image. Naturally, it is updated each time the operator mounts or unmounts a volume.

[0039] FIG. 8 shows the directory provided for each volume of the image file 152; it has the following columns.

[0040]
・IMGID: image unique number
・PN: page serial number within the document (1 to n)
・PHYSA: physical address within the volume
・SLENG: record length (for example, number of sectors)
・CODE: name of the image compression code
・SIZE: image size (number of pixels)
・DOC#: document serial number
and so on. In the figure, the PHYSA column of record 157 holds the start address of image data 158 within the image data area 156 of the image file.

[0041] Once the above operations are complete, the system can be searched by bibliographic items and keywords from the terminal group 800.

[0042] A search condition entered from a search terminal is transferred via a gateway 175 to the CPU 101 of the control subsystem 100. The table MAIN-DIR 153 (FIG. 5) in the database file 151 is searched according to a prescribed search program in the memory 102. Needless to say, the principal columns of table 153 are indexed (by hashing, inverted files, or other means of speeding up search).

[0043] As the result of the search, a list of DOC# values and a list of image unique numbers IMGID are produced from table 153 (FIG. 5) and stored in a prescribed area of the memory 102. When a display request is issued from the search terminal, the table IMG-LOC 154 (FIG. 6(b)) and the table IMG-DIR 155 (FIG. 8) are used to identify the position within the image file, and the image data is read into the memory 102 one image after another. As each image is read, it is transferred to the search terminal and displayed on the screen according to instructions given at the terminal.
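The two-step lookup in paragraph [0043] (IMG-LOC to find the volume and device, then that volume's IMG-DIR to find the address and length) can be sketched with in-memory tables. Everything below is a hypothetical stand-in: the dictionaries, the sample IDs, and the use of byte offsets for PHYSA and SLENG are assumptions made for illustration, whereas the source mentions sector counts as one possible unit.

```python
# Hypothetical stand-ins for IMG-LOC (FIG. 6(b)) and IMG-DIR (FIG. 8).
IMG_LOC = {
    "IMG0001": {"VOLSER": "VOL01", "UNIT": 0},
    "IMG0002": {"VOLSER": "VOL01", "UNIT": 0},
}
IMG_DIR = {
    "VOL01": {
        "IMG0001": {"PN": 1, "PHYSA": 0, "SLENG": 4, "CODE": "MH"},
        "IMG0002": {"PN": 2, "PHYSA": 4, "SLENG": 3, "CODE": "MH"},
    },
}
# Raw contents of each mounted device, keyed by UNIT (byte-addressed here).
VOLUMES = {0: b"AAAABBB"}


def read_image(imgid):
    """Locate one page image and return its stored (compressed) bytes."""
    loc = IMG_LOC[imgid]                    # which volume, which device
    entry = IMG_DIR[loc["VOLSER"]][imgid]   # where on that volume
    volume = VOLUMES[loc["UNIT"]]
    return volume[entry["PHYSA"]:entry["PHYSA"] + entry["SLENG"]]


def fetch_pages(imgids):
    """Return the images of a document in page order (PN)."""
    pages = [(IMG_DIR[IMG_LOC[i]["VOLSER"]][i]["PN"], read_image(i))
             for i in imgids]
    return [data for _, data in sorted(pages)]
```

Keeping the volume location in a separate table means that mounting or unmounting a volume only changes IMG-LOC entries, while the per-volume directory travels with the volume itself.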

[0044] Next, the method of managing the text used for body-content search is described.

[0045] As explained for the main directory MAIN-DIR (FIG. 5), for each document not only the image data but also text represented as character code strings is stored and managed. In this embodiment, the abstract and the body are each stored and managed as text in text files 451, 452, and 453. Each text (character string) is assigned a unique text number, which is recorded in the ABS and TXT columns of table 153 (FIG. 5), in the TXTID column of the TXT-LOC table shown in FIG. 6(a), and in the TXTID column of the TEXT-DIR table shown in FIG. 9.

[0046] FIG. 9 shows how text is stored and managed in the text files 451, 452, and 453. In the figure, the text bodies are stored one-dimensionally in a file storage area 466. Each text (a single character string) is assigned a unique number TXTID and managed in the directory table TEXT-DIR 465. Table 465 has the following columns.

[0047]
・TXTID: text unique number
・NCH: total number of characters making up the text
・PHYSA: physical address at which the text is recorded
・SLENG: length of the text's record on the storage medium
・CCLASS: class of the characters in which the text is written (Japanese with kanji, English, romaji, kana, etc.)
Record 467 of table 465 indicates, for example, that the text it represents is the portion 468 of the storage area in the file.

[0048] Meanwhile, as shown in FIG. 4, text can be recorded on a plurality of volumes, and the text directory described above manages the texts within each volume. When several volumes are mounted, the system must know on which volume a given text resides; the TXT-LOC table shown in FIG. 6(a) manages the location of each text. For each text with unique text number TXTID, the table holds the serial number VOLSER of the volume on which the text is recorded and the number UNIT of the file device on which that volume is mounted. Naturally, TXT-LOC is updated automatically whenever the operator unmounts a physical volume or mounts a new one.
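
A minimal sketch of the TXT-LOC lookup, under stated assumptions: the class and method names are hypothetical, and returning `None` as the UNIT of an unmounted volume is a choice made for the example, not something the text prescribes.

```python
class TxtLoc:
    """TXT-LOC: TXTID -> VOLSER, plus which UNIT each mounted volume is on."""

    def __init__(self):
        self.volser_of = {}   # TXTID -> VOLSER
        self.unit_of = {}     # VOLSER -> UNIT, mounted volumes only

    def register(self, txtid, volser):
        self.volser_of[txtid] = volser

    def mount(self, volser, unit):
        self.unit_of[volser] = unit

    def unmount(self, volser):
        self.unit_of.pop(volser, None)

    def locate(self, txtid):
        """Return (VOLSER, UNIT); UNIT is None if the volume is not mounted."""
        volser = self.volser_of[txtid]
        return volser, self.unit_of.get(volser)


loc = TxtLoc()
loc.register(101, "VOL001")
loc.register(102, "VOL002")
loc.mount("VOL001", 0)
print(loc.locate(101))
print(loc.locate(102))
```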

[0049] In the overall flow of operation, once document image input, bibliographic data entry, and document registration are complete, the document recognition device 300 performs body-text recognition (document understanding) on the registered document. The input to the recognition device is a document image 10 in the image file 152, as shown in FIG. 1, and the recognition result is output as the symbolic expression 20, also shown in FIG. 1. The abstract and body-text portions of the symbolic expression 20 are newly stored and managed in the text files 451 to 453 as described above. Document recognition is explained below with reference to the detailed block diagram of the document recognition device shown in FIG. 10.

[0050] The recognition device 300 is connected to the bus 110 of the control subsystem 100 via the bus adapter 371 and is controlled by the CPU 301. The memory 302 stores the programs that control the operation of the device, together with data such as parameters.

[0051] The image data to be recognized is transferred from the image file 152 to the memory 321. The image data is compression-encoded; the image processing circuit IP 322 decodes it into a bitmap image and stores it in the memory 321 again. The IP 322 then extracts the contours of the patterns from the bitmap image and stores the result in the memory 321.

[0052] The extracted contour data is expressed as follows.

【００５３】[0053]

$$
(\, i \;\; C_i \;\; x_{\max,i} \;\; x_{\min,i} \;\; y_{\max,i} \;\; y_{\min,i} \;\; x_{s,i} \;\; y_{s,i} \;\; (\theta_{1i}\, L_{1i}) \cdots (\theta_{ni}\, L_{ni}) \,) \tag{1}
$$

Here $i$ is the unique number of the contour (1, 2, 3, ...) and $C_i$ denotes the class of the contour: $C_i = 0$ denotes an outer contour (solid line 1001 in FIG. 11) and $C_i = 1$ an inner contour (broken line 1002 in FIG. 11). As shown in FIG. 11, $x_{\max}$, $x_{\min}$, $y_{\max}$, and $y_{\min}$ are the coordinates of the vertices of the bounding rectangle of the contour. $(x_s, y_s)$ are the coordinates of a point $P_s$ on the contour (for example, the point found first in the contour search). The contour data itself is expressed, with $P_s$ as the starting point, as a sequence of pairs of a quantized direction code $\theta$ and the number $L$ of consecutive pixels in that direction, as shown in FIG. 12.
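
To make the run-length direction coding concrete, the sketch below expands a list of $(\theta, L)$ pairs into contour points and recomputes the bounding rectangle. The eight-direction numbering used here (0 = +x, counter-clockwise, y pointing up) is an assumption; the actual code assignment is defined by FIG. 12, which is not reproduced in the text.

```python
# Assumed 8-direction code: 0 = +x, numbered counter-clockwise, y pointing up.
STEPS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def decode_contour(xs, ys, runs):
    """Expand (theta, L) runs starting at (xs, ys) into a list of points."""
    points = [(xs, ys)]
    x, y = xs, ys
    for theta, length in runs:
        dx, dy = STEPS[theta]
        for _ in range(length):
            x, y = x + dx, y + dy
            points.append((x, y))
    return points


def bounding_box(points):
    """Return (x_max, x_min, y_max, y_min), the order used in expression (1)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return max(xs), min(xs), max(ys), min(ys)


# A 3 x 2 rectangle traced from its lower-left corner.
runs = [(0, 3), (2, 2), (4, 3), (6, 2)]
pts = decode_contour(10, 20, runs)
print(pts[-1] == pts[0])      # a closed contour returns to its start point
print(bounding_box(pts))
```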

[0054] Next, from the contour data expressed by expression (1), the skew correction circuit 323 detects the skew angle introduced when the document was scanned, corrects the contour data, and writes it back to the memory 321. As the skew correction algorithm, the method disclosed in Japanese Patent Application No. Sho 60-152210 can be used, for example.

[0055] From the skew-corrected contour data, and in particular from the portion representing the bounding rectangle ($x_{\max}$, $x_{\min}$, $y_{\max}$, $y_{\min}$), the bottom-up segmenter (BSG) 324 next performs line segmentation and column segmentation.

[0056] The bottom-up segmenter BSG takes as input data in the form of expression (1), generates the pattern list expressed by expression (2), and stores it in the memory 321.

【００５７】[0057]

$$
(\, j \;\; x_{\max,j} \;\; x_{\min,j} \;\; y_{\max,j} \;\; y_{\min,j} \,) \tag{2}
$$

Here $j$ is the unique pattern number. Patterns are defined as mutually non-overlapping rectangular regions, and expression (2) also gives the vertex coordinates of each such region. For example, in FIG. 13 the rectangular regions 1008 and 1009 drawn with broken lines are inputs to the BSG, and the rectangle 1010 is obtained as its result. Rectangles 1008 and 1009 are components (elements), each formed from a single contour, whereas rectangle 1010 is a pattern forming one character. The components making up pattern $j$ can be found by searching the contour data of expression (1) for rectangles contained in the rectangular region defined by expression (2). Alternatively, they may be determined separately and stored as data. FIG. 14 shows schematically the result of line segmentation, and FIG. 15 the result of column segmentation.
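
The step from component rectangles to non-overlapping pattern rectangles might be sketched as below. The merge criterion (two bounding rectangles overlap or touch) is an assumption made for illustration; the text does not state the rule the BSG actually applies, and the function names are hypothetical.

```python
def overlaps(a, b):
    """Rectangles are (x_max, x_min, y_max, y_min), as in expression (2)."""
    return a[1] <= b[0] and b[1] <= a[0] and a[3] <= b[2] and b[3] <= a[2]


def union(a, b):
    """Smallest rectangle containing both a and b."""
    return (max(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), min(a[3], b[3]))


def merge_components(rects):
    """Merge overlapping rectangles until the remaining ones are disjoint."""
    patterns = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(patterns)):
            for j in range(i + 1, len(patterns)):
                if overlaps(patterns[i], patterns[j]):
                    patterns[i] = union(patterns[i], patterns[j])
                    del patterns[j]
                    merged = True
                    break
            if merged:
                break
    return patterns


components = [(10, 0, 10, 0), (8, 5, 14, 8), (30, 20, 10, 0)]
print(merge_components(components))
```

In this example the first two components share an area and become one pattern, while the third stays separate.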

[0058] The character segmentation unit (CSG) 325 extracts the patterns that make up characters from the pattern list, referring to document knowledge that collects rules such as the document format. As shown in FIG. 10, the document knowledge is stored in the document knowledge file (DKF) 327.

[0059] For each type of document, the document knowledge file stores structural rules for the layout of the title, author names, author affiliations, abstract, body text, and so on, together with parametric knowledge such as font sizes. This knowledge is written in a format description language. The language disclosed in Japanese Patent Application No. Sho 60-122425 can be used as the format description language.

[0060] The character segmentation unit CSG also merges patterns that properly constitute a single character but have been split into two or more patterns and, conversely, forcibly separates two or more characters that have touched and fused into a single pattern.

[0061] As its result, the character segmentation unit CSG outputs, for each item such as the title, the abstract, or the body text, a list of the numbers of the patterns that make up each character. For example,

【００６２】[0062]

$$
(\,\mathrm{ABSTRACT}\; (\, j_1 \; j_2 \; j_3 \cdots (j_n \; j_{n+1} \; j_{n+2}) \cdots j_N \,)\,) \tag{3}
$$

indicates that the abstract consists of a sequence of characters represented by the pattern numbers $j_k$. Here $(j_n \; j_{n+1} \; j_{n+2})$ indicates that the character in question is composed of the three patterns numbered $j_n$, $j_{n+1}$, and $j_{n+2}$.

[0063] From the pattern list (for example, expression (3)) and the contour data in the memory 321 (expressed by expression (1)), the character recognition unit (CRG) 331 extracts the contour data making up each character pattern as described above and converts it into a data structure from which features can be extracted.

[0064] Known techniques can be used for character recognition, so a detailed description is omitted; after features are extracted from the contour data, each character can be recognized by pattern matching against the standard patterns in the standard pattern file 333. In FIG. 10, the memory STPM 334 holds frequently referenced standard patterns in order to speed up processing.

[0065] As described above, the result of character recognition is output as the symbolic expression 20 shown in FIG. 1. In the final decision step of character recognition, when the similarities obtained by pattern matching satisfy expression (4), the character category (character code) $\omega_k$ that gives that similarity is output.

【００６６】[0066]

$$
\rho_k \ge \rho_l, \qquad \min_{l \ne k}\,(\rho_k - \rho_l) \ge \varepsilon \qquad \text{for } l = 1, 2, \ldots, K \tag{4}
$$

Here $\rho_k$ is the similarity for character category $k$, $K$ is the total number of categories, and $\varepsilon$ is a relative threshold.

[0067] If expression (4) is not satisfied, the set of character categories $\{\omega_k \mid k = k_1, k_2, \ldots\}$ that satisfy expression (5) is output enclosed between two special character codes. For example, the character (code) string $\omega_s\, \omega_{k_1}\, \omega_{k_2} \cdots \omega_e$ is output, where $\omega_s$ denotes "〔" and $\omega_e$ denotes "〕".

【００６８】[0068]

$$
\rho_k \ge \rho_l \quad \text{for } l = 1, 2, \ldots, K, \qquad \rho_k - \rho_{k_i} \le \varepsilon_1, \quad k_i \in \{1, 2, 3, \ldots, K\} \tag{5}
$$

With this processing, when similar characters exist and expression (4) is not satisfied, the input pattern "フルテキストサーチ" (full text search), for example, yields the recognition result "フルテキス〔ト卜〕サ〔ー一−〕チ". The recognition results are buffered in the memory 321 and then transferred in a single batch to the memory 102 (FIG. 4).
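
The decision rule of expressions (4) and (5) can be sketched as follows. The function name, the dictionary input, and the sample similarity values are assumptions for illustration; only the two threshold tests come from the text.

```python
def decide(similarity, eps, eps1):
    """similarity: dict mapping a candidate character to its similarity rho."""
    ranked = sorted(similarity, key=similarity.get, reverse=True)
    best = ranked[0]
    margins = [similarity[best] - similarity[c] for c in ranked[1:]]
    # Expression (4): the best category leads every other one by at least eps.
    if not margins or min(margins) >= eps:
        return best
    # Expression (5): keep every category within eps1 of the best one.
    close = [c for c in ranked if similarity[best] - similarity[c] <= eps1]
    return "〔" + "".join(close) + "〕"


print(decide({"チ": 0.95, "千": 0.70}, eps=0.05, eps1=0.05))
print(decide({"ト": 0.91, "卜": 0.90, "下": 0.60}, eps=0.05, eps1=0.05))
```

The first call has a clear winner and returns a single character; the second returns the bracketed candidate set, best candidate first.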

[0069] The control subsystem 100 finds the largest unique text number by referring to the table TXT-LOC (FIG. 6), adds 1 to it to obtain a new unique text number, and registers the recognized character code string (text) under that number. Registration is performed on the main directory 153, the table TXT-LOC, and the table 465 (FIG. 9), and the text data itself is stored in one of the text files 451 to 453.

[0070] Documents that have been given text data in this way can be searched using the text search subsystem 400.

[0071] Next, the text search subsystem 400 for searching body-text content, and its operation, are described in detail.

[0072] A body-content search request issued at the terminal 800, for example "ABS=＊モジニンシキ＊" (the kana reading of "character recognition"), is first transferred to the control subsystem 100. If the documents to be searched have already been narrowed down by a keyword search or the like, the subsystem 100 selects the unique numbers of the texts attached to those documents from the main directory 153 and, by further referring to the table TXT-LOC, creates for each text file the list (6) of unique numbers of the texts to be searched.

【００７３】[0073]

$$
(\, u_i \;\; v_i \;\; (\, t_{i1} \; t_{i2} \cdots t_{in} \,)\,), \qquad i = 1, 2, \ldots, M \tag{6}
$$

Here $u_i$ is the $i$-th file device number, $v_i$ is the serial number of the volume on it, and $t_{ik}$ is the unique text number of the $k$-th text to be searched on that volume. $M$ is the maximum number of text file devices.

[0074] On the other hand, when the whole collection is to be searched, a special symbol (for example, expression (7)) is sent for every text file.

【００７５】[0075]

$$
(\, u_i \;\; v_i \;\; \ast \,), \qquad i = 1, 2, \ldots, M \tag{7}
$$

The list (6) or (7) and the substring (for example, "モジニンシキ") are transferred from the control subsystem 100, via the bus adapter 172, to the memory 402 in the text search subsystem 400.
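
As an illustration of how lists of the form (6) and (7) might be assembled from TXT-LOC data, consider the sketch below. The function names and the two dictionaries standing in for TXT-LOC are hypothetical.

```python
from collections import defaultdict


def build_search_lists(txtids, volser_of, unit_of):
    """Group text numbers by (UNIT, VOLSER): entries (u_i, v_i, [t_i1, ...])."""
    groups = defaultdict(list)
    for t in txtids:
        v = volser_of[t]
        groups[(unit_of[v], v)].append(t)
    return [(u, v, ts) for (u, v), ts in sorted(groups.items())]


def whole_collection_lists(unit_of):
    """Expression (7): one (u_i, v_i, '*') entry per mounted volume."""
    return [(u, v, "*") for v, u in sorted(unit_of.items(), key=lambda kv: kv[1])]


volser_of = {11: "VOL001", 12: "VOL002", 13: "VOL001"}   # TXTID -> VOLSER
unit_of = {"VOL001": 0, "VOL002": 1}                     # VOLSER -> UNIT
print(build_search_lists([11, 12, 13], volser_of, unit_of))
print(whole_collection_lists(unit_of))
```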

[0076] In accordance with a predetermined program in the memory 402, the subsystem 400 (FIG. 4) performs kana-kanji conversion, variant-notation generation, synonym processing, and so on for the transferred substring. The kana-kanji conversion dictionary, the variant-notation generation rules, and the synonym dictionary are stored in the file 403.

[0077] Kana-kanji conversion yields "文字認識" (character recognition) from "モジニンシキ". Referring to the synonym dictionary further yields "文字読み取り" (character reading). Applying the variant-notation generation rules to these results yields, from "文字読み取り", the variants "文字読取り" and "文字読取", which differ in their okurigana (inflectional kana suffixes). Known techniques can be used for kana-kanji conversion and synonym generation.

[0078] The variant-notation generation rules handle variation such as okurigana and old-style character forms in personal names, and are expressed as rewriting rules such as the following.

【００７９】[0079]

(R1) XみYり → XYり | XY
(R2) XみYき → XYき | XY
(R3) XりYり → XYり | XY
(R4) XきYみ → XYみ | XY
⋮
(R101) XみYる → XYる
(R102) XりYる → XYる
(R103) XきYむ → XYむ
⋮
(R201) Xなる → Xる
⋮
(R501) 藤沢 → 藤澤
⋮
… (8)

Here X and Y stand for arbitrary kanji, and "|" separates alternatives. For variant-notation generation, the method disclosed in Japanese Unexamined Patent Publication No. Sho 60-150176 can also be used, for example.

[0080] Variant-notation generation determines whether the input string contains anything that matches the left-hand side of a rule in (8) and, if so, generates the right-hand side of that rule, with the matched kanji substituted for the variables X and Y.
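
A minimal sketch of this rule application, assuming the rules are encoded as regular expressions. The Unicode range used for "any kanji" and the single-pass application are simplifications chosen for the example; only rules R1 to R3 and R501 are included.

```python
import re

KANJI = "[\u4e00-\u9fff]"   # assumed stand-in for "any kanji"

# (left-hand side, right-hand side alternatives); \1 and \2 are X and Y.
RULES = [
    (f"({KANJI})み({KANJI})り", [r"\1\2り", r"\1\2"]),   # R1
    (f"({KANJI})み({KANJI})き", [r"\1\2き", r"\1\2"]),   # R2
    (f"({KANJI})り({KANJI})り", [r"\1\2り", r"\1\2"]),   # R3
    ("藤沢", ["藤澤"]),                                   # R501
]


def variants(word):
    """Return the word plus every variant produced by one rule application."""
    result = [word]
    for lhs, alternatives in RULES:
        if re.search(lhs, word):
            for rhs in alternatives:
                candidate = re.sub(lhs, rhs, word)
                if candidate not in result:
                    result.append(candidate)
    return result


print(variants("文字読み取り"))
print(variants("藤沢"))
```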

[0081] As a result of the above processing, the set of strings (文字認識, 文字読み取り, 文字読取り, 文字読取) is finally obtained for "モジニンシキ". This set is denoted by expression (9).

【００８２】[0082]

$$
(A_1 \cdots A_i \cdots A_n) =
\left( \begin{array}{l}
(a_{11}\; a_{12} \cdots a_{1m_1}) \\
\quad\vdots \\
(a_{i1}\; a_{i2} \cdots a_{im_i}) \\
\quad\vdots \\
(a_{n1}\; a_{n2} \cdots a_{nm_n})
\end{array} \right) \tag{9}
$$

Here $n$ is the number of strings, $m_i$ is the length of the $i$-th string, and $a_{ij}$ is the $j$-th character code from the beginning of the $i$-th string $A_i$.

[0083] The subsystem 400 then uses a predetermined program to convert the string set (9) into the state transition list (10), which represents the finite automaton described with reference to FIG. 2.

【００８４】[0084]

$$
\mathrm{alist} =
\left( \begin{array}{l}
(S_{j1}\; C_{k1}\; S_{l1}) \\
\quad\vdots \\
(S_{ji}\; C_{ki}\; S_{li}) \\
\quad\vdots \\
(S_{jm}\; C_{km}\; S_{lm})
\end{array} \right) \tag{10}
$$

Each element of alist in expression (10) means that, in state $S_{ji}$, when the character $C_{ki}$ is input (that is, matched), the state can move to $S_{li}$. In this expression, the states $\{S_{j1}, \ldots, S_{ji}, \ldots, S_{jm}\}$ may include states that are equal to one another.

[0085] In addition, the output list (11) is generated.

【００８６】[0086]

$$
\sigma\mathrm{list} =
\left( \begin{array}{l}
(S_{j1}\; A_{i1}) \\
\quad\vdots \\
(S_{jp}\; A_{ip}) \\
\quad\vdots \\
(S_{jn}\; A_{in})
\end{array} \right) \tag{11}
$$

Here $(S_{jp}\; A_{ip})$ means that the string $A_{ip}$ has been found at the moment state $S_{jp}$ is reached. This corresponds to what is generally called the output function of an automaton. FIG. 16 shows a PAD (Program Analysis Diagram) of the algorithm that derives the state transition list (10) and the output list (11) from the string set (9).
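
One way to derive alist and σlist from a string set is to build a keyword tree, as in the well-known Aho-Corasick construction. The sketch below does that; whether the PAD of FIG. 16 follows the same steps is an assumption, and the function name and state numbering are hypothetical.

```python
def build_lists(strings):
    """Build alist [(S, C, S')] and sigma_list [(S, A)] as a trie over strings."""
    alist = []
    sigma_list = []
    goto = {}            # (state, char) -> state, for lookups while building
    next_state = 1       # state 0 is the initial state S0
    for s in strings:
        state = 0
        for ch in s:
            if (state, ch) not in goto:
                goto[(state, ch)] = next_state
                alist.append((state, ch, next_state))
                next_state += 1
            state = goto[(state, ch)]
        sigma_list.append((state, s))
    return alist, sigma_list


alist, sigma_list = build_lists(["文字認識", "文字読取"])
for entry in alist:
    print(entry)
print(sigma_list)
```

The two strings share the prefix "文字", so they share states 1 and 2 and branch only at the third character.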

[0087] Next, the failure transition list (12) is created from the state transition list (10).
【００８８】[0088]

$$
\mathrm{flist} = (\, (S_0\; S_{j0}) \cdots (S_m\; S_{jm}) \,) \tag{12}
$$

An element $(S_m\; S_{jm})$ of flist specifies that, if alist (10) does not specify a next state for the character $C_k$ input in state $S_m$, flist is consulted and the state moves to $S_{jm}$. This is generally called a failure function.
【００８９】ｆlistを設ける目的は、部分文字列マッチ
ングにおいて、ある文字列の途中までマッチングが成功
したが次の文字が一致しない場合、すなわち所定の状態
遷移先が見つからない場合に、初期状態Ｓ₀に状態を戻
すことは一般に正しくない場合があることに対処するた
めである。例えば、２つの部分文字列｛文字認識，光学
的文字読取装置｝を探索することを想定する。いま、
「…光学的文字認識…」という文章を入力したとする
と、「光学的文字」までの部分が２番目の部分文字列に
一致するが、次の文字「認」がマッチングしない。ここ
でもし、状態をＳ₀にまで戻して、リセットしてしまう
と、オートマトンは「認識…」以降の文章を入力文字と
してしまうため、結局、「文字認識」という部分文字列
を見落してしまうことになる。従って、マッチングが失
敗した場合の遷移すべき状態はＳ₀ではなく、「文字認
識」の遷移パスの「字」までをマッチングした状態にす
る必要がある。The purpose of providing flist is that in the partial character string matching, if the matching is successful halfway through a character string but the next character does not match, that is, if a predetermined state transition destination is not found, the initial state S ₀ This is to deal with the fact that returning the state to is generally incorrect. For example, assume searching for two substrings {character recognition, optical character reader}. Now
If the sentence "... Optical character recognition ..." is input, the part up to "optical character" matches the second partial character string, but the next character "recognition" does not match. Here, if the state is reset to S ₀ and reset, the automaton will use the sentence after “recognition ...” as the input character, and eventually miss the partial character string “character recognition”. It will be. Therefore, when the matching fails, the state to be transited is not S ₀ , but the state up to “letter” in the transition path of “letter recognition” needs to be in the state of being matched.
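
The example above can be reproduced with a small sketch that computes failure transitions breadth-first and uses them during the search, in the manner of the Aho-Corasick algorithm. This is an illustration under that assumption, not the program of the embodiment; the function names and dictionary representations are hypothetical.

```python
from collections import deque


def build_automaton(strings):
    goto, out, n = {}, {}, 1
    for s in strings:
        state = 0
        for ch in s:
            if (state, ch) not in goto:
                goto[(state, ch)] = n
                n += 1
            state = goto[(state, ch)]
        out.setdefault(state, []).append(s)
    # Breadth-first pass: fail[t] is the state for the longest proper suffix
    # of the path to t that is also a prefix of some search string.
    fail = {}
    queue = deque()
    for (s, ch), t in goto.items():
        if s == 0:
            fail[t] = 0
            queue.append(t)
    while queue:
        r = queue.popleft()
        for (s, ch), t in goto.items():
            if s != r:
                continue
            f = fail[r]
            while f != 0 and (f, ch) not in goto:
                f = fail[f]
            fail[t] = goto.get((f, ch), 0)
            out.setdefault(t, []).extend(out.get(fail[t], []))
            queue.append(t)
    return goto, fail, out


def search(text, goto, fail, out):
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state != 0 and (state, ch) not in goto:
            state = fail[state]        # failure transition instead of a reset
        state = goto.get((state, ch), 0)
        for s in out.get(state, []):
            hits.append((pos - len(s) + 1, s))
    return hits


goto, fail, out = build_automaton(["文字認識", "光学的文字読取装置"])
print(search("光学的文字認識の研究", goto, fail, out))
```

With the failure transition, the state reached after "光学的文字" falls back to the state for "文字", so "文字認識" is still found at character offset 3.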

[0090] Next, the subsystem 400 transfers the state transition list alist, the output list σlist, and the failure transition list flist, created as described above, to the lower-level flexible string matching circuits FSM 501 to 503.

[0091] FIG. 17 shows a more detailed block diagram of the flexible string matching circuit 501. (The same applies to FSM 502 and 503.) The three lists alist, σlist, and flist are stored in a predetermined area of the memory 513 via the bus adapter 571. Using a predetermined microprogram, the microprocessor 511 generates from this information the extended finite automaton shown in FIG. 18(b), in the form of a state transition matrix.

[0092] The finite automaton that the lists alist and flist directly represent has the simple form shown in FIG. 18(a). The figure illustrates the two transitions in alist given by expression (13).

【００９３】[0093]

$$
\left. \begin{array}{l}
(S_j\; C_{k1}\; S_{l1}) \\
(S_j\; C_{k2}\; S_{l2})
\end{array} \right\} \tag{13}
$$

[0094] The microprocessor 511 extends the finite automaton shown in FIG. 18(a) into the form shown in FIG. 18(b). This conversion is uniquely determined. It makes it possible to find the specified substrings even in text that contains ambiguity. In the figure, $f(S_j)$ is the failure function built from the failure transition list flist and gives the destination state when matching fails in state $S_j$. State $W_j$ corresponds one-to-one to state $S_j$ and is the state in which an ambiguous string (a string enclosed by the symbols 〔 〕) is being scanned. States $T_{j1}$ and $T_{j2}$ correspond to the transitions out of state $S_j$; they are derived from state $W_j$ and are the states in which the character being searched for ($C_{k1}$ or $C_{k2}$ in the figure) has been found within the ambiguous string.
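
What "matching against ambiguous text" means can be shown with a deliberately naive reference sketch. It is not the extended automaton of FIG. 18(b), which handles bracketed candidates in a single pass through the states $W_j$ and $T_j$; here each bracketed group is simply treated as a set of alternatives for one character position, and every start position is tested. The function names are hypothetical.

```python
def parse_ambiguous(text):
    """Turn 'ab〔cd〕e' into [{'a'}, {'b'}, {'c', 'd'}, {'e'}]."""
    slots, group = [], None
    for ch in text:
        if ch == "〔":
            group = set()
        elif ch == "〕":
            slots.append(group)
            group = None
        elif group is not None:
            group.add(ch)
        else:
            slots.append({ch})
    return slots


def find_in_ambiguous(pattern, text):
    """Return slot positions where pattern matches some reading of the text."""
    slots = parse_ambiguous(text)
    hits = []
    for start in range(len(slots) - len(pattern) + 1):
        if all(pattern[k] in slots[start + k] for k in range(len(pattern))):
            hits.append(start)
    return hits


recognized = "フルテキス〔ト卜〕サ〔ー一−〕チ"
print(find_in_ambiguous("テキスト", recognized))
print(find_in_ambiguous("サーチ", recognized))
```

Positions are counted in character slots, so a bracketed group counts as one position.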

[0095] In practice, the microprocessor 511 can generate the state transition table shown in FIG. 19(a) directly from the two lists alist and flist. The columns (vertical) of the state transition table represent the current state, and the rows (horizontal) correspond to the character (code) input in that state. Each entry of the table gives the state to move to next. The algorithm that generates the state transition table can easily be inferred from the explanation given with FIG. 18, so its description is omitted.

[0096] The microprocessor 511 further converts the output list σlist into the form of the output table shown in FIG. 19(b) and records it, together with the state transition table, in a predetermined area of the memory 513.

[0097] The string search algorithm using the finite-state automaton described above is given below.

[0098] Here, the function with arguments (c, S) obtains the next state from the state transition table shown in FIG. 19(a), given the character c and the current state S. The function out(S) refers to the output table shown in FIG. 19(b) and determines whether state S has an output.
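
A search loop built on these two functions might look like the sketch below. The name `next_state` is an assumption (the source gives only the arguments (c, S)), and the three small tables are hand-built for the search strings "ab" and "bc" purely for illustration. The transition is computed on the fly from the lists here, whereas the embodiment reads it from a precomputed table.

```python
# Hand-built lists for the search strings "ab" and "bc" (illustrative only).
ALIST = {(0, "a"): 1, (1, "b"): 2, (0, "b"): 3, (3, "c"): 4}   # state transitions
FLIST = {1: 0, 2: 3, 3: 0, 4: 0}                               # failure transitions
SIGMA = {2: ["ab"], 4: ["bc"]}                                 # outputs


def next_state(c, S):
    """Next state for character c in state S (name assumed; source omits it)."""
    while S != 0 and (S, c) not in ALIST:
        S = FLIST[S]
    return ALIST.get((S, c), 0)


def out(S):
    """Search strings recognized on reaching state S (empty if none)."""
    return SIGMA.get(S, [])


def string_search(text):
    S, found = 0, []
    for position, c in enumerate(text):
        S = next_state(c, S)
        for word in out(S):
            found.append((position, word))   # position of the last character
    return found


print(string_search("xabcx"))
```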

[0099] In the above description a state is assigned to each one-character code, but when one character code occupies two bytes, as in Japanese, the code can be split into single bytes and the same method applied.

[0100] Next, the text search subsystem 400 accepts the lists (6) or (7) of unique numbers of the texts to be searched, sent from the upper level, and forwards to each FSM the list of unique text numbers that it is to search. Each FSM therefore obtains the unique numbers $(t_{i1}\; t_{i2}\; t_{i3} \cdots t_{in})$ if its text file contains search targets. The list of unique text numbers is stored in the memory 513 (FIG. 17). Following a predetermined program (see FIG. 20) in the microprogram memory 512, the microprocessor MPU 511 first determines the physical address at which each text resides. Unique text numbers and physical addresses are managed by the TEXT-DIR table described with FIG. 9, and the addresses can be determined by reading that table from the file 451.

[0101] The microprocessor 511 then reads each text from the file 451. The file control unit 531 feeds the text data (character strings) it reads, in sequence, into the FIFO (first-in, first-out) circuit 532. The microprocessor MPU 511 reads the FIFO 532 one character at a time and tests, according to the finite automaton (FIG. 18(b)) defined in the memory 513, whether any of the specified substrings is present. The string matching result blist (see FIG. 20) is returned to the memory 402 of the upper-level processor.

[0102] Following a predetermined program, the CPU 401 merges into one the lists, returned by the lower-level FSMs, of unique numbers of texts that satisfy the search condition, and transfers the result to the memory 102 in the upper-level control subsystem. From a unique text number, the main directory 153 (FIG. 5) can be consulted to identify the unique number DOC# of the document in which the substring matched, the unique number IMGID of the document image, the title TITLE, and so on.

[0103] These search results are returned to the terminal 800. While viewing the titles and other items on the CRT, the user can call up and display the image of a desired document on the same CRT.

[0104] Next, a second embodiment is described. In this embodiment only the construction of the flexible string matching circuit 501 differs. FIG. 21 is a block diagram of the flexible string matching circuit FSM in the second embodiment.

[0105] In the figure, the secondary storage device (text file) 461 has a plurality of heads from which signals can be read simultaneously; in this embodiment, data can be read from four heads at once. The data is transferred via the file control unit FCU 541 to the four FIFO circuits 551 to 554, respectively.

[0106] Meanwhile, the search condition sent from the upper-level subsystem 400 is translated by the microprocessor 511 and then transferred to the microprocessor units MPU1 to MPU4 (561 to 564), each of which contains its own data memory.

[0107] The text data read from the text file 461 passes through the FIFO circuits 551 to 554 and is read by the microprocessor units 561 to 564, respectively. The microprocessor units search the four character strings (text data) for the specified substrings in parallel and return the results to the microprocessor 511 via the data bus 521.

[0108] The other parts are the same as in the first embodiment, so their description is omitted.

[0109] Next, a third embodiment is described. In this embodiment the hardware configuration is the same as in the first or second embodiment, but the text search processing differs.

[0110] Consider a hierarchical search in which the documents to be searched are first narrowed down using keywords or classification codes. The documents screened in that step are, in general, often concentrated on the volume of one particular text file.

[0111] In the system of this embodiment, text data is stored redundantly on a plurality of text file volumes so that the parallelism of the hardware can be exploited. For texts stored on more than one volume, the CPU 401 (see FIG. 4) selects, according to a predetermined program, the volume to access so that the number of accesses is equalized across the volumes. With this scheme, all the flexible string matching circuits operate efficiently and the search as a whole becomes fast.
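
One simple way to equalize accesses across replicated volumes is a greedy assignment, sketched below. The greedy rule, the tie-breaking by volume name, and all identifiers are assumptions for illustration; the text states only that the number of accesses should be made even.

```python
def assign_volumes(replicas):
    """replicas: dict TXTID -> list of VOLSERs holding a copy of that text.

    Greedy choice: read each text from the replica volume with the fewest
    accesses assigned so far.
    """
    load = {}
    plan = {}
    for txtid in sorted(replicas):
        volser = min(replicas[txtid], key=lambda v: (load.get(v, 0), v))
        plan[txtid] = volser
        load[volser] = load.get(volser, 0) + 1
    return plan, load


replicas = {1: ["A", "B"], 2: ["A", "B"], 3: ["A", "C"], 4: ["A"]}
plan, load = assign_volumes(replicas)
print(plan)
print(load)
```

Without replication all four texts would be read from volume A; with it, the reads are spread over three volumes.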

[0112] In the embodiments above, the multiplicity of the flexible string search circuits is three or four, but the scheme of the present invention places no limit on the multiplicity.

[0113] The text search has been described as being performed uniformly over the whole document, but it can be extended so that information on page boundaries is recorded in the text with special symbols and the number of the page on which string matching succeeded is also output as part of the matching result; this scheme is also included in the present invention.
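
The page-number extension can be illustrated as follows. The form-feed character used as the page-boundary symbol is an assumption (the text does not name the symbol), and plain `str.find` stands in for the automaton to keep the example short.

```python
PAGE_BREAK = "\f"   # assumed page-boundary symbol; the patent does not name one


def find_with_pages(text, pattern):
    """Return the page number (1-based) of every occurrence of pattern."""
    pages = []
    start = text.find(pattern)
    while start != -1:
        # The page number is one more than the page breaks seen before the hit.
        pages.append(text.count(PAGE_BREAK, 0, start) + 1)
        start = text.find(pattern, start + 1)
    return pages


doc = "文字認識の概要\f画像処理\f文字認識の応用と文字認識の課題"
print(find_with_pages(doc, "文字認識"))
```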

[0114] Furthermore, although the description has dealt with Japanese text, the invention can be applied in exactly the same way to other languages such as English.

[0115] In the above embodiments the text data is extracted by character recognition, but the text content search method is clearly also applicable to text data entered manually or by other means, and this is included in the present invention.

[0116] Furthermore, the system has been described in the configuration shown in FIG. 4, but its essence is unchanged in a small system or a stand-alone system, and these are covered by the present invention. In particular, a small-scale search station could be built by loading text files and image files prepared on a separate system; this too is included in the present invention.

[0117] It goes without saying that search conditions can be combined with logical operators, and that the system can be extended to search for substrings that satisfy a given relative positional relationship. In particular, if the position at which each of several substrings was found is also output, advanced combinatorial searches can be carried out at high speed by post-processing.
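The post-processing idea can be sketched as follows: once the match positions of each substring are available, logical and positional conditions reduce to simple comparisons on those positions. The function names `find_all` and `within`, and the `max_gap` parameter, are hypothetical and chosen only for this illustration.

```python
def find_all(text, pattern):
    """Return every start offset of pattern in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions

def within(text, first, second, max_gap):
    """True if `second` starts at most max_gap characters after `first` ends."""
    second_hits = find_all(text, second)
    for p in find_all(text, first):
        end = p + len(first)
        if any(0 <= q - end <= max_gap for q in second_hits):
            return True
    return False

text = "full text search over document images uses string matching"

# Logical AND: both substrings occur somewhere in the text.
both = bool(find_all(text, "text")) and bool(find_all(text, "string"))
# Positional condition: "search" follows "text" with at most one character between.
near = within(text, "text", "search", max_gap=1)
far = within(text, "text", "string", max_gap=1)
print(both, near, far)  # True True False
```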

【０１１８】[0118]

[Effects of the Invention] As described above, the system of the present invention makes it possible to retrieve a desired document at high speed by referring to its contents, such as the body text, and to search efficiently even from concepts that were not envisaged when the document was registered. In particular, there is no longer any need to agonize at registration time over which classification codes or keywords are appropriate to assign. As a result, search accuracy can be raised while the rate of noise (irrelevant hits) is kept low.

[0119] Furthermore, parallelizing the inside of the text search subsystem enables high-speed body-text search. In particular, the speedup is achieved by attaching a string matching circuit to each read head.
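As a rough software analogy only, the parallel arrangement can be pictured as one worker per volume, each scanning its own data, with the results merged afterwards. The patent describes hardware matching circuits attached to read heads; the thread pool, the dictionary layout of a volume, and the names `search_volume` and `parallel_search` below are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def search_volume(volume, pattern):
    """Return ids of documents in one volume whose text contains pattern."""
    return [doc_id for doc_id, text in volume.items() if pattern in text]

def parallel_search(volumes, pattern):
    """Search every volume concurrently and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(volumes)) as pool:
        results = list(pool.map(lambda v: search_volume(v, pattern), volumes))
    return sorted(doc_id for hits in results for doc_id in hits)

# Three volumes, each a mapping from document id to body text.
volumes = [
    {"d1": "keyword search", "d2": "full text search"},
    {"d3": "image file"},
    {"d4": "text file"},
]
found = parallel_search(volumes, "text")
print(found)  # ['d2', 'd4']
```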

[0120] When searching a large-scale document file, the set of documents to be searched can first be reduced by keywords or bibliographic items before the text content search is performed, so that the search as a whole is efficient.
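The two-stage search can be sketched as a cheap filter on keywords followed by a body-text scan of the survivors only. The record layout (`id`, `keywords`, `text`) and the function name `hierarchical_search` are hypothetical.

```python
def hierarchical_search(documents, keyword, phrase):
    """Narrow by keyword first, then scan only the remaining body texts.

    Returns the matching document ids and the number of texts scanned.
    """
    candidates = [d for d in documents if keyword in d["keywords"]]
    matches = [d["id"] for d in candidates if phrase in d["text"]]
    return matches, len(candidates)

documents = [
    {"id": "d1", "keywords": {"retrieval"}, "text": "full text search of filed documents"},
    {"id": "d2", "keywords": {"retrieval"}, "text": "keyword based retrieval"},
    {"id": "d3", "keywords": {"recognition"}, "text": "full text search after recognition"},
]
matches, scanned = hierarchical_search(documents, "retrieval", "full text")
print(matches, scanned)  # ['d1'] 2
```

Only two of the three body texts are scanned here; in a large file the keyword stage would remove a much larger share of the documents before the costly text scan.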

[0121] In the prior art, obtaining text data from document images required a person to inspect the document recognition results one by one and correct the errors, but the present invention makes it possible to eliminate this human intervention. For that reason, text content search had not been realized in practice, and the present invention makes effective text content search possible.

[Brief description of drawings]

FIG. 1 is a diagram showing a document image and the result of document understanding.

FIG. 2 is a state transition diagram of the character strings of homonyms and synonyms generated from a substring.

FIG. 3 is a state transition diagram of a character string obtained as a character recognition result containing ambiguity.

FIG. 4 is a system configuration diagram of the first embodiment.

FIG. 5 is a diagram explaining a method of storing and managing documents, images, and text.

FIG. 6 is a diagram explaining a method of storing and managing documents, images, and text.

FIG. 7 is a diagram explaining a method of storing and managing documents, images, and text.

FIG. 8 is a diagram explaining a method of storing and managing documents, images, and text.

FIG. 9 is a diagram explaining a method of storing and managing documents, images, and text.

FIG. 10 is a block diagram of the document recognition device.

FIG. 11 is an explanatory diagram of the rectangular area enclosing a character pattern.

FIG. 12 is a diagram explaining how the contour shapes that describe a pattern are represented.

FIG. 13 is a diagram explaining the relationship between pattern components and character patterns.

FIG. 14 is a diagram showing the result of line segmentation by the bottom-up segmenter.

FIG. 15 is a diagram showing the result of column segmentation by the bottom-up segmenter.

FIG. 16 is an explanatory diagram of the algorithm for obtaining a state transition list from a set of character strings.

FIG. 17 is a block diagram of the flexible string matching circuit (FSM circuit).

FIG. 18 is a diagram of an extended finite state automaton that accepts ambiguous character strings.

FIG. 19 is a state transition table of the extended finite state automaton.

FIG. 20 is a diagram explaining the program of the FSM circuit.

FIG. 21 is a configuration diagram of the FSM circuit in the second embodiment.

[Explanation of symbols]

100 ... control subsystem, 200 ... input subsystem, 300 ... document recognition device, 400 ... text search subsystem, 800 ... search terminal subsystem, 501 ... flexible string matching circuit, 151 ... database file, 152 ... image file, 451 ... text file.

Continuation of the front page

(51) Int.Cl.⁶, identification code, internal reference number, FI, technical display location: G06K 9/20 J

(72) Inventor: Junichi Higashino, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji, Tokyo

(72) Inventor: Toshihiro Hananoi, c/o Odawara Works, Hitachi, Ltd., 2880 Kozu, Odawara, Kanagawa

Claims

[Claims]

1. A document filing system comprising: an image file for accumulating document images; a document knowledge file for storing layout rules of the document structure for each type of document; image processing means for extracting pattern components from a document image; character recognition means for recognizing a cut-out character pattern; document recognition means for analyzing the pattern components extracted by the image processing means with reference to the layout rules of the document structure stored in the document knowledge file, cutting out the character patterns that constitute characters for each document structure, and having the character recognition means recognize the cut-out character patterns to obtain character strings; storage means for storing the character strings obtained by the document recognition means in correspondence with each document structure; search means for identifying a document; and output means for outputting, from the image file, the document image of the document identified by the search means.

2. The document filing system according to claim 1, wherein the storage means has a database file for storing the bibliographic items in the document structure and a text file for storing text including the body of the document, and the document recognition means outputs the name of each document structure and the corresponding character string as a pair, the character string being stored in the database file or the text file according to the name of the document structure.

3. The document filing system according to claim 1, wherein the document knowledge file in the document recognition means stores, in addition to the layout rules of the document structure including the document title, author names, author affiliations, abstract, and body text, parametric knowledge such as font size.