JP2000020549A

JP2000020549A - Device for assisting input to document database system

Info

Publication number: JP2000020549A
Application number: JP10197991A
Authority: JP
Inventors: Hidemichi Fukazawa; 秀通深澤
Original assignee: PLANET COMPUTER KK
Current assignee: PLANET COMPUTER KK
Priority date: 1998-06-29
Filing date: 1998-06-29
Publication date: 2000-01-21

Abstract

PROBLEM TO BE SOLVED: To provide a device that supports data input, in a form that enables full-text search of text data, for a system that digitizes documents in a system-specific browsing format and builds a database from the digitized documents. SOLUTION: The device supports data input to a document database system 20. Document information supplied from an original document 1 or the like is read by a data reading means 11 and converted into a PDF file by a format converting means 12. Text data in the PDF file are extracted by a text data extracting means 13. The extracted text data are displayed on a screen, and a tag information adding means 14 adds arbitrary tag information to any character string specified by an operator, producing a tagged text file. A data output means 15 outputs the PDF file as document data for browsing and the text file as document data for retrieval, and both are registered in the database system 20. Full-text search is executed on the document data for retrieval.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to an input support device for a document database system, and more particularly to a device that supports the work of inputting data into a document database system having a full-text search function for text data.

[0002]

[Prior Art] In recent years, as personal computers have become more capable and less expensive, office documents have rapidly been converted into electronic files. In particular, the spread of high-density information recording media such as CD-ROMs and MO disks has made it possible to store an enormous amount of information on a single medium, so digitizing documents has become highly significant as a way of saving office space. At present, the most widely used format for digitizing and browsing documents is PDF (Portable Document Format), developed by Adobe of the United States. A document in PDF format can always be viewed and printed in the same way regardless of hardware, and the format is establishing itself as the de facto standard browsing format.

[0003] Another advantage of digitizing documents is that they become searchable by computer. In an ordinary database system, a predetermined keyword is generally defined for each piece of information and searches are performed using these keywords. In a database system for document information, however, it is also possible to perform a full-text search on the text data the documents contain, and as computer processing power has improved, document database systems with such a full-text search function have come into wide use.

[0004]

[Problems to Be Solved by the Invention] As described above, PDF is now widespread as the de facto standard format for digitizing and browsing documents, and many companies digitize their documents in PDF format. However, a document written in a browsing format such as PDF has a data structure peculiar to that format, so if it is registered in a database system as it is, a full-text search of its text data cannot be performed.

[0005] Accordingly, an object of the present invention is to provide a device that supports the work of inputting data, in a form that enables full-text search of text data, into a system that digitizes documents using a specific browsing format and builds a database from the digitized documents.

[0006]

[Means for Solving the Problems] (1) A first aspect of the present invention is an input support device for a document database system, used to support the work of inputting data into the document database system, comprising: data reading means for reading document information to be input as digital data; format conversion means for converting the read document information into a browsing format suitable for browsing by the document database system and creating browsing document data; text data extraction means for extracting text data from the browsing document data and creating search document data consisting of the extracted text data; and data output means for outputting the browsing document data and the search document data to the document database system.

[0007] (2) A second aspect of the present invention is the input support device for a document database system according to the first aspect, wherein the data reading means has a function of inputting document information on an original document prepared on paper or microfilm as image data consisting of a set of many pixels; the format conversion means has a function of applying format conversion processing to the input image data and creating browsing document data for producing a display corresponding to the image data; and the text data extraction means has a function of performing character recognition on a predetermined portion of the browsing document data and converting the recognized characters into text data, thereby creating the search document data.

[0008] (3) A third aspect of the present invention is the input support device for a document database system according to the first aspect, wherein the data reading means has a function of inputting document information containing predetermined character codes as digital data from a digital recording medium or a communication line; the format conversion means has a function of applying format conversion processing to the document information containing the character codes and creating browsing document data containing those character codes or substitute codes corresponding to them; and the text data extraction means has a function of extracting text data on the basis of the character codes or substitute codes in the browsing document data and creating the search document data.

[0009] (4) A fourth aspect of the present invention is the input support device for a document database system according to any of the first to third aspects, further comprising tag information adding means having: a function of displaying character strings on a display screen on the basis of the text data extracted by the text data extraction means; a function of having an operator designate an arbitrary area on the screen and input predetermined tag information; and a function of adding tag data representing the input tag information to the text data corresponding to the character string in the area designated by the operator, wherein the data output means outputs search document data containing the tag data.

[0010] (5) A fifth aspect of the present invention is the input support device for a document database system according to any of the first to fourth aspects, wherein the data reading means has a function of reading document information extending over a plurality of pages page by page, with the data for one page of document information forming one file, and storing the resulting files together in one folder; and the format conversion means has a function of applying format conversion processing to all files in a folder designated by the operator, adding page data identifying the page to the converted data of each file, and then combining the converted data into one file, thereby creating browsing document data corresponding to the multi-page document information.

[0011] (6) A sixth aspect of the present invention is a computer-readable recording medium on which is recorded a program that causes a computer to function as the input support device for a document database system according to any of the first to fifth aspects.

[0012]

[Embodiments of the Invention] The present invention is described below on the basis of the illustrated embodiment. FIG. 1 is a block diagram showing the basic configuration of an input support device 10 according to one embodiment of the present invention. The input support device 10 is used to support the work of inputting data into a document database system 20. The document database system 20 is a computer system that stores various document information as a database; a user can search for a desired document, browse it, and print it as needed. Recently, the document database system 20 has also sometimes been used to perform morphological analysis of sentences. Such a document database system 20 is already known, and a description of its internal structure is omitted here.

[0013] In general, when a database system is used to search for desired information, a keyword is defined for each piece of information and the search is performed using that keyword. For the text portion of a document (the portion consisting of character strings), however, it is possible to perform a search that examines the entire text for parts that exactly or partially match a given character string (a so-called full-text search). To perform such a full-text search, data in the form of text data must be stored in the document database system. Meanwhile, as noted above, PDF (Portable Document Format), developed by Adobe of the United States, is establishing itself as the de facto standard format for digitized documents, but PDF is not an ordinary text data format, so a full-text search cannot be performed on it as it is. The input support device 10 according to the present invention solves this problem and functions, so to speak, as a front-end processor for the document database system 20.

[0014] The input support device 10 of this embodiment has a function of converting document information supplied as an original document 1 on paper, an original document 2 on microfilm, a digital recording medium 3, a communication line 4, or the like into browsing document data in PDF format, and of extracting text data from the browsing document data to create search document data. With this function, the device can output search document data in text format to the document database system 20 together with the browsing document data in PDF format. As illustrated, the input support device 10 comprises data reading means 11, format conversion means 12, text data extraction means 13, tag information adding means 14, and data output means 15. These components are, however, the components obtained when the input support device 10 is regarded as a set of functional blocks. In practice, the input support device 10 can be realized by installing dedicated software on a general-purpose computer, and each component shown as a block in the figure is realized by the hardware and software of that computer. The portions realized by software can be recorded on a computer-readable recording medium and distributed. The functions of these components are described below.
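The flow of data between the functional blocks 12 to 15 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of the publication: the names `OutputPair` and `run_input_support` are hypothetical, and the conversion, extraction, and tagging steps are passed in as plain callables because the publication treats them as known techniques.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class OutputPair:
    """The two outputs that the data output means (15) passes to system 20."""
    browsing_pdf: bytes  # browsing document data (PDF file)
    search_text: str     # search document data (text file, possibly tagged)


def run_input_support(
    source: bytes,                                    # data already read by means 11
    convert_to_pdf: Callable[[bytes], bytes],         # stands in for means 12
    extract_text: Callable[[bytes], str],             # stands in for means 13
    add_tags: Optional[Callable[[str], str]] = None,  # stands in for means 14
) -> OutputPair:
    pdf = convert_to_pdf(source)   # browsing document data
    text = extract_text(pdf)       # text is extracted from the browsing data
    if add_tags is not None:       # tagging is performed only as needed
        text = add_tags(text)
    return OutputPair(browsing_pdf=pdf, search_text=text)
```

Note that the text is extracted from the converted PDF data rather than from the raw input, which mirrors the order of the blocks in FIG. 1.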

[0015] The data reading means 11 has a function of reading the document information to be input as digital data. In this embodiment, the document information may be supplied by any of the original document 1 on paper, the original document 2 on microfilm, the digital recording medium 3, and the communication line 4. To read document information given as the original document 1 on paper or the original document 2 on microfilm, the data reading means 11 includes a so-called scanner device. The scanner device scans the paper surface or the microfilm and inputs the document information on the original document as image data consisting of a set of many pixels. In this case, the character information contained in the document information is also input as image data. The data reading means 11 also has a drive device for reading document information recorded on the digital recording medium 3 (a floppy disk, magnetic disk, optical disk, magneto-optical disk, or the like), and communication equipment (a modem, terminal adapter, network connection device, or the like) for reading document information supplied via the communication line 4. When document information (specifically, document information created by application software such as word processing or spreadsheet software) is supplied via the digital recording medium 3 or the communication line 4, the character information contained in it is normally read as predetermined character codes.

[0016] The document information read as digital data in this way is passed to the format conversion means 12. The format conversion means 12 has a function of converting the read document information into a browsing format suitable for browsing by the document database system 20 and creating browsing document data. In this example, the document database system 20 adopts PDF as its browsing format, so the format conversion means 12 converts the content of the read document information into a PDF file and outputs this PDF file as the browsing document data. That is, for document information read as image data, browsing document data for producing a display corresponding to that image data is created; for document information read as character codes, browsing document data is created that contains those character codes as they are or substitute codes corresponding to them. Since conversion to PDF format is also a known technique, the specific method of conversion is not discussed here.

[0017] The PDF file output from the format conversion means 12, that is, the browsing document data, is passed to the data output means 15 and also to the text data extraction means 13. The text data extraction means 13 has a function of extracting text data from the browsing document data and outputting the extracted text data as search document data (a text file). In other words, the character strings contained in the original document information are extracted as a text file. As described above, character strings contained in document information given as the original document 1 on paper or the original document 2 on microfilm are read as image data. The text data extraction means 13 has a function of recognizing characters in image data composed of many pixels and outputting text data corresponding to the recognized characters. If the image data contains information other than characters (pictures, photographs, and so on), the operator can be asked to designate which parts of the image data are character information and which are not. For example, if the document information is displayed on the display screen on the basis of the browsing document data (PDF file) and the operator designates the portion containing character strings, character recognition is executed on the designated portion and a text file of the recognized character strings is output as the search document data. Of course, if an algorithm that automatically distinguishes character information from other information in image data is provided, text data can be extracted automatically from the portions containing character strings without any instruction from the operator. On the other hand, character strings contained in document information supplied via the digital recording medium 3 or the communication line 4 have been taken into the browsing document data (PDF file) as predetermined character codes or their substitute codes, so these codes can simply be extracted as text data to create the search document data. The technique of extracting text data from a given PDF file is also already known, and a detailed description is omitted here.
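The two extraction paths described above (character recognition for scanned pages, direct use of character codes for pages that originated as coded text) can be sketched as a simple branch. The `Page` structure, the `Region` tuple layout, and the `recognize` callable are assumptions made for this illustration; the publication does not specify a data model or a particular character recognition engine.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# (left, top, right, bottom) in pixels; a hypothetical layout for operator-designated areas
Region = Tuple[int, int, int, int]


@dataclass
class Page:
    coded_text: str = ""   # character codes carried in the browsing data, if any
    image: bytes = b""     # pixel data for pages that were scanned
    text_regions: List[Region] = field(default_factory=list)  # areas marked as text


def extract_search_text(pages: List[Page],
                        recognize: Callable[[bytes, Region], str]) -> str:
    """Build search document data from pages of browsing document data."""
    parts: List[str] = []
    for page in pages:
        if page.coded_text:
            # Coded text: the character codes are taken over as text data directly.
            parts.append(page.coded_text)
        else:
            # Scanned image: recognize characters only in the designated regions.
            parts.extend(recognize(page.image, region) for region in page.text_regions)
    return "\n".join(parts)
```

Passing the recognizer in as a parameter keeps the sketch independent of any specific OCR product, which matches the publication's treatment of character recognition as a known technique.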

[0018] The text file output from the text data extraction means 13, that is, the search document data, is passed to the data output means 15 and also to the tag information adding means 14. The tag information adding means 14 has a function of adding tag information to the text data in the search document data as needed, and outputs search document data containing tag data (a tagged text file). This tag-adding function is described later.

[0019] Thus, the data output means 15 receives the browsing document data (PDF file) and either the search document data (text file) or the search document data containing tag data (tagged text file). The data output means 15 has a function of outputting these data to the document database system. In short, by using the input support device 10, the content of document information supplied in various forms is converted into a PDF file and provided as browsing document data, while the information on the character strings in it is extracted as text data and provided as search document data. The document database system 20 need only register the browsing document data and the search document data supplied in this way in the database as a pair. Of course, a keyword specific to each item of document information may be defined as needed and the keyword data also registered in the database.

[0020] In the document database system 20 built in this way, document information can be searched using the defined keywords, and a full-text search for a given character string can also be performed. That is, when a full-text search is requested with a given character string as the search target, the system searches the search document data (text file) for that character string and, on a hit, presents the corresponding browsing document data (PDF file) as the search result.
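The pairing of search document data with browsing document data can be illustrated as follows. This is a minimal sketch in which an in-memory dictionary stands in for the database; the `Registry` type and the function name `full_text_search` are hypothetical, and a plain substring test is used as the simplest form of partial matching.

```python
from typing import Dict, List, Tuple

# document id -> (path of browsing PDF, search text); a stand-in for the database
Registry = Dict[str, Tuple[str, str]]


def full_text_search(registry: Registry, query: str) -> List[str]:
    """Search the text files, but answer with the paired PDF files."""
    return [pdf_path for pdf_path, text in registry.values() if query in text]
```

For example, with a registry containing `("a.pdf", "full-text search of text data")` and `("b.pdf", "keyword only")`, the query `"text data"` selects only `a.pdf`: the match is found in the text file, and the PDF registered with it is what the user is shown.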

[0021] Next, the function of the tag information adding means 14 is described in more detail. Tag information is information added to a specific character string to indicate that the character string has a specific attribute. Here, the concrete case of registering book cards used for managing books in the document database system 20 is taken as an example. Consider reading data with the input support device 10 from book cards on which five items are written, namely book title (journal title), author, publisher, year of publication, and summary, and registering the resulting browsing document data and search document data in the document database system 20. In this case, the search document data consists of text data such as that shown in FIG. 2. In this example, items are separated by commas and book cards are separated by a carriage return "CR". As already stated, once such text data is available, a full-text search for any character string is possible.
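The record layout described for FIG. 2 (items separated by commas, cards separated by a carriage return) can be read back with two splits. FIG. 2 itself is not reproduced in this text, so the sample cards used with this sketch are invented for illustration, and the function name `parse_cards` is hypothetical.

```python
from typing import List


def parse_cards(search_text: str) -> List[List[str]]:
    """Split search text into cards (CR-separated) and items (comma-separated)."""
    cards: List[List[str]] = []
    for record in search_text.split("\r"):
        if record:  # skip the empty piece that follows a trailing CR
            cards.append(record.split(","))
    return cards
```

A card whose summary itself contains a comma would be split incorrectly by this sketch; the publication does not say how such cases are handled.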

[0022] There are cases, however, where one wants to perform a full-text search by some attribute rather than by the character string itself. In the example of FIG. 2, the first card uses the wording "book title" (書名), whereas the second card uses the wording "journal title" (誌名). Both express the concept of "title of the book", but finding them by full-text search requires a conditional search such as "book title" OR "journal title". If the attribute "title of the book" is assigned to the character strings "book title" and "journal title" when they are registered in the document database system 20 as search document data, a full-text search can conveniently be performed by this attribute. The tag information adding means 14 has a function that makes this attribute assignment easy.

[0023] First, the tag information adding means 14 has a function of displaying character strings on the display screen on the basis of the text data extracted by the text data extraction means 13. The operator can therefore check character strings such as those shown in FIG. 2 on the display screen. The tag information adding means 14 also has a function of having the operator designate an arbitrary area on this screen. For example, the operator may designate a rectangular area 31 as shown in FIG. 3 using a pointing device such as a mouse. Designating an area amounts to designating a specific character string; in the example of FIG. 3, the rectangular area 31 designates the character string "book title". The tag information adding means 14 further has a function of accepting predetermined tag information for the character string designated in this way. For example, if the operator types the tag information "TITLE" on the keyboard while the character string "book title" is designated as shown in FIG. 3, the tag information "TITLE" is added to the character string "book title". If the operator next designates a rectangular area 32 as shown in FIG. 4 and again types the tag information "TITLE" on the keyboard, the tag information "TITLE" is also added to the character string "journal title".

[0024] In practice, tag information is added by inserting tag data into the text data. In this embodiment, predetermined tag data is inserted before and after the text data corresponding to the target character string. For example, when tag information is added to the character string "book title", predetermined tag data T1 and T2 are inserted before and after the text data D corresponding to the character string "book title", as shown in FIG. 5. If search document data consisting of such a tagged text file is created and registered in the document database system 20, a full-text search specifying character strings that carry the tag information "TITLE" becomes possible, and in the above example the character strings "book title" and "journal title" can both be retrieved. The above is of course only one example of how tag information can be used; tag information can be put to various other uses.
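Inserting tag data before and after a designated character string, and then searching by tag, can be sketched as below. The publication does not give the literal form of the tag data T1 and T2, so the SGML-style `<TITLE>` and `</TITLE>` markers are an assumption of this example, as are the function names `add_tag` and `find_tagged`. The character offsets stand in for the rectangular area the operator designates on screen.

```python
import re
from typing import List


def add_tag(text: str, start: int, end: int, tag: str) -> str:
    """Insert opening and closing tag data around text[start:end].

    The <TAG>...</TAG> syntax is assumed; the source only says that tag data
    T1 and T2 are placed before and after the text data D.
    """
    return f"{text[:start]}<{tag}>{text[start:end]}</{tag}>{text[end:]}"


def find_tagged(text: str, tag: str) -> List[str]:
    """Return every character string that carries the given tag."""
    name = re.escape(tag)
    pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
    return pattern.findall(text)
```

Tagging "book title" on one card and "journal title" on another with the same tag `TITLE` lets a single query by tag return both strings, which is the effect the paragraph describes.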

【００２５】最後に、本実施形態の入力支援装置１０の
もつ付加的な機能を述べておく。この付加的な機能によ
れば、特に、複数頁からなる文書情報を最終的に１つの
ＰＤＦファイルにまとめることが可能になる。そのため
に、データ読込手段１１は、１頁分の文書情報に対応す
るデータをそれぞれ１ファイルとして、複数頁にわたる
文書情報を頁単位で読み込み、複数のファイルを１フォ
ルダに収容して格納する機能を有している。たとえば、
全ｎ頁にわたる文書情報が与えられた場合、図６に示す
ように、１頁目の文書情報はファイルＦ１１、２頁目の
文書情報はファイルＦ１２、３頁目の文書情報はファイ
ルＦ１３、…、ｎ頁目の文書情報はファイルＦ１ｎとい
うように、１頁／１ファイルで読み込まれ、このｎ個の
ファイルは１つのフォルダＦＦに収容される。たとえ
ば、ｍ冊の本についての文書情報を入力した場合、この
ようなフォルダが合計ｍ個作成されることになる。Finally, additional functions of the input support device 10 of this embodiment are described. These functions make it possible, in particular, to combine document information spanning a plurality of pages into a single PDF file. To this end, the data reading means 11 has a function of reading multi-page document information page by page, treating the data for one page of document information as one file, and storing the resulting files together in one folder. For example, when document information spanning n pages is given, it is read at one page per file as shown in FIG. 6: the first page becomes file F11, the second page file F12, the third page file F13, ..., and the n-th page file F1n. These n files are stored in one folder FF. For example, when document information for m books is input, a total of m such folders are created.
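The one-page-per-file, one-folder-per-document layout described above might look like the following minimal sketch. The function name `read_document`, the file naming scheme, and the sample byte strings are hypothetical; the document does not specify how files or folders are named.

```python
import tempfile
from pathlib import Path


def read_document(pages: list[bytes], folder: Path) -> list[Path]:
    """Store each page as its own file inside one folder (one folder per document)."""
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for page_number, data in enumerate(pages, start=1):
        # Zero-padded names (an assumption) keep files in page order when sorted.
        path = folder / f"page_{page_number:04d}.dat"
        path.write_bytes(data)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as root:
    # Two hypothetical books (m = 2) with three and two pages respectively.
    books = {"book_a": [b"p1", b"p2", b"p3"], "book_b": [b"p1", b"p2"]}
    for name, pages in books.items():
        read_document(pages, Path(root) / name)
    folders = sorted(p.name for p in Path(root).iterdir())
    files_in_a = sorted(p.name for p in (Path(root) / "book_a").iterdir())
```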

【００２６】フォーマット変換を行う場合、オペレータ
は、変換対象となるフォルダを指定するだけでよい。す
なわち、フォーマット変換手段１２は、オペレータが指
定したフォルダ内の全ファイルに対してフォーマット変
換処理を施す機能を有している。したがって、図６の例
の場合、オペレータがフォルダＦＦを指定して変換指示
を与えると、このフォルダＦＦ内に収容されていたｎ個
のファイルのすべてに対して変換が行われ、ｎ個のＰＤ
ＦファイルＦ２１〜Ｆ２ｎが作成されることになる。し
かも、各ＰＤＦファイルＦ２１〜Ｆ２ｎは、それぞれ頁
を特定するための頁データが付加された後、１つのファ
イルＦ３０にまとめられる。こうして得られるファイル
Ｆ３０は、複数ｎ頁にわたる文書情報に対応した閲覧用
文書データとなる。When performing format conversion, the operator need only designate the folder to be converted. That is, the format conversion means 12 has a function of performing format conversion processing on all files in the folder designated by the operator. In the example of FIG. 6, when the operator designates folder FF and issues a conversion instruction, all n files in folder FF are converted and n PDF files F21 to F2n are created. Furthermore, page data identifying the page is added to each of the PDF files F21 to F2n, after which they are combined into one file F30. The resulting file F30 serves as browsing document data corresponding to the n pages of document information.
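The folder-level conversion and merge can be outlined as below. This is a sketch under stated assumptions, not the conversion used by the device: `convert_page` is a hypothetical stand-in for the real conversion to PDF, the merged file F30 is modelled as a list of (page number, converted data) pairs, and the file names are invented for the example.

```python
import tempfile
from pathlib import Path


def convert_page(raw: bytes) -> bytes:
    """Hypothetical stand-in for the format conversion of one page file."""
    return b"CONVERTED:" + raw


def convert_folder(folder: Path) -> list[tuple[int, bytes]]:
    """Convert every file in `folder`, attach page data, and merge the results."""
    page_files = sorted(p for p in folder.iterdir() if p.is_file())
    merged = []
    for page_number, path in enumerate(page_files, start=1):
        # The page number plays the role of the page data added before merging.
        merged.append((page_number, convert_page(path.read_bytes())))
    return merged


with tempfile.TemporaryDirectory() as tmp:
    folder_ff = Path(tmp)
    # Three hypothetical page files standing in for F11 to F1n.
    for i, content in enumerate([b"page one", b"page two", b"page three"], start=1):
        (folder_ff / f"page_{i:04d}.dat").write_bytes(content)
    f30 = convert_folder(folder_ff)
```

The operator-facing input is only the folder path, which mirrors the point made above that designating the folder is sufficient to convert every page it contains.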

【００２７】[0027]

【発明の効果】以上のとおり本発明に係る文書データベ
ースシステムへの入力支援装置によれば、閲覧用文書デ
ータとともに検索用文書データが作成されるので、テキ
ストデータに対する全文検索が可能になるような態様で
データを入力する作業を支援することができるようにな
る。As described above, the input support device for a document database system according to the present invention creates search document data together with browsing document data, and can therefore support the work of inputting data in a form that enables full-text search of the text data.

[Brief description of the drawings]

【図１】本発明の一実施形態に係る入力支援装置１０の
基本構成を示すブロック図である。FIG. 1 is a block diagram showing a basic configuration of an input support device 10 according to an embodiment of the present invention.

【図２】図１に示すテキストデータ抽出手段１３によっ
て抽出されたテキストデータの一例を示す図である。FIG. 2 is a diagram showing an example of text data extracted by the text data extraction means 13 shown in FIG. 1.

【図３】図２に示すテキストデータをディスプレイ画面
上に表示させ、矩形領域３１を指定した状態を示す図で
ある。FIG. 3 is a diagram showing a state in which the text data shown in FIG. 2 is displayed on a display screen and a rectangular area 31 is designated.

【図４】図２に示すテキストデータをディスプレイ画面
上に表示させ、矩形領域３２を指定した状態を示す図で
ある。FIG. 4 is a diagram showing a state in which the text data shown in FIG. 2 is displayed on a display screen and a rectangular area 32 is designated.

【図５】図１に示すタグ情報付加手段１４によるタグデ
ータの付加方法の一例を示す図である。FIG. 5 is a diagram showing an example of a method of adding tag data by the tag information adding means 14 shown in FIG. 1.

【図６】図１に示す入力支援装置１０による複数頁の文
書処理を示す図である。FIG. 6 is a diagram showing the processing of a multi-page document by the input support device 10 shown in FIG. 1.

[Explanation of symbols]

１…紙による原文書２…マイクロフィルムによる原文書３…デジタル記録媒体４…通信回線１０…入力支援装置１１…データ読込手段１２…フォーマット変換手段１３…テキストデータ抽出手段１４…タグ情報付加手段１５…データ出力手段２０…文書データベースシステム３１，３２…矩形領域Ｄ…テキストデータＦ１１〜Ｆ１ｎ…文書情報ファイルＦ２１〜Ｆ２ｎ…ＰＤＦファイルＦ３０…複数頁をまとめたＰＤＦファイルＦＦ…フォルダＴ１，Ｔ２…タグデータ
1 ... Original document on paper
2 ... Original document on microfilm
3 ... Digital recording medium
4 ... Communication line
10 ... Input support device
11 ... Data reading means
12 ... Format conversion means
13 ... Text data extraction means
14 ... Tag information adding means
15 ... Data output means
20 ... Document database system
31, 32 ... Rectangular areas
D ... Text data
F11 to F1n ... Document information files
F21 to F2n ... PDF files
F30 ... PDF file combining multiple pages
FF ... Folder
T1, T2 ... Tag data

Claims

[Claims]

1. An input support device for a document database system, being a device for supporting the work of inputting data to a document database system, comprising: data reading means for reading document information to be input as digital data; format conversion means for converting the read document information into a browsing format suitable for browsing in the document database system and creating browsing document data; text data extraction means for extracting text data from the browsing document data and creating search document data including the extracted text data; and data output means for outputting the browsing document data and the search document data to the document database system.

2. The input support device for a document database system according to claim 1, wherein the data reading means has a function of inputting document information on an original document prepared on paper or microfilm as image data composed of a set of a large number of pixels; the format conversion means has a function of performing format conversion processing on the image data and creating browsing document data for performing display corresponding to the image data; and the text data extraction means has a function of performing character recognition on a predetermined portion of the browsing document data and converting the recognized characters into text data to create search document data.

3. The input support device for a document database system according to claim 1, wherein the data reading means has a function of inputting document information including predetermined character codes as digital data from a digital recording medium or a communication line; the format conversion means has a function of performing format conversion processing on the document information including the character codes and creating browsing document data including the character codes or substitute codes corresponding to the character codes; and the text data extraction means has a function of extracting text data based on the character codes or the substitute codes in the browsing document data and creating search document data.

4. The input support device for a document database system according to claim 1, further comprising tag information adding means having a function of displaying a character string on a display screen based on the text data extracted by the text data extraction means, a function of allowing an operator to designate an arbitrary area on the display screen and input predetermined tag information, and a function of adding tag data indicating the tag information to the text data corresponding to the character string in the area designated by the operator, wherein search document data including the tag data is output from the data output means.

5. The input support device for a document database system according to claim 1, wherein the data reading means has a function of reading document information spanning a plurality of pages page by page, with the data corresponding to one page of document information as one file, and storing the plurality of files in one folder; and the format conversion means has a function of performing format conversion processing on all files in the folder designated by the operator, adding page data for specifying the page to each item of converted data, and then combining the converted data into one file to create browsing document data corresponding to the document information spanning the plurality of pages.

6. A computer-readable recording medium in which a program for causing a computer to function as the input support device according to claim 1 is recorded.