JP2002049638A

JP2002049638A - Document information retrieval device, method, document information retrieval program and computer readable recording medium storing document information retrieval program

Info

Publication number: JP2002049638A
Application number: JP2001131097A
Authority: JP
Inventors: Seiichiro Abe; 静一郎阿部
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2000-05-26
Filing date: 2001-04-27
Publication date: 2002-02-15

Abstract

PROBLEM TO BE SOLVED: To quickly retrieve, by a simple operation, documents similar to a document that is not registered in a retrieval database. SOLUTION: In this document information retrieval device, a server 10 retrieves document information and responds on the basis of a retrieval request from a client 12. When a document file is designated as the retrieval condition, a retrieval condition designating part 26 in the client 12 transmits the contents of the designated file through a network. A document retrieving part 30 in a retrieval machine 20 installed on the server 10 side generates keywords from the file contents transmitted from the retrieval condition designating part 26, and retrieves similar documents using an index (a string of important words extracted from each retrieval subject document 25).

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

【Technical Field of the Invention】 The present invention relates to a document information retrieval device and method for quickly finding a required document in a large volume of document data, and to a computer-readable recording medium storing a document information retrieval program. In particular, it relates to a document information retrieval device, method, and recording medium that find documents with similar content through the simple operation of designating a document file itself as the search condition.

【０００２】[0002]

【Prior Art】 Conventionally, document management systems operating in a network environment have provided document information retrieval devices that let a user search the large volume of document data on the Internet or an Ethernet (R) network and quickly refer to the documents needed.

【０００３】 In such a document search, the user designates as keywords one or more suitable words or character strings thought to be contained in the required document. Documents containing the designated keywords are retrieved from a search database, and a document list is displayed as the search result.

【０００４】 For the search target documents on the network, such a device extracts important words from the content of each document, creates for each document an index listing those words, and stores the index in the search database. When a user issues a search request designating keywords, the device searches the indexes in the search database and returns a document list as the search result.

【０００５】 Furthermore, in a conventional document information retrieval device, after the user has found a seemingly relevant document in the document list obtained by a keyword search, the user can designate a similar-document search for a document chosen from that list. Terms that appear frequently in the chosen document are then extracted automatically and added, under a logical OR condition, to the previously executed search condition, and a search for similar documents is performed.

【０００６】[0006]

【Problem to Be Solved by the Invention】 However, when a user obtains a document of interest by e-mail or over the Internet and wants to search for documents with similar content, the user currently has to select words and character strings contained in the obtained document, designate them one by one as keywords, and first obtain a document list as a search result. The user must then select a document from that list and designate a similar-document search in order to find similar documents.

【０００７】 That is, even when the user wants to run a similarity search for a document obtained by e-mail or over the Internet, a conventional document information retrieval device can run a similar-document search with a document as the search condition only for documents already registered in the search database. The user cannot use a document obtained by e-mail or over the Internet directly as the search condition.

【０００８】 The user therefore has to pick, from the document obtained by e-mail or over the Internet, keywords that seem necessary for the search and enter them as the search condition; when there are many keywords, entering them is laborious. Moreover, if the keywords are not specified adequately, relevant documents are missed and the expected search results may not be obtained.

【０００９】 In addition, the number of hits in the document list can be enormous, and opening seemingly related documents from the list to find the required one can take a great deal of effort.

【００１０】 An object of the present invention is to provide a document information retrieval device and method, and a computer-readable recording medium storing a document information retrieval program, that can quickly find documents similar to a document not registered in the search database through a simple operation.

【００１１】[0011]

【Means for Solving the Problem】 FIG. 1 is a diagram illustrating the principle of the present invention. The present invention is a document information retrieval device in which a search side such as a server 10 retrieves document information and responds, based on a search request sent over a network from a client 12 or the like. The request source, such as the client 12, is provided with a search condition specifying unit 26 that, when a file is designated as the search condition, transmits the content of the designated file over the network. The search side, such as the server 10, is provided with a search machine 20 that generates keywords from the file content transmitted by the search condition specifying unit 26 and searches for similar documents.

【００１２】 Thus, when a user obtains a document with content of interest by e-mail, over the Internet, or the like, and wants to find documents with similar content, the user can do so by designating the document so that the uploaded file serves as the search condition. Even a document not registered in the database can therefore be freely designated as a search condition, the laborious entry of keywords based on the document content becomes unnecessary, and similar documents can be found simply and quickly.

【００１３】 The search condition specifying unit 26 at the request source transmits the leading portion of the designated file content. Because the important keywords needed for a document search usually occur most often near the beginning of a document, only the leading portion of the file content, for example the first 1 KB, is transmitted as the search condition. Since document files used as search conditions vary in size, fixing the amount of file data to transmit reduces both the communication load and the processing on the search side.

【００１４】 The files that the search condition specifying unit 26 can designate as a search condition include HTML files and Excel files. Files of any other format are of course also included, provided that text can be extracted from them.

【００１５】 The search machine 20 on the server 10 side is provided with a database 22 that stores, for each document, index information listing the important words extracted from the search target documents. The file designation search unit 30 of the search machine 20 comprises: a text extraction processing unit 36 that extracts text from the file content received with the search request; a morphological analysis unit 38 that extracts nouns by morphological analysis of the text; a keyword generation unit 40 that extracts important words from the nouns and generates keywords joined by logical OR; and a search execution unit 42 that searches the search database 22 with the keywords to find similar documents and notifies the client of the search result.

【００１６】 The keyword generation unit 40 counts, for each noun, the number H of documents in which the noun appears among the per-document indexes stored in the search database 22, and generates the keywords by selecting a predetermined number of top-ranked words whose occurrence count H lies within a predetermined range.

【００１７】 Where N is the number of documents in the index, the keyword generation unit 40 generates the keywords by selecting, for example, the top 10 words whose occurrence count H satisfies 2N/3 ≥ H ≥ 1. This narrows the keywords down to the important words needed for a similarity search of the existing documents registered in the database index, and improves the precision of the similarity search.

【００１８】 Further, the keyword generation unit 40 includes in the keywords property information extracted from the file received with the search request. The property information here is, for example, the author and document title of the received file. Adding the file's property information to the search condition in this way allows similar documents to be narrowed down appropriately when, for example, the user wants to restrict the search to a particular author.

【００１９】 The search condition specifying unit 26 at the request source is provided by the WWW browser 16 of the client 12. It transmits the content of the file designated on the search request screen of the WWW browser 16 over the network to the WWW server 18, which passes it to the search machine 20.

【００２０】 The present invention also provides a search machine 20 serving as a document information retrieval device on the search side, such as a server. The document information retrieval device embodied as the search machine 20 comprises: a search database 22 that stores, for each document, index information listing the important words extracted from the search target documents; a text extraction processing unit 36 that extracts text from the file content received with a search request, arriving over the network, that designates a document file as the search condition; a morphological analysis unit 38 that extracts nouns by morphological analysis of the text; a keyword generation unit 40 that extracts important words from the nouns and generates keywords joined by logical OR; and a search execution unit 42 that searches the search database with the keywords to find similar documents and notifies the request source of the search result.

【００２１】 The present invention also provides a document information retrieval method in which a search machine side, such as a server, retrieves document information and responds, based on a search request sent over a network from a request source such as a client. The method is characterized by: storing, for each document, index information listing the important words extracted from the search target documents in a search database of the server; when a document file is designated as the search condition, transmitting the content of the designated file to the search side over the network together with the search request; and, on the search side, extracting text from the file content received with the search request, extracting nouns by morphological analysis of the text, extracting important words from the nouns and generating keywords joined by logical OR, searching the search database with the keywords to find similar documents, and notifying the client of the search result. The details of the method are essentially the same as those of the device configuration.

【００２２】 The present invention further provides a computer-readable recording medium storing a document information retrieval program. The program comprises the steps of: receiving a search request that designates a document file as the search condition; extracting text from the file content received with the search request; extracting nouns by morphological analysis of the text; extracting important words from the nouns and generating keywords joined by logical OR; and searching a database with the keywords to find similar documents and notifying the request source of the search result.

【００２３】 The present invention further provides a document information retrieval program. The program is characterized by causing a computer to execute the steps of: receiving a search request that designates a document file as the search condition; extracting text from the file content received with the search request; extracting nouns by morphological analysis of the text; extracting important words from the nouns and generating keywords joined by logical OR; and searching a database with the keywords to find similar documents and notifying the request source of the search result.

【００２４】[0024]

【Embodiments of the Invention】 FIG. 2 shows the system configuration of a document information retrieval device according to the present invention, taking as an example a client-server search system built on the Internet or Ethernet (R).

【００２５】 In FIG. 2, a user-side client 12 is connected to the server 10 via the Internet/intranet 14. The client 12 is provided with a WWW browser 16 for searching. Using the WWW browser 16, the client issues document information search requests to the server 10 and displays the search results from the server 10.

【００２６】 The server 10 is provided with a WWW server 18, a search machine 20, and a document database 24. The search machine 20 holds a search database 22, and the document database 24 stores search target documents 25. External document management servers 44 and 48 are also connected to the WWW server 18. These servers have document databases 46 and 50, respectively, each storing search target documents 25.

【００２７】 The WWW server 18 provided in the server 10 receives a search request from the browser 16 and asks the search machine 20 to perform the search. It also returns the search result that comes back from the search machine 20 to the browser 16 for display.

【００２８】 To process full-text searches at high speed, the search database 22 serves as a repository that manages an index built from the sets of important words appearing in the search target documents. The index also records each document's name and storage location. When a search request arrives from the browser 16, the search machine 20 runs the search against the index in the search database 22.

【００２９】 The document database 24 stores the search target documents 25 collected from the document management servers 44 and 48, and the index in the search database 22 is created from the search target documents 25 in this document database.

【００３０】 In this client-server search system, the user specifies a search condition using the browser 16 of the client 12, and the condition is sent via the Internet/intranet 14 to the WWW server 18 on the server 10 side. The WWW server 18 forwards the search condition contained in the received search request to the search machine 20.

【００３１】 The search machine 20 searches the search database 22 for documents matching the search condition and notifies the WWW server 18 of the result. The WWW server 18 sends the search result from the search machine 20 to the browser 16 of the client 12 for display.

【００３２】 The user views the search result rendered by the browser 16 and, by selecting a link given in the result, can retrieve the desired search target document 25 via the WWW server 18 and view its contents.

【００３３】 FIG. 3 is a block diagram of the functional configuration of the search system of FIG. 2. First, the WWW browser 16 on the user side is provided with a search condition specifying unit 26. The search condition specifying unit 26 of the present invention lets the user directly designate, as the search condition, a document file obtained over the Internet, by e-mail, or the like, and transmits the content of the designated file via the Internet/intranet 14 and the WWW server 18 to the document search unit 30 of the search machine 20.

【００３４】 Besides the file-designated search condition newly provided by the present invention, the search condition specifying unit 26 can also specify the following search conditions:
(1) keyword search;
(2) detailed search, in which keywords are specified separately for the document title, author, and body text;
(3) sentence search, in which entering everyday words or sentences finds documents whose body text is related; and
(4) similar-document search, which uses an existing document already registered in the search database 22 as the search condition.

【００３５】 The search machine 20 provided on the WWW server 18 side has a search database creation unit 28, a document search unit 30, and a document reference unit 32. The search database creation unit 28 creates an index and registers it in the search database 22.

【００３６】 That is, for each search target document 25 collected and stored in the document database 24, the search database creation unit 28 extracts the important words appearing in the document, and creates and stores an index consisting of the set of extracted words. The index of course also records the document name, storage location, and similar information for the search target document.

【００３７】 When a file is designated as the search condition, the document search unit 30 generates keywords from the file content transmitted by the search condition specifying unit 26 of the WWW browser 16 and matches them against the sets of important words contained in the index of the search database 22. It thereby finds documents similar to the document in the file designated on the WWW browser 16, and the search result is returned from the WWW server 18 to the WWW browser 16 for display.

【００３８】 When the user selects a document to view from the document list sent to the WWW browser 16 as the search result, the selection is passed via the WWW server 18 to the document reference unit 32, which fetches the requested document from the document database 24 and returns it to the WWW browser 16.

【００３９】 FIG. 4 shows in detail the functional configuration of the document search unit 30 of the present invention, provided in the search machine 20 of FIG. 3.

【００４０】 In FIG. 4, the document search unit 30 is provided with a search-designated file storage unit 34, a text extraction processing unit 36, a morphological analysis unit 38, a keyword creation unit 40, and a search execution unit 42. The search database 22 stores an index 52, created by the search database creation unit 28 of FIG. 3, that consists of the set of important words, the document name, the storage location, and so on for each of the search target documents 25 in the document database 24.

【００４１】 The search-designated file storage unit 34 of the document search unit 30 stores the file content transmitted as a result of file designation in the search condition specifying unit 26 of the WWW browser 16 of FIG. 3.

【００４２】 When transferring the file content, the WWW browser 16 side cuts out the leading portion of the document file designated as the search condition, for example the first 1 KB, and transmits it to the WWW server 18 side together with the search request.

【００４３】 Fixing the amount of file data sent as the search condition, for example at 1 KB, keeps the load of transferring document content to the search machine 20 constant regardless of the size of the designated document file, and makes the search processing by the file designation search unit 30 of the search machine 20 stable and fast.
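As a minimal sketch of the fixed-size transfer described above, the client side can simply cut the first 1 KB from the designated file before sending it. The function names, the UTF-8 encoding, and the sample data are assumptions made for illustration; the text only gives 1 KB as an example size.

```python
HEAD_LIMIT = 1024  # 1 KB, the example size given in the text


def head_of_file(data: bytes, limit: int = HEAD_LIMIT) -> bytes:
    """Return at most `limit` leading bytes of the file content."""
    return data[:limit]


def decode_head(head: bytes, encoding: str = "utf-8") -> str:
    """Decode the head; a multi-byte character cut at the boundary is dropped.

    UTF-8 is an assumption here, the text does not name an encoding.
    """
    return head.decode(encoding, errors="ignore")


# Hypothetical sample data: one short document and one long document.
small = "短い文書".encode("utf-8")
large = ("検索" * 1000).encode("utf-8")

# The transferred size never exceeds the limit, whatever the file size.
print(len(head_of_file(small)), len(head_of_file(large)))
```

Cutting on a byte boundary can split a multi-byte character, which is why the sketch decodes with `errors="ignore"`; a real implementation might instead cut on a character boundary.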

【００４４】 The text extraction processing unit 36 extracts the text from the file content, designated as the search condition, that is stored in the search-designated file storage unit 34. The document files designated as search conditions in the WWW browser 16 come in a variety of formats, such as e-mail text files, HTML files from the Internet, and Excel files containing tabulated lists. So that the search function can be offered regardless of these format differences, the text extraction processing unit 36 extracts only the text from document files of the various formats, and that text is used as the search condition.
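The format-independent text extraction described above can be sketched as follows for two of the named formats, plain text and HTML, using only the Python standard library. The function name `extract_text` and the `file_type` labels are hypothetical, and Excel files are left out because they would need a third-party reader.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(content: str, file_type: str) -> str:
    """Return only the text of a file; `file_type` labels are hypothetical."""
    if file_type == "html":
        collector = _TextCollector()
        collector.feed(content)
        collector.close()
        return " ".join(collector.parts)
    if file_type == "text":
        return content.strip()
    raise ValueError(f"unsupported file type: {file_type}")


page = "<html><head><style>p{}</style></head><body><h1>Title</h1><p>Body text</p></body></html>"
print(extract_text(page, "html"))
```

Because only the leading portion of a file is transferred, the HTML may be cut off mid-tag; `HTMLParser` tolerates incomplete markup, which is one reason it is used in this sketch.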

【００４５】 The morphological analysis unit 38 that follows uses morphological analysis to extract the nouns contained in the extracted text. The nouns extracted by the morphological analysis unit 38 are sent to the keyword creation unit 40, which extracts the important nouns for use in creating the keywords.

【００４６】 To extract the important words, the keyword creation unit 40 first counts, for each noun, the occurrence count H, that is, the number of documents in which the noun appears out of the N documents registered in the index 52 of the search database 22.
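The occurrence count H is a per-noun document count over the index. A minimal sketch, assuming the index is held as a mapping from document name to its set of important words (this in-memory layout and the sample data are illustrative assumptions, not the patent's storage format):

```python
def document_frequency(nouns, index):
    """Count, for each noun, the number H of indexed documents containing it.

    `index` maps a document name to the set of important words for that
    document (an assumed in-memory layout).
    """
    unique_nouns = dict.fromkeys(nouns)  # de-duplicate, keep first-seen order
    return {
        noun: sum(1 for words in index.values() if noun in words)
        for noun in unique_nouns
    }


# Hypothetical index of N = 3 documents.
index = {
    "doc1": {"search", "index", "keyword"},
    "doc2": {"search", "server"},
    "doc3": {"search", "index"},
}
counts = document_frequency(["search", "index", "browser", "search"], index)
print(counts)
```

Note that H counts documents, not total occurrences: a noun repeated many times in one document still contributes 1 for that document.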

【００４７】 Once the document occurrence count H in the index 52 has been obtained, the words whose H lies within a predetermined range, for example (2N/3) ≥ H ≥ 1, are selected. Of the words selected in this way, the 10 with the largest H are chosen for keyword creation. A query expression joining the 10 chosen important words with logical OR is then created and supplied to the search execution unit 42.
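The selection rule and the OR query can be sketched as below. The range 1 ≤ H ≤ 2N/3 and the top-10 cut come from the text; the alphabetical tie-break and the literal `" OR "` query syntax are assumptions, since the text does not specify either.

```python
def select_keywords(doc_freq, n_docs, top=10):
    """Keep words with 1 <= H <= 2N/3, then take the `top` words by largest H.

    Ties are broken alphabetically (an assumption for determinism).
    """
    upper = 2 * n_docs / 3
    candidates = [(word, h) for word, h in doc_freq.items() if 1 <= h <= upper]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates[:top]]


def build_or_query(keywords):
    """Join the keywords with logical OR; the textual syntax is hypothetical."""
    return " OR ".join(keywords)


# Hypothetical counts for an index of N = 6 documents, so the upper bound is 4.
doc_freq = {"retrieval": 6, "index": 4, "morpheme": 2, "keyword": 4, "fujitsu": 0}
selected = select_keywords(doc_freq, n_docs=6)
print(build_or_query(selected))
```

In this example "retrieval" is dropped because it appears in every document (H = 6 > 4) and so cannot discriminate between documents, while "fujitsu" is dropped because it appears in none (H = 0) and cannot match anything.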

【００４８】 Based on the query expression supplied by the keyword creation unit 40, the search execution unit 42 matches against the index 52 of the search database 22 and extracts, as the search result, the index entries that satisfy a predetermined degree of similarity. The search result is transmitted by the WWW server 18 to the WWW browser 16 side, where the user can view it as a document list.
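One way to picture the search execution step is the sketch below. The text does not define how the degree of similarity is computed, so the score used here (the fraction of query keywords found in a document's important-word set) and the threshold value are purely illustrative assumptions, as are the index layout and sample data.

```python
def search(keywords, index, min_similarity=0.5):
    """Return (name, location, score) for entries meeting the threshold.

    Score = matched query keywords / total query keywords. This measure is
    an assumption; the source only says "a predetermined degree of similarity".
    """
    wanted = set(keywords)
    if not wanted:
        return []
    results = []
    for name, entry in index.items():
        matched = wanted & entry["words"]  # OR semantics: any keyword may match
        score = len(matched) / len(wanted)
        if matched and score >= min_similarity:
            results.append((name, entry["location"], score))
    results.sort(key=lambda item: (-item[2], item[0]))  # best match first
    return results


# Hypothetical index entries: important words plus storage location.
index = {
    "a.html": {"words": {"index", "keyword", "search"}, "location": "/docs/a.html"},
    "b.txt": {"words": {"index"}, "location": "/docs/b.txt"},
    "c.txt": {"words": {"weather"}, "location": "/docs/c.txt"},
}
hits = search(["index", "keyword", "morpheme"], index)
for name, location, score in hits:
    print(name, location, round(score, 2))
```

The returned name and location correspond to what the index records for each document, which is what lets the browser present the result as a list of links.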

【００４９】 The document search unit 30 can also perform a document search that uses the property information of the file designated as the search condition and stored in the search-designated file storage unit 34. For this purpose, when a document file is designated as the search condition, the search condition specifying unit 26 of the WWW browser 16 extracts the property information of the designated file and transmits it to the search machine 20 side together with the leading portion of the document, for example the first 1 KB.

【００５０】 In the document search unit 30 of FIG. 14, in addition to extracting text from the file content, extracting nouns by morphological analysis, and creating keywords by selecting important words from the nouns, items such as the creation date, author, and title are extracted from the property information attached to the file content stored in the search-designated file storage unit 34. The keyword creation unit 40 includes this property information in the keywords, and the search execution unit 42 searches the index 52 of the search database 22.

【００５１】 FIG. 5 illustrates the index creation processing performed by the search database creation unit 28 provided in the search machine 20 of FIG. 3. In the search database creation unit 28, a robot 54 collects documents 66 from the external document databases 46 and 50 and stores them in a temporary file 62, and at the same time adds the collected documents 66 to the list in a collected document list file 64.

[0052] The robot 54 then passes processing to the text extraction unit 56. The text extraction unit 56 takes each collected document 66 from the collected document list file 64 and stores the extracted text in the extracted text file 68.

[0053] Processing is then passed to the important word extraction unit 58. The important word extraction unit 58 extracts nouns from the corresponding text document in the extracted text file 68 by morphological analysis, counts the frequency of occurrence of each noun, extracts, for example, the ten most frequent words as important words, and stores them in the important word file 70.
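By way of illustration only, the frequency-based selection described above can be sketched as follows. The sketch assumes that morphological analysis has already produced a list of nouns; the function name `extract_important_words` and the sample data are hypothetical and do not appear in the original disclosure.

```python
from collections import Counter


def extract_important_words(nouns, top_n=10):
    """Return the top_n most frequent nouns, most frequent first.

    `nouns` is assumed to be the output of a morphological analyzer
    (one entry per noun occurrence in the document).
    """
    counts = Counter(nouns)
    return [word for word, _count in counts.most_common(top_n)]


# Hypothetical noun list for one collected document.
nouns = ["search", "document", "index", "document", "search", "document"]
important = extract_important_words(nouns, top_n=2)
```

The value of ten used in the embodiment is only an example, so the sketch exposes it as the `top_n` parameter.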

[0054] Processing is then passed to the index creation unit 60. The index creation unit 60 takes the set of, for example, the top ten important words for the document from the important word file 70, creates an index to which the document name and storage location are added, and saves it in the search database 22 as index information.
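As a non-limiting sketch, one index record of the kind described above (important words plus document name and storage location) might be represented as shown below. The `IndexEntry` type, its field names, and the example URL are assumptions made for illustration; the disclosure does not specify a record layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One index record: a document's important words, name, and location."""
    document_name: str
    location: str
    important_words: tuple


def build_index_entry(document_name, location, important_words, top_n=10):
    # Keep at most top_n important words, preserving their ranking order.
    return IndexEntry(document_name, location, tuple(important_words[:top_n]))


entry = build_index_entry(
    "news.html", "http://example.com/news.html", ["search", "document"]
)
```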

[0055] FIG. 6 is a flowchart of the browser processing in which the WWW browser 16 of FIG. 3 specifies search conditions and displays search results. When the user opens the search function of the WWW browser 16, a search screen is displayed in step S1, and in step S2 the user uses this screen to specify a search condition that designates a document file.

[0056] In step S3, the process checks whether a search has been started. When a search start is detected, step S4 checks whether it is a file-specified search. If it is, the process proceeds to step S5, where the file specified by the user is read, and in step S6 the first 1 KB of the specified file is transmitted to the server together with a search request message.
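The client-side step of sending only the leading portion of the specified file can be sketched as follows. This is a minimal illustration under stated assumptions: the request is modeled as a plain dictionary, and the names `read_head`, `build_search_request`, and the `"file_search"` marker are hypothetical rather than part of the disclosed message format.

```python
import io

HEAD_LIMIT = 1024  # 1 KB, the example size given in the embodiment


def read_head(stream, limit=HEAD_LIMIT):
    """Read at most `limit` bytes from the start of a binary stream."""
    return stream.read(limit)


def build_search_request(file_name, stream, limit=HEAD_LIMIT):
    # Hypothetical request layout: a kind marker, the file name,
    # and the leading bytes of the specified file.
    return {
        "kind": "file_search",
        "file_name": file_name,
        "head": read_head(stream, limit),
    }


# An in-memory stream stands in for the file chosen by the user.
request = build_search_request("news.txt", io.BytesIO(b"x" * 3000))
```

Files shorter than the limit are sent whole, since `read` simply returns the bytes that are available.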

[0057] If the search is not a file-specified search, a search request message corresponding to another type of search, for example a keyword search, is transmitted to the server in step S7. After the leading portion of the specified file has been transmitted to the server in step S6, the process waits in step S8 to receive the search result.

[0058] When the search result is received from the server in step S8, the process proceeds to step S9, where display and operation processing for the search result is performed and the user views the results. The processing of steps S1 to S9 is repeated until a search end instruction that closes the search screen is given in step S10.

[0059] FIG. 7 shows the specific procedure and the screens when a document file is specified as the search condition in the browser processing of FIG. 6.

[0060] In FIG. 7, the user has first obtained the document file 72 to be specified as the search condition, for example from the Internet. After viewing its contents, the user saves the contents of the document file 72 in advance to a designated file, for example "news.txt", in order to search for documents similar to the document file 72.

[0061] The user then opens the keyword input screen 74, which provides a keyword input section 76, a file specification section 78, a reference button 80, and a search execution button 82. When the user presses the reference button 80 on the keyword input screen 74, the file selection dialog 84 is displayed.

[0062] Because the document file 72 to be specified as the search condition has been saved, it appears in the file selection dialog 84. Clicking the file name "news.txt" with the mouse selects it and sets the selected file name "news.txt" in the file specification section 78 of the keyword input screen 74.

[0063] Once the file has been specified in the file specification section 78, pressing the search execution button 82 transmits the first 1 KB of the content of the document file "news.txt" specified as the search condition to the server together with the search request.

[0064] FIG. 8 is a flowchart of the server search processing implemented by the document search unit 30 of FIG. 4. In this processing, the document file specified as the search condition is read in step S1, and text is extracted from the document file in step S2. In step S3, nouns are extracted from the extracted text by morphological analysis. In step S4, the number of occurrences H is counted for each word extracted as a noun, that is, the number of documents, out of the N documents in the index 52 of the search database 22, in which the word appears.

[0065] Once the number of occurrences H of each word in the index has been counted, step S5 first selects the words whose H is at least 1 and at most 2N/3, and from these selects the ten words with the largest H as the important words to be used as keywords. In step S6, a query expression is generated by joining the ten selected important words with logical OR.
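Steps S4 to S6 can be sketched as follows, purely as an illustration. The sketch assumes that the index is available as a list of per-document sets of important words and that the query is rendered as a string with the literal token `OR`; both representations, and the names `select_keywords` and `build_query`, are assumptions not taken from the disclosure.

```python
def select_keywords(nouns, index, top_n=10):
    """Pick keywords by document frequency H, keeping 1 <= H <= 2N/3.

    `nouns` are the nouns extracted from the query file; `index` is a
    list of sets, one set of important words per indexed document.
    """
    n_docs = len(index)
    upper = 2 * n_docs / 3
    doc_freq = {}
    for word in dict.fromkeys(nouns):  # unique nouns, original order kept
        h = sum(1 for entry in index if word in entry)
        if 1 <= h <= upper:
            doc_freq[word] = h
    # Largest H first; the sort is stable, so ties keep their input order.
    ranked = sorted(doc_freq, key=lambda w: doc_freq[w], reverse=True)
    return ranked[:top_n]


def build_query(keywords):
    # Join the selected keywords with logical OR.
    return " OR ".join(keywords)


index = [
    {"search", "document", "index"},
    {"search", "index"},
    {"search", "robot"},
]
nouns = ["search", "index", "robot", "index", "banana"]
keywords = select_keywords(nouns, index)
query = build_query(keywords)
```

In this example N = 3, so the upper bound is 2. The word "search" appears in all three documents and is dropped as too common, while "banana" appears in none and is dropped as unmatched.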

[0066] In step S7, the index of the search database is searched using the query expression generated as the search keywords, and the index entries having a predetermined similarity to the generated keywords are compiled into a list as the retrieved documents. In step S8, the search result is transmitted to the browser.
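The text does not define how the predetermined similarity is computed. As one possible reading, offered only as an assumption, the sketch below scores each index entry by the number of query keywords it shares and keeps entries that reach a threshold. The dictionary layout of an entry, the function name `search_index`, and the `min_matches` parameter are all hypothetical.

```python
def search_index(keywords, index, min_matches=1):
    """Return (name, location) pairs for entries sharing enough keywords.

    With min_matches=1 this behaves like a logical OR over the keywords;
    a higher threshold stands in for a stricter similarity requirement.
    """
    wanted = set(keywords)
    hits = []
    for entry in index:
        matches = len(wanted & set(entry["important_words"]))
        if matches >= min_matches:
            hits.append((matches, entry["name"], entry["location"]))
    # Best-matching documents first; ties keep index order.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [(name, location) for _matches, name, location in hits]


index = [
    {"name": "a.html", "location": "/docs/a.html", "important_words": ["index", "robot"]},
    {"name": "b.html", "location": "/docs/b.html", "important_words": ["index"]},
    {"name": "c.html", "location": "/docs/c.html", "important_words": ["banana"]},
]
results = search_index(["index", "robot"], index)
```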

[0067] FIG. 9 shows the details of the text extraction processing in step S2 of FIG. 8. In this processing, the extension of the document file is examined in step S1. If the file extension shows in step S2 that the file is an HTML document, the process proceeds to step S3, where the data inside the body tag of the HTML document is extracted as the text data body and the tag data is removed.

[0068] Taking the HTML file of FIG. 10(A) as an example, the data within the body element, which is delimited by tags in angle brackets (< >), is taken out as the text data body and the tag data is removed, which yields the extracted text document shown in FIG. 10(B).
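The body-text extraction described above can be illustrated with Python's standard `html.parser` module. This is a minimal sketch, not the disclosed implementation; the class name `BodyTextExtractor`, the helper `extract_body_text`, and the sample markup are assumptions introduced for the example.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text found inside <body>, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self._in_body = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside the body element.
        if self._in_body and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_body_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><title>T</title></head>"
    "<body><h1>Doc</h1><p>Hello <b>world</b></p></body></html>"
)
body_text = extract_body_text(sample)
```

Text in the head element, such as the title, is skipped because it falls outside the body element.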

[0069] Next, in step S4, the property information of the file managed by the OS is acquired. This property information includes, for example, the file owner and the document type.

[0070] FIG. 11 is an example of the property information of a document file obtained from the Internet. This property information contains the document title "About the Document Management System", the creation date, the modification date, and so on, and these property data are acquired for keyword generation.

[0071] If, on the other hand, step S2 finds that the file is not an HTML document but, for example, an Excel document, the file is passed to a document library in step S5 to obtain the text data. Then, in step S6, a property information acquisition function obtains the file property information set for each document, for example the creator and the document title.

[0072] FIG. 12 shows an Excel file as an example of a file other than an HTML file that is specified as the search condition in the present invention. When the Excel file of FIG. 12 is passed to the document library to obtain text data, the text written in the Excel document is extracted, as shown in the extracted text document of FIG. 13.

[0073] The text obtained from the HTML or Excel document by this text extraction processing is combined with the text obtained from the property information. Nouns are extracted from the combined text by morphological analysis in step S3 of FIG. 8. In steps S4 and S5, the top ten important words are selected as keywords by referring to the index of the database. A query expression is then created and the index of the database is searched to obtain the search result.

[0074] Whether the property information acquired in steps S4 and S6 of the text extraction processing of FIG. 9 is used can be selected by the user in the WWW browser 16. Whether to use it depends on the user's judgment of how far the search results should be narrowed down.

[0075] The present invention also provides a computer-readable recording medium on which is recorded a document information search program that causes the search machine 20 of FIG. 4 to execute the processing functions of the document search unit 30. Embodiments of this recording medium include removable portable recording media such as CD-ROMs and floppy disks, the storage device of a program provider that supplies the program over a communication line, and memory devices such as the RAM or hard disk of a processing device in which the program has been installed.

[0076] The document information search program that is provided by the recording medium and implements the functions of the document search unit 30 of FIG. 4, specifically a program comprising steps that execute the processing of the flowcharts of FIGS. 8 and 9, is loaded into a processing device such as a server and executed in its main memory.

[0077] When the document information search program of the present invention loaded on the server side receives a service request from the client side, it uploads to the client 12 a WWW browser function for specifying a search condition by file designation, which enables the user to use the search system.

[0078] Although the above embodiment takes a server-client search system as an example, the present invention is not limited to this and may take a host-terminal form or any other suitable system form. The present invention is also not limited to the above embodiment and includes appropriate modifications that do not impair its objects and advantages. Furthermore, the present invention is not limited by the numerical values given in the above embodiment.

[0079] (Supplementary Note 1) A document information search device that searches for document information and responds based on a search request received via a network, wherein the search request source is provided with a search condition specifying unit that specifies a file as a search condition and transmits the content of the specified file via the network, and the search side is provided with a document search unit that generates keywords from the file content transmitted from the search condition specifying unit and searches a database for similar documents. (1)

[0080] (Supplementary Note 2) The document information search device according to Supplementary Note 1, wherein the search condition specifying unit transmits the leading portion of the specified file content.

[0081] (Supplementary Note 3) The document information search device according to Supplementary Note 1, wherein the files that the search condition specifying unit specifies as the search condition include HTML files and Excel files.

[0082] (Supplementary Note 4) The document information search device according to Supplementary Note 1, wherein the database stores, for each document, index information listing the important words extracted from the search target document, and the document search unit of the server comprises: a text extraction processing unit that extracts text from the file content received with the search request; a morphological analysis unit that extracts nouns by morphological analysis of the text; a keyword generation unit that extracts important words from the nouns and generates keywords joined by logical OR; and a search execution unit that searches for similar documents by searching the search database with the keywords and notifies the client of the search result. (2)

[0083] (Supplementary Note 5) The document information search device according to Supplementary Note 4, wherein the keyword generation unit counts, for each noun, the number of documents in which the noun appears among the per-document indexes stored in the document database, and generates the keywords by selecting a predetermined number of the top-ranked words whose number of occurrences falls within a predetermined range. (3)

[0084] (Supplementary Note 6) The document information search device according to Supplementary Note 5, wherein, where N is the number of documents in the index, the keyword generation unit generates the keywords by selecting the top ten words whose number of occurrences H falls within the range 2N/3 ≥ H ≥ 1. (4)

[0085] (Supplementary Note 7) The document information search device according to Supplementary Note 5, wherein the keyword generation unit includes in the keywords the property information extracted from the file received with the search request and causes the search to be performed with those keywords. (5)

[0086] (Supplementary Note 8) The document information search device according to Supplementary Note 7, wherein the property information is the creator, the document title, and the like of the file received with the search request.

[0087] (Supplementary Note 9) The document information search device according to Supplementary Note 1, wherein the search condition specifying unit of the search request source is provided by the WWW browser of a client, and the file content specified on the search request screen of the WWW browser is transmitted via the network to the search machine of a WWW server and passed to the document search unit.

[0088] (Supplementary Note 10) A document information search device comprising: a database that stores, for each document, index information listing the important words extracted from a search target document; a text extraction processing unit that extracts text from the file content received through a search request from a network in which a document file is specified as the search condition; a morphological analysis unit that extracts nouns by morphological analysis of the text; a keyword generation unit that extracts important words from the nouns and generates keywords joined by logical OR; and a search execution unit that searches for similar documents by searching the database with the keywords and notifies the request source of the search result. (6)

[0089] (Supplementary Note 11) The document information search device according to Supplementary Note 10, wherein the keyword generation unit counts, for each noun, the number of documents in which the noun appears among the per-document indexes stored in the document database, and generates the keywords by selecting a predetermined number of the top-ranked words whose number of occurrences falls within a predetermined range.

[0090] (Supplementary Note 12) The document information search device according to Supplementary Note 10, wherein the database stores, together with the index information, property information extracted from the search target documents, and the keyword generation unit performs the search with the property information extracted from the file received with the search request included in the keywords. (7)

[0091] (Supplementary Note 13) A document information search method for searching for document information and responding based on a search request received via a network, wherein index information listing the important words extracted from a search target document is stored in a database for each document; when a file is specified as the search condition at the search request source, the content of the specified file is transmitted to the server via the network together with the search request; and on the search side, text is extracted from the file content received with the search request, nouns are extracted by morphological analysis of the text, important words are then extracted from the nouns to generate keywords joined by logical OR, similar documents are searched for by searching the database with the keywords, and the search result is returned. (8)

[0092] (Supplementary Note 14) The document information search method according to Supplementary Note 13, wherein, in generating the keywords, the number of documents in which each noun appears among the per-document indexes stored in the database is counted, and the keywords are generated by selecting a predetermined number of the top-ranked words whose number of occurrences falls within a predetermined range.

[0093] (Supplementary Note 15) The document information search method according to Supplementary Note 14, wherein the search is performed with the property information extracted from the file received with the search request included in the keywords. (9)

[0094] (Supplementary Note 16) A computer-readable recording medium storing a document information search program comprising the steps of: receiving a search request in which a document file is specified as the search condition; extracting text from the file content received with the search request; extracting nouns by morphological analysis of the text; extracting important words from the nouns and generating keywords joined by logical OR; and searching for similar documents by searching a database with the keywords and notifying the request source of the search result. (10)

[0095] (Supplementary Note 17) The recording medium according to Supplementary Note 16, wherein the keyword generating step of the document information search program counts, for each noun, the number of documents in which the noun appears among the per-document indexes stored in the database, and generates the keywords by selecting a predetermined number of the top-ranked words whose number of occurrences falls within a predetermined range.

[0096] (Supplementary Note 18) The recording medium according to Supplementary Note 14, wherein the document information search program further comprises a step of performing the search with the property information extracted from the file received with the search request included in the keywords.

[0097] (Supplementary Note 19) A document information search program that causes a computer to execute the steps of: receiving a search request in which a document file is specified as the search condition; extracting text from the file content received with the search request; extracting nouns by morphological analysis of the text; extracting important words from the nouns and generating keywords joined by logical OR; and searching for similar documents by searching a database with the keywords and notifying the request source of the search result. (11)

[0098]

[Effects of the Invention] As described above, according to the present invention, when a user obtains a document containing content of interest by e-mail, from the Internet, or the like, the user can search for documents with similar content simply and quickly by specifying the document file itself directly as the search condition. This eliminates the laborious entry of keywords based on the document content and lets the user find similar documents very efficiently.

[0099] In addition, when the keywords needed for a file-specified document search are generated, important words are extracted not only from the document content but also from the property information of the document file and included in the keywords. This narrows down the similarity search over the existing documents registered in the database more appropriately and improves search accuracy.

[Brief description of the drawings]

FIG. 1 is a diagram illustrating the principle of the present invention.

FIG. 2 is an explanatory diagram of the system configuration of the present invention.

FIG. 3 is a block diagram of the functional configuration of the present invention.

FIG. 4 is a block diagram of the document search unit according to the present invention.

FIG. 5 is an explanatory diagram of the processing of the search database creation unit of FIG. 3.

FIG. 6 is a flowchart of the browser processing of FIG. 3.

FIG. 7 is an explanatory diagram of the search request operation of the present invention in which a document file is specified as the search condition.

FIG. 8 is a flowchart of the server search processing of the present invention.

FIG. 9 is a flowchart of the text extraction processing of FIG. 8.

FIG. 10 is an explanatory diagram of extracting a text document from an HTML file by the processing of FIG. 8.

FIG. 11 is an explanatory diagram of the property information provided in an HTML file used for a search according to the present invention.

FIG. 12 is an explanatory diagram of an Excel document from which text is extracted by the processing of FIG. 8.

FIG. 13 is an explanatory diagram of the text document extracted from the Excel document of FIG. 12.

[Explanation of symbols]

10: Server
12: Client
14: Internet/intranet
16: WWW browser
18: WWW server
20: Search machine
22: Search database
24, 46, 50: Document database
25: Search target document
26: Search condition specifying unit
28: Search database creation unit
30: Document search unit
32: Document reference unit
34: Search designation file storage unit
36: Text extraction processing unit
38: Morphological analysis unit
40: Keyword creation unit
42: Search execution unit
44, 48: Document management server
54: Robot
56: Text extraction unit
58: Important word extraction unit
60: Index creation unit
62: Temporary file
64: Collected document list file
66: Document
68: Extracted text file
70: Important word file

Claims

[Claims]

1. A document information search device that searches for document information and responds based on a search request received via a network, wherein the search request source is provided with a search condition specifying unit that specifies a file as a search condition and transmits the content of the specified file via the network, and the search side is provided with a search machine that generates keywords from the file content transmitted from the search condition specifying unit and searches a database for similar documents.

2. The document information search device according to claim 1, wherein the database stores, for each document, index information listing important words extracted from the search target document, and the search machine comprises: a text extraction processing unit that extracts text from the file content received with the search request; a morphological analysis unit that extracts nouns by morphological analysis of the text; a keyword generation unit that extracts important words from the nouns and generates keywords joined by logical OR; and a search execution unit that searches for similar documents by searching the search database with the keywords and notifies a client of the search result.

3. The document information search device according to claim 2, wherein the keyword generation unit counts, for each noun, the number of documents in which the noun appears among the per-document indexes stored in the document database, and generates the keywords by selecting a predetermined number of the top-ranked words whose number of occurrences falls within a predetermined range.

4. The document information search device according to claim 3, wherein the keyword generation unit generates the keyword by selecting the top 10 words whose number of occurrences H lies in the range 2N/3 ≥ H ≥ 1, where N is the number of documents in the index.

5. The document information search device according to claim 3, wherein the keyword generation unit includes in the keyword property information extracted from the file received with the search request, so that the search also uses the property information.

6. A document information search device comprising: a database that stores, for each document, index information listing important words extracted from the search target documents; a text extraction processing unit that extracts text from file content received with a search request from a network, the search request specifying as a search condition a document file that is not registered in the search database; a morphological analysis unit that extracts nouns by morphological analysis of the text; a keyword generation unit that extracts important words from the nouns and generates a keyword by connecting them with logical OR; and a search execution unit that searches for similar documents by searching the database with the keyword and notifies the request source of the search result.

7. The document information search device according to claim 6, wherein the keyword generation unit includes in the keyword property information extracted from the file received with the search request, so that the search also uses the property information.

8. A document information search method for searching for document information and responding based on a search request received via a network, the method comprising: storing in a database, for each document, index information listing important words extracted from the search target documents; when a file is specified as a search condition at the search request source, transmitting the content of the specified file to the search side via the network together with the search request; and, on the search side, extracting text from the file content received with the search request, extracting nouns by morphological analysis of the text, extracting important words from the nouns and generating a keyword by connecting them with logical OR, searching for similar documents by searching the database with the keyword, and returning the search result.

9. The document information search method according to claim 8, wherein property information extracted from the file received with the search request is included in the keyword used for the search.

10. A computer-readable recording medium storing a document information search program comprising: a step of receiving a search request in which a document file is specified as a search condition; a step of extracting text from the file content received with the search request; a step of extracting nouns by morphological analysis of the text; a step of extracting important words from the nouns and generating a keyword by connecting them with logical OR; and a step of searching a database with the keyword for similar documents and notifying the request source of the search result.

11. A document information search program that causes a computer to execute: a step of receiving a search request in which a document file is specified as a search condition; a step of extracting text from the content of the file received with the search request; a step of extracting nouns by morphological analysis of the text; a step of extracting important words from the nouns and generating a keyword by connecting them with logical OR; and a step of searching a database with the keyword for similar documents and notifying the request source of the search result.
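
The keyword generation and search steps recited in claims 2 to 4 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that the nouns have already been extracted from the query file (text extraction and morphological analysis are out of scope here), that the index is a plain mapping from document name to its important-word list, and that the "number of occurrences H" is the number of indexed documents containing the noun. The claims do not state how the top 10 words are ranked, so ranking by H in descending order is also an assumption. All function names are hypothetical.

```python
from collections import Counter


def document_frequency(index, word):
    """Count the indexed documents whose important-word list contains `word`."""
    return sum(1 for words in index.values() if word in words)


def select_keywords(nouns, index, top_k=10):
    """Pick up to `top_k` nouns whose occurrence count H satisfies 1 <= H <= 2N/3."""
    n_docs = len(index)
    query_counts = Counter(nouns)
    candidates = []
    for word, count in query_counts.items():
        h = document_frequency(index, word)
        # Integer form of 1 <= H <= 2N/3 to avoid floating-point comparison.
        if h >= 1 and 3 * h <= 2 * n_docs:
            candidates.append((word, h, count))
    # Assumed ranking: larger H first, then frequency in the query file,
    # then alphabetical order so the result is deterministic.
    candidates.sort(key=lambda c: (-c[1], -c[2], c[0]))
    return [word for word, _, _ in candidates[:top_k]]


def build_or_query(keywords):
    """Connect the selected words with logical OR."""
    return " OR ".join(keywords)


def search(index, keywords):
    """Return documents matching at least one keyword, best matches first."""
    wanted = set(keywords)
    hits = {doc: len(wanted & set(words)) for doc, words in index.items()}
    matched = [doc for doc, n in hits.items() if n > 0]
    return sorted(matched, key=lambda doc: (-hits[doc], doc))


# Toy index: six documents, so words found in more than four are dropped.
index = {
    "doc1": ["document", "search", "keyword"],
    "doc2": ["document", "search", "network"],
    "doc3": ["document", "printer"],
    "doc4": ["document", "keyword", "network"],
    "doc5": ["document", "database"],
    "doc6": ["printer", "search"],
}
nouns = ["document", "search", "keyword", "search", "scanner", "network"]

keywords = select_keywords(nouns, index)
print(build_or_query(keywords))
print(search(index, keywords))
```

In this toy data, "document" appears in five of six indexed documents and is rejected by the upper bound 2N/3, while "scanner" appears in none and is rejected by the lower bound. The upper bound therefore filters out words too common to discriminate between documents, and the lower bound filters out words that cannot match anything in the index.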