JPH06348757A

JPH06348757A - Device and method for retrieving document

Info

Publication number: JPH06348757A
Application number: JP5135590A
Authority: JP
Inventors: Sachiko Koyama; 幸子小山; Tadahiro Kiyama; 忠博木山; Hiroshi Tsuji; 洋辻; Satoshi Asakawa; 悟志浅川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1993-06-07
Filing date: 1993-06-07
Publication date: 1994-12-22

Abstract

PURPOSE: To enable high-speed full-text retrieval by creating a compressed file from the body file and using the frequency, within the body file, of the keyword designated by the searcher. CONSTITUTION: The device has a word division part 1, an appearance frequency detection part 2 and a frequency header preparation part 3. At database registration, the frequency information of each document is obtained, and a management data file (e) and a compressed data file (g) with frequency information are registered in the database. A document information acquisition part 11 is also provided. Because only the part of the compressed file that matches the frequency designated by the user is treated as the retrieval target, full-text retrieval is fast. In addition, for words other than the keyword, the total frequency across the retrieved documents or the number of documents in which each word appears can be presented as a retrieval result, and retrieval noise can be reduced.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a document retrieval method and apparatus for extracting, from a document database, documents that contain a keyword designated by a user.

[0002]

[Prior Art] As computer processing speed has increased, document retrieval has moved from index search to full-text search systems that use free words. Representative systems include the full-text search method and apparatus of Japanese Patent Laid-Open No. H03-058311 (Central Research Laboratory receipt number 319001484); the full-text search system based on the hierarchical pre-search method described in Proceedings of the 45th National Convention of the Information Processing Society of Japan (3), 3-239-244 (Hitachi, Bibliotheca/TS); and the full-text database system described in Technical Research Report DE90-34 of the Institute of Electronics, Information and Communication Engineers (Matsushita, Kenzo-kun).

[0003] In the hierarchical pre-search method described above, each document consists of three files: a character component table that records, in one bit per character, whether each character appears in the document; compressed data created from the text file by eliminating duplicates of words that appear repeatedly; and the body data of the document. When a keyword is given, the character component table is searched first, and documents that do not contain the characters specified in the keyword are excluded from the search.
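The character-level screening described in this paragraph can be pictured with a minimal Python sketch. It models the character component table as a set of characters rather than an actual bit table, and the names `build_char_table` and `may_contain` are illustrative assumptions, not taken from the cited system.

```python
def build_char_table(text):
    """One entry per distinct character: present or absent in the document."""
    return frozenset(text)


def may_contain(table, keyword):
    """False: the document certainly lacks the keyword.
    True: every character is present, so a later stage must still check."""
    return all(ch in table for ch in keyword)


# Hypothetical two-document collection.
docs = {1: "network computer", 2: "rice harvest"}
tables = {number: build_char_table(text) for number, text in docs.items()}

# Only documents whose table contains every character of the keyword survive.
candidates = [n for n, table in tables.items() if may_contain(table, "rice")]
```

The table can only rule documents out. Document 1 passes the check for "crew" although the word does not occur in it, which is why the word-level and body-level stages are still needed.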

[0004] Next, a word-level search is performed on the compressed text to narrow the candidates further, and the text of the body data is searched only when necessary. For example, if only a character string is specified, the search finishes by referring to the compressed text alone, without referring to the body data. If the number of characters between two keywords is specified (proximity search), the documents containing both keywords are first narrowed down by searching the compressed text, and the body text is then searched to check the character distance between the two keywords.

[0005] In the full-text search method and apparatus of Japanese Patent Laid-Open No. H03-058311, the body character string of a registered document is divided by character type, such as hiragana and kanji. The inclusion relationships among the resulting partial character strings are then examined, and the set of partial character strings that remains after strings contained in other strings have been eliminated is used as the compressed data.
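As a rough illustration of this kind of compression, the following Python sketch splits a string into runs of the same character type and then drops every run that is contained in a longer run. The classification by Unicode character name, the handling of non-kanji runs, and the function names are assumptions made for the example. The cited publication may differ in all of these details.

```python
import itertools
import unicodedata


def char_type(ch):
    """Coarse character class derived from the Unicode character name."""
    name = unicodedata.name(ch, "")
    for prefix, label in (("HIRAGANA", "hiragana"),
                          ("KATAKANA", "katakana"),
                          ("CJK UNIFIED", "kanji")):
        if name.startswith(prefix):
            return label
    return "other"


def compress(text):
    """Split text into same-type runs, then drop runs contained in another run."""
    runs = {"".join(group) for _, group in itertools.groupby(text, key=char_type)}
    return {r for r in runs if not any(r != other and r in other for other in runs)}


# The runs are 文書, を, 検索, する and 文書検索; the first and third are
# substrings of the last one and are therefore removed.
compressed = compress("文書を検索する文書検索")
```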

[0006]

[Problems to Be Solved by the Invention] The conventional hierarchical pre-search method has the following problems.
(1) A document that contains the specified keyword many times and a document that contains it only once are treated equally.
(2) When the specified keyword matches part of a word, documents containing words the user did not intend are also hit. For example, when "コメ" (kome, "rice") is specified as the keyword, documents containing "コメント" (komento, "comment") are also included in the search results.
(3) When the number of hits becomes very large, the user wants to narrow the search results, but no information other than the number of hits is available about the result set.
Regarding (2), the Kenzo-kun system mentioned above analyzes the text obtained as the search result in a post-processing step, but this significantly increases the search time.

[0007] An object of the present invention is to provide a system that divides each document into words using a word dictionary when the document is registered in the database, calculates the word frequencies, and reflects them in the compressed text, thereby (1) allowing the number of occurrences of a keyword in a document to be included in the search condition and (2) reducing search noise without degrading the response time at search. By further adding document information acquisition means, the system also (3) gives the user clues for narrowing the search results.

[0008]

[Means for Solving the Problems] To achieve the above object, the document retrieval apparatus of the present invention performs word division (morphological analysis) of the body data by word division means, and then, by frequency detection means, creates compressed text in which repeated words are removed and the words are arranged in order of frequency.

[0009] Next, header creation means creates, together with the compressed text, a frequency header that indicates where in the compressed text the words of each frequency are located, and registration means registers the body data, the compressed data, and the frequency header in the database. The document retrieval apparatus of the present invention further comprises document information acquisition means for acquiring the information that makes up the hit documents. The frequency information of the words contained in the obtained search result set is determined by the appearance frequency detection means and presented to the user.

[0010]

[Operation] The document retrieval apparatus according to the present invention first divides the document to be registered into words, calculates the appearance frequency of each word in the document, and creates a word appearance frequency table in which the words are sorted by appearance frequency. Next, it creates a frequency header that indicates where in the word appearance frequency table the words with each frequency or higher are located (for example, an index file holding information such as "words with a frequency of 10 or more extend up to the 6th entry of the compressed text"). For each document, the frequency header and the headwords of the word appearance frequency table are registered in the document database as compressed text with frequency information, and the text data is registered as a full-text text file.
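The registration-time processing described here can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: word division is replaced by a plain whitespace split, ties between words of equal frequency are broken alphabetically, and the function names are hypothetical.

```python
from collections import Counter


def build_compressed_text(words):
    """Return (headwords, freqs): distinct words in descending frequency order."""
    counts = Counter(words)
    # Assumption: equal-frequency words are ordered alphabetically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [w for w, _ in ordered], [f for _, f in ordered]


def build_frequency_header(freqs):
    """Map each frequency f to the number of headwords with frequency >= f."""
    header = {}
    for position, f in enumerate(freqs, start=1):
        header[f] = position  # the last position seen for f is its cumulative count
    return header


words = "a b a c a b d".split()  # stand-in for the output of word division
headwords, freqs = build_compressed_text(words)
header = build_frequency_header(freqs)
```

In this sample, `header` states that words with frequency 3 or more end at entry 1 of the compressed text, words with frequency 2 or more end at entry 2, and all four headwords have frequency 1 or more.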

[0011] When the document database created by the above method is searched and an appearance frequency is specified for a keyword (for example, "find documents that contain 'コメ' five or more times"), the frequency file of the compressed text is consulted and the position corresponding to that frequency is obtained as the reference end position in the compressed file. The search execution unit then reads the compressed text up to that end position and determines whether the keyword is included.
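A frequency-restricted lookup of this kind might look like the following Python sketch. It assumes the compressed text is a list of headwords in descending frequency order and the frequency header is a mapping from a frequency to the cumulative number of headwords with at least that frequency. Both representations and the function names are assumptions made for illustration.

```python
def end_position(header, min_freq):
    """Number of leading headwords whose frequency is at least min_freq."""
    return max((pos for f, pos in header.items() if f >= min_freq), default=0)


def matches(headwords, header, keyword, min_freq=1):
    """Compare the keyword only against the part of the compressed text
    that can satisfy the frequency condition."""
    limit = end_position(header, min_freq)
    return keyword in headwords[:limit]  # whole-word comparison, no substring match


headwords = ["a", "b", "c", "d"]  # frequencies 3, 2, 1, 1
header = {3: 1, 2: 2, 1: 4}

hit = matches(headwords, header, "b", min_freq=2)   # scans 2 entries
miss = matches(headwords, header, "c", min_freq=2)  # "c" lies beyond the limit
```

The higher the requested frequency, the shorter the scanned prefix, which is the source of the speed-up claimed for frequency-qualified searches.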

[0012] As a result, when a frequency condition is specified, the search can be performed faster than with the conventional method. In particular, when the specified keyword frequency is high, the amount of compressed text that must be referenced decreases, so the search is fast and the user can carry out fine-grained searches that match the request. The search results can also be sorted quickly for display in a way that reflects the specified keywords.

[0013] Furthermore, the document retrieval apparatus according to the present invention accepts user requests such as "show 30 of the words contained in search result set 1, in ascending order of the number of documents in which they appear". In response to such a request, each headword, its frequency, and the number assigned to the document in the database are written from the compressed file of each hit document, as one record, to an appearance frequency table with document numbers. Then, using the appearance frequency tables with document numbers obtained from all hit documents, the total appearance frequency of each word across the documents and the number of documents in which it appears are determined and displayed according to the user's instructions.
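The aggregation over hit documents can be sketched in Python as below. The record layout (headword, frequency, document number) follows the paragraph, while the sample records, the function name, and the alphabetical tie-breaking are illustrative assumptions.

```python
from collections import defaultdict


def summarize(records):
    """records: iterable of (headword, frequency, document_number).
    Returns {headword: (total_frequency, number_of_documents)}."""
    total = defaultdict(int)
    docs = defaultdict(set)
    for word, freq, doc in records:
        total[word] += freq
        docs[word].add(doc)
    return {word: (total[word], len(docs[word])) for word in total}


# Hypothetical records collected from the compressed files of two hit documents.
records = [("user", 3, 1), ("network", 2, 1), ("user", 5, 2), ("rice", 1, 2)]
summary = summarize(records)

# "Show 30 words in ascending order of the number of documents they appear in".
top = sorted(summary, key=lambda w: (summary[w][1], w))[:30]
```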

[0014] As a result, the user can grasp an overview of the hit documents as a whole without referring to the body of each hit document individually, and can obtain appropriate clues for a narrowing search.

[0015]

[Embodiments] A first embodiment of the present invention is described in detail below with reference to FIGS. 1 to 24. FIG. 1 shows an outline of the first embodiment. The unit that creates compressed data with frequency information consists of a word division unit 1, an appearance frequency detection unit 2, and a header creation unit 3. First, the word division unit 1 divides the input document data a into words and registers the result in the word division table b. The appearance frequency detection unit 2 refers to the word division table b, calculates the appearance frequency of each word, rearranges the words in order of appearance frequency, and registers them in the word appearance frequency table c. The header creation unit 3 refers to the word appearance frequency table c, creates a cumulative frequency distribution of the frequencies, and registers it in the frequency distribution table d. The registration unit refers to the document data a, the word appearance frequency table c, and the frequency distribution table d, and registers the document data in the full-text text file f and the frequency distribution table d and the word appearance frequency table c in the compressed data g with frequency information. In addition, for each document, the document number, the start address of the full text, the start address of the word appearance frequency table, and the start address of the frequency distribution table are registered in the management data file e.

[0016] The search command creation unit 5 acquires the search instruction character string h entered by the user, converts it into a search command, and registers it in the search command table i. The search execution unit 6 refers to the search command table i and searches the document database, which consists of the management data file e, the full-text text data file f, and the compressed data file g with frequency information. The search results are registered in the search result table j. The result organizing unit 7 sorts the search result table j using the keyword appearance frequency as the key and registers the result in the sorted search result table k.

[0017] As is clear from the figure, the word division unit 1, the appearance frequency detection unit 2, the frequency header creation unit 3, the registration unit 4, the search command creation unit 5, the search execution unit 6, and the search result organizing unit 7 represent processes, while the document data a, the word division table b, the word appearance frequency table c, the frequency distribution table d, the document database (management file e, full-text text file f, compressed file g with frequency information), the search result table j, and the sorted search result table k are files (also called tables). Thus, in this embodiment, each functional block is implemented by program logic. Each functional block can therefore be implemented as an LSI, which makes it possible to speed up the document processing apparatus.

[0018] FIG. 2 is a block diagram showing the overall hardware configuration of the document retrieval apparatus of FIG. 1. The input/output device 8 inputs data and displays various information. The processor 9 executes the processes of FIG. 1 according to programs. The storage device 10 stores the various data and programs of FIG. 1. The storage device 10 has the following storage areas: working areas a, b, c, d, h, i, j, and k, which serve as memory for the processes executed by the processor 9; a word division unit storage area 100; an appearance frequency detection unit storage area 200; a header creation unit storage area 300; a registration unit storage area 400; document database storage areas e, f, and g; a search command creation area 500; a search execution unit storage area 600; and a search result organizing unit storage area 700. Each program stored in the storage device 10 is executed by the processor, and the input/output device 8 is used during execution as necessary.

[0019] FIG. 3 is a PAD (Problem Analysis Diagram) showing the processing procedure of the word division unit 1 in FIG. 1, from acquiring the document data a to storing it in the word division table b. The processing is described below following the PAD. The document data a is referenced, and the following processing is performed from the first document to the last document (step 101). First, the data for one document is acquired (step 102), the document data is divided into words (step 103), the headword character strings and the relative document number are stored in the word division table (step 104), and the processing target is moved to the next document (step 105). Through these steps, the document data a shown in FIG. 4 is stored in the word division table b shown in FIG. 5.

[0020] FIG. 4 is an example of the document data a. FIG. 5 is an example of the word division table b, which consists of the fields document number b1 and headword b2. FIG. 6 is a PAD showing the processing procedure of the appearance frequency detection unit 2, from acquiring data from the word division table b to storing data in the word appearance frequency table c. The following processing is performed from the data of the first document to the data of the last document in the word division table b (step 201). First, from the first headword to the last headword of the word division table (step 202), one record is read (step 203). Next, in step 204, the word division file is searched for records with the same headword and the frequency is calculated. In step 205, the headwords are sorted in descending order of appearance frequency, and in step 206 the sorted word appearance frequency records are stored in the word appearance frequency table c. In step 207, the processing target is moved to the next record.

[0021] FIG. 7 is an example of the appearance frequency table c created by the appearance frequency detection unit 2. It consists of the fields headword character string c1, frequency c2, and document number c3. FIG. 8 is a PAD showing the processing procedure of the frequency header creation unit 3, from acquiring data from the word appearance frequency table c to storing data in the frequency distribution table d. The following processing is performed from the first document to the last document of the document data (step 301). In step 302, the variable count, which is used to obtain the cumulative count of the frequencies, is initialized. Next, from the first record to the last record of each document's word frequency table (step 303), while records with the same frequency continue (step 304), a record is read (step 305), 1 is added to count (step 306), and the processing target is moved to the next record (step 307). When the frequency of the record changes at step 304 (that is, when it decreases, because the data are arranged in descending order of word appearance frequency), the frequency and the value of count are written to the frequency distribution table in step 308. The value of count represents the cumulative count for that frequency. Next, in step 309, the processing target is moved to the next record. In step 310, the processing target is moved to the next document.

[0022] FIG. 9 is an example of the frequency distribution table d created by the header creation unit. It consists of the fields frequency index d1, cumulative count d2, and document number d3. This example is the frequency distribution table created for the example of FIG. 7, and it shows that document 1 contains one word with an appearance frequency of 7, no words with an appearance frequency of 6, one word with an appearance frequency of 5, and two words with a frequency of 4.
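The cumulative counting of FIG. 8 and the reading of FIG. 9 given above can be reproduced with a short Python sketch. The input list contains only the four frequencies named in the text (7, 5, 4, 4), and any further entries of FIG. 7 are unknown and omitted. The function names are hypothetical, and whether the actual table also stores a row for the unused frequency 6 is not stated, so the sketch writes one row per frequency that occurs.

```python
def frequency_distribution(freqs):
    """freqs: word frequencies of one document in descending order.
    Returns [(frequency, cumulative_count), ...], one row per distinct frequency."""
    rows = []
    count = 0
    for i, f in enumerate(freqs):
        count += 1
        is_last_of_run = i + 1 == len(freqs) or freqs[i + 1] != f
        if is_last_of_run:
            rows.append((f, count))  # written when the frequency is about to change
    return rows


def words_with_frequency(rows, f):
    """Number of words whose frequency is exactly f, recovered by differencing."""
    previous = 0
    for freq, cumulative in rows:
        if freq == f:
            return cumulative - previous
        previous = cumulative
    return 0


rows = frequency_distribution([7, 5, 4, 4])
per_frequency = {f: words_with_frequency(rows, f) for f in (7, 6, 5, 4)}
```

Differencing consecutive cumulative counts yields one word at frequency 7, none at 6, one at 5, and two at 4, which agrees with the description of document 1.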

[0023] FIG. 10 is a PAD showing the processing procedure of the registration unit 4, from acquiring the document data a, the word appearance frequency table c, and the frequency distribution table d to storing them in the files e, f, and g of the document database. The steps are described in order below. In step 401, the data for registration is acquired. The following processing is performed from the first item to the last item of the registration data (step 402). First, the document number is acquired in step 403. Next, the type of data is determined in step 404. If it is the document data a, the document data is stored in the full-text text data file (step 405). If it is the word appearance frequency table c, the headword character string c1 and the frequency c2 are stored in the compressed data file g (step 406). If it is the frequency distribution table d, the frequency d1 and the cumulative count d2 are registered in the frequency header file g as the frequency information of the compressed data file (step 407). In step 408, the start address of the stored file is written to the record of the management data file e whose document number matches. In step 409, the processing target is moved to the next document.

[0024] FIG. 11 is an example of the management data file registered by the registration unit 4. One record consists of the fields document number e1, compressed data start address e2, frequency header start address e3, and full-text data start address e4. There are as many records as there are documents registered in the database.

[0025] FIG. 12 shows an example of the compressed data g with frequency information registered by the registration unit 4. For each document, it consists of a document number g1, a frequency header g2, and compressed text g3. There are as many records as there are documents registered in the database. FIG. 13 shows an example of a search character string entered by the user on the input screen displayed via the input/output device 8.

[0026] FIG. 14 shows an example of a search character string entered by the user on the input screen displayed via the input/output device 8. In this example, the user specifies frequency information. The query means "find documents that contain each of the keywords 'ネットワーク' (network) and 'コンピュータ' (computer) two or more times".

[0027] FIG. 15 shows an example of a search character string entered by the user on the input screen displayed via the input/output device 8. In this example, the user performs a proximity search in units of words.

[0028] FIG. 16 is a PAD showing the processing procedure of the search command creation unit 5, which takes the search instruction character string h as input and performs the processing up to storing a search command in the search command table i. Step 501 determines whether the input mode is command mode or natural language mode. In command mode, the grammar is checked in step 502, and if there is an error, termination processing is performed in step 503. If step 501 determines that the input mode is natural language, semantic analysis is performed in step 504 and a command is generated in step 505.

[0029] Next, the command type is determined in step 506. A command related to search is stored in the search command table (step 507), and a command related to document information acquisition is stored in the document information acquisition table (step 508). The commands created from the search instruction character strings shown in FIGS. 13 to 15 are all commands related to search.

[0030] FIG. 17 shows an example of the search command table. The three commands correspond to the search instruction inputs of FIGS. 13, 14, and 15, respectively. FIG. 18 is a PAD showing the processing in the search execution unit 6. The search execution unit 6 searches the document database e, f, g according to the search command stored in the search command table i and stores the result in the search result table j. The processing is described in order below. First, a search command is acquired from the search command table in step 601, the hit count is initialized in step 602, and the hit document number array is initialized in step 603. Next, the variable that counts hit documents is initialized in step 604, and the array that stores the document numbers of hit candidates is initialized in step 605. The following processing is performed from the first document to the last document of the document database (step 606). First, for each keyword specified by the user in the search command, from the first keyword to the last (step 607), the compressed data file with frequency information is searched (step 608). If the number of hit candidates is greater than 0 in step 609, processing proceeds to step 610, and if the number of specified keywords is 1, hit document confirmation processing is performed in step 611. If two or more keywords are specified, processing proceeds to multi-keyword processing (logical operations between keywords) in step 612, and if the value of the hit candidate count variable is greater than 0 in step 613, hit document confirmation processing is performed in step 611.
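The overall loop of FIG. 18 can be approximated by the following Python sketch. It is a simplification under stated assumptions: the compressed data of each document is modelled as a mapping from headword to frequency, the logical operation between keywords is taken to be AND, every minimum frequency is at least 1, and the sample database and function name are invented for illustration.

```python
def search(database, conditions):
    """database: {document_number: {headword: frequency}}.
    conditions: list of (keyword, min_frequency) pairs, combined with AND.
    Returns the document numbers that satisfy every condition."""
    hits = []
    for doc_number, table in database.items():       # first to last document
        satisfied = [kw for kw, min_freq in conditions
                     if table.get(kw, 0) >= min_freq]  # per-keyword lookup
        if len(satisfied) == len(conditions):         # logical AND between keywords
            hits.append(doc_number)
    return hits


# Hypothetical database of three documents.
database = {
    1: {"network": 3, "computer": 2, "user": 3},
    2: {"network": 1, "computer": 4},
    3: {"computer": 2},
}

both_twice = search(database, [("network", 2), ("computer", 2)])
any_computer = search(database, [("computer", 1)])
```

The first query mirrors the request of FIG. 14, in which each keyword must occur two or more times.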

[0031] FIG. 19 is a PAD showing the procedure of the search processing 608 on the compressed data with frequency information. This processing searches the compressed data with frequency information for the keyword specified by the user, stores the document number of each document whose data contains the keyword in the hit candidate document number array as a candidate hit document, and uses the hit candidate count variable to obtain the number of candidate documents. If the user has specified a frequency, only the part of the compressed file that corresponds to that frequency is searched. First, the compressed data for one document is acquired in step 60801. If a frequency is specified in step 60802, processing proceeds to step 60803, where the frequency header is read. The reference start position of the compressed data is read in step 60804, and the reference end position is acquired in step 60805. For example, when the specified keyword count is greater than 3 and less than 8, for the document with document number 1 shown in FIG. 12 the search start position in the compressed file is the 1st entry and the search end position is the 4th entry, and for the document with document number 2 the search start position is the 3rd entry and the search end position is the 10th entry. If no frequency is specified, the first headword of the compressed data is set as the reference start position in step 60806, and the last headword is set as the final reference position in step 60807. Next, from the reference start position to the reference end position of the compressed file (step 60808), one headword is read from the compressed file in step 60809 and compared with the search keyword in step 60810. If the two match exactly, processing proceeds to step 60811. In step 60811, the hit candidate count variable is incremented by 1. In step 60812, the document number is stored in the first column of the row of the hit candidate array indicated by the hit candidate count variable. In step 60813, the frequency of the keyword is obtained by referring to the frequency header, and this value is stored in the second column of the same row. For example, when the keyword is "ユーザ" (user) and the compressed file is as shown in FIG. 12, row 1, column 1 of the hit candidate array holds 1, the document number, and row 1, column 2 holds 3, the frequency. Row 2, column 1 holds 2, the document number, and row 2, column 2 holds 5, the frequency. In step 60814, processing proceeds to step 6098 and referencing of the compressed text is discontinued. If the search keyword and the headword of the compressed file do not match in step 60810, processing proceeds to step 60818 and the processing target is moved to the next headword of the compressed file.
Proceeding to step 60818, the processing target is moved to the next heading of the compressed file.
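The frequency-restricted scan described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the layout of the frequency header is not given in this section, so the code assumes a hypothetical header made of `(frequency, number_of_headwords)` pairs in descending frequency order, with the headwords stored in the same order. The names `reference_range`, `frequency_at`, and `search_compressed` are invented for the example.

```python
from typing import Dict, List, Optional, Tuple

Header = List[Tuple[int, int]]  # assumed layout: (frequency, headword count), descending


def reference_range(header: Header, min_freq: Optional[int],
                    max_freq: Optional[int]) -> Tuple[int, int]:
    """Return the half-open slice of headword positions whose frequency
    lies strictly between min_freq and max_freq (None means unbounded)."""
    start = end = pos = 0
    started = False
    for freq, count in header:
        in_range = ((min_freq is None or freq > min_freq) and
                    (max_freq is None or freq < max_freq))
        if in_range:
            if not started:
                start, started = pos, True
            end = pos + count
        pos += count
    return (start, end) if started else (0, 0)


def frequency_at(header: Header, index: int) -> int:
    """Look up the frequency of the headword at a given position."""
    pos = 0
    for freq, count in header:
        pos += count
        if index < pos:
            return freq
    raise IndexError(index)


def search_compressed(documents: Dict[int, Tuple[List[str], Header]], keyword: str,
                      min_freq: Optional[int] = None,
                      max_freq: Optional[int] = None) -> List[Tuple[int, int]]:
    """Build the hit candidate array as (document number, frequency) rows."""
    hits = []
    for doc_no, (headwords, header) in documents.items():
        start, end = reference_range(header, min_freq, max_freq)
        for i in range(start, end):
            if headwords[i] == keyword:      # exact match only
                hits.append((doc_no, frequency_at(header, i)))
                break                        # stop scanning this document
    return hits
```

Because the headwords are assumed to be sorted by descending frequency, the positions that satisfy a frequency condition form one contiguous run, which is why a single start and end position suffice.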

[0032] Next, step 60816 determines whether there is at least one hit candidate; if there are none, the process proceeds to step 60817 and ends the search. Finally, step 60818 moves processing to the next keyword.

[0033] Because this compressed file search method treats only an exact match between the keyword and a headword as a hit candidate, when the user specifies "コメ" (kome, "rice") as the keyword as shown in FIG. 13, documents containing "コメント" (komento, "comment") are not hit. In addition, because the compressed file is stored in order of frequency and a frequency distribution table is also provided, searches conditioned on frequency can be performed at high speed.

[0034] FIG. 20 is a PAD diagram showing the procedure of the hit document confirmation process 611, which confirms hit candidate documents as hit documents and writes the search results to the search result table. Step 6111 takes the values of the hit candidate number array as the hit document number array, and step 6112 takes the value of the hit candidate count variable as the number of hits. Step 6113 stores in the search result table, as one record, the hit document number, the start address of the full text, the frequency, and a predetermined number of bytes of the body data starting from the position where the keyword first appears.

[0035] FIG. 21 is a PAD diagram showing the procedure of the multiple keyword process 613. This process is used when the user has specified multiple keywords and a logical operation is required among the hit candidate documents obtained for each keyword. Step 61301 executes the logical operation specified by the searcher using the document numbers in the first column of the hit candidate arrays. Step 61302 rewrites the number of hit candidates based on the result of the logical operation, and step 61303 rewrites the hit candidate array based on the same result. Step 61304 determines whether a proximity condition is present; if not, the process proceeds to the hit document confirmation process 611. If a proximity condition such as that shown in FIG. 15 is specified, the following is performed for each row of the hit candidate array, from the first to the last (step 61305). First, step 61305 refers to the document number stored in the first column of the hit candidate array. Step 61307 acquires the full-text data for that document number. Step 61308 performs word division with the word division unit. Step 61309 refers to the word division table b, which holds the word division result, and determines whether the positional relationship of the keywords matches the specified one; if it does not, step 61310 removes the document number from the hit candidate array. Finally, step 61311 moves processing to the next row of the array.
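The two stages of the multiple keyword process can be illustrated with a small sketch. It rests on assumptions not stated in this section: the logical operations are taken to be AND, OR, and NOT over document numbers, and the proximity condition is modeled as a maximum distance in words between two keywords. The function names `combine_hits` and `satisfies_proximity` are hypothetical.

```python
from typing import List, Sequence, Tuple

HitArray = Sequence[Tuple[int, int]]  # rows of (document number, frequency)


def combine_hits(hits_a: HitArray, hits_b: HitArray, op: str) -> List[int]:
    """Apply a logical operation to the document numbers (first column)."""
    docs_a = {doc_no for doc_no, _freq in hits_a}
    docs_b = {doc_no for doc_no, _freq in hits_b}
    if op == "AND":
        docs = docs_a & docs_b
    elif op == "OR":
        docs = docs_a | docs_b
    elif op == "NOT":
        docs = docs_a - docs_b
    else:
        raise ValueError(f"unsupported operation: {op}")
    return sorted(docs)


def satisfies_proximity(words: Sequence[str], kw_a: str, kw_b: str,
                        max_gap: int) -> bool:
    """Check, on the word division result of one document, whether the two
    keywords occur within max_gap word positions of each other."""
    pos_a = [i for i, w in enumerate(words) if w == kw_a]
    pos_b = [i for i, w in enumerate(words) if w == kw_b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)
```

A document that survives `combine_hits` but fails `satisfies_proximity` would be the one removed from the hit candidate array in step 61310.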

[0036] FIG. 22 shows an example of the search result table, which consists of the items document number j1, start address j2, frequency j3, and text portion j4.

[0037] FIG. 23 is a PAD diagram showing the processing procedure of the search result organizing unit 7. The search result organizing unit 7 reads the search result table j, sorts it based on the frequency information, and stores the result as the sorted search result table j. First, from the first record to the last record of the search result table (step 701), step 702 acquires a record and step 703 acquires the sort key. When the input is a search result table, the frequency is used as the sort key. Step 704 sorts according to the key, and step 705 stores the result as the sorted search result table j.

[0038] FIG. 24 shows an example in which the result of sorting the input shown in FIG. 22 by the search result organizing unit 7 is output to the input/output device 8.

[0039] The second embodiment of the present invention is described in detail below with reference to FIGS. 25 to 37. The same reference numerals in FIGS. 1 to 35 denote the same items. FIG. 25 shows the second embodiment of the present invention. The second embodiment presents, according to the user's specification, frequency information for the words contained in the documents obtained as search results, taken from the compressed data file created in the first embodiment.

[0040] FIG. 25 is a functional diagram outlining the second embodiment. The command creation unit 5 refers, via the interface control unit 20, to the document information acquisition instruction input p from the user and stores the result in the document information acquisition command table q. The document information acquisition unit 11 refers to the document information acquisition command table q and the search result table j and acquires information from the files of the document database. If the user has specified the sentences in which the keyword appears as the target of information acquisition, the unit acquires the full text from the full-text file f, extracts the text in which the keyword appears, and stores the result in the text data table r. If the user has specified the entire hit documents as the target, the unit converts the compressed data file g into the format of the word appearance frequency table and stores it in the word appearance frequency table c. The data stored in the text data table r is divided into words by the word division unit 1, and the result is stored in the word division table b. The appearance frequency detection unit 2 refers to the word division table b, obtains the frequency of the words appearing in each sentence of the document, and stores it in the appearance frequency table with document counts s. The document number detection unit 12 refers to the word appearance frequency table c and obtains, for each word, the total frequency across documents and the number of documents in which it appears. The document output control unit 13 refers to the document information acquisition command table q, extracts the information specified by the user from the appearance frequency table with document counts s, and stores it in the document information table t; the result is presented via the interface control unit 20. When a record of the presented document information table is selected with an input device such as a mouse or touch panel, it is sent as a narrowing instruction u via the interface control unit 20 to the command creation unit, which refers to the document information table t, creates a command, and stores it in the search command table g. For a narrowing instruction, the narrowing flag is set. The search execution unit 6 performs the search by referring to the search command table g. When the narrowing flag is set, the search is performed only on the hit documents, by referring to the search result table j.

[0041] FIG. 26 is a block diagram showing the overall hardware configuration of the document retrieval apparatus in FIG. 25. The input/output device 8 inputs data and displays various information. The processor 9 executes the processing in FIG. 24 based on programs. The storage device 10 stores the various data and programs in FIG. 24. The storage device 10 further has the following storage sections: working areas b, c, h, j, p, q, r, s, and t, which serve as memory for the processes executed by the processor 9; a word division unit storage area 100; an appearance frequency detection unit storage area 200; document database storage areas e, f, and g; a search command creation area 500; a search execution unit storage area 600; a document information acquisition unit storage area 1100; a document number detection unit storage area 1200; a document information output control unit storage area 1300; and an interface control unit storage area. Each program stored in the storage device 10 is executed by the processor, using the input/output device 8 as needed.

[0042] FIG. 27 shows an example p of document information acquisition instruction input. FIG. 28 also shows an example p of document information acquisition instruction input. FIG. 29 shows an example in which the instruction input examples p shown in FIGS. 27 and 28 are input to the command creation unit 5 and output to the document information acquisition command table q.

[0043] FIG. 30 is a PAD diagram outlining the processing in the document information acquisition unit 11. The document information acquisition unit 11 refers to the document table q in FIG. 25, acquires information from the search result table j and the files e, f, and g of the document database, and performs the processing up to storing the word frequency data in the word frequency table c.

[0044] First, from the first document to the last document of the search results (step 1101), the document number is acquired from the search result table c (step 1102). Next, step 1103 determines whether the target specified in the document information acquisition command table q is frequency or text. If it is frequency, the process proceeds to step 1104, which determines whether the frequency is specified as an absolute value or by the keyword used in the search. If the frequency is specified by keyword, the frequency of the search keyword in each document is looked up in the compressed data with frequency information e (step 1105). Step 1106 acquires the read end position of the compressed data from the management data file e. Then, from the beginning of the compressed data to the read end position (step 1107), a word is read (step 1108) and its frequency is acquired (step 1109), and step 1110 writes the document number, headword, and frequency to the appearance frequency table as one record. If the information acquisition target in step 1103 is text, step 1111 reads the text from the full-text file; then, from the first keyword to the last keyword appearing in the text (step 1112), step 1113 acquires the sentence containing the keyword (from one full stop to the next) and step 1114 stores the document number and the sentence in the text data table r. This text data is divided into words by the word division unit 1 and stored in the word division table b; the appearance frequency detection unit 2 then calculates the appearance frequencies, and the result is stored in the word appearance frequency table c.
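The text branch (steps 1111 to 1114) amounts to cutting the full text at sentence boundaries and keeping the sentences that contain the keyword. The sketch below is one possible reading, assuming the boundary is a single delimiter character (the Japanese full stop "。" by default) and that containment is tested by simple substring matching; `sentences_with_keyword` is a hypothetical name.

```python
from typing import List, Tuple


def sentences_with_keyword(doc_no: int, text: str, keyword: str,
                           delimiter: str = "。") -> List[Tuple[int, str]]:
    """Return (document number, sentence) rows for the text data table,
    one row per sentence that contains the keyword."""
    rows = []
    for sentence in text.split(delimiter):
        sentence = sentence.strip()
        if sentence and keyword in sentence:
            rows.append((doc_no, sentence))
    return rows
```

The returned rows correspond to the two columns of the text data table r (document number and text); the delimiter itself is dropped from each stored sentence in this sketch.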

[0045] FIG. 31 shows an example of the text data table r, which consists of document number r1 and text r2. FIG. 32 is a PAD diagram outlining the processing in the document number detection unit 12. The document number detection unit 12 refers to the word appearance frequency table c, obtains the total appearance frequency of each headword and the number of documents in which it appears, and stores the result in the appearance frequency table with document counts s. From the first record to the last record of the word frequency table c (step 1201), step 1202 searches for records having the same headword, step 1203 counts the number of documents having that headword, step 1204 counts the total frequency of the headword across documents, and step 1205 stores the headword, total frequency, and number of documents as one record in the appearance frequency table with document counts. Step 1206 excludes the retrieved records from subsequent processing, and step 1207 moves processing to the next record.
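The aggregation performed by the document number detection unit can be sketched in a single pass, under the assumption that the word appearance frequency table is available as `(document number, headword, frequency)` rows. The PAD procedure above scans for matching records and then excludes them; the sketch swaps in a dictionary-based grouping that is intended to produce the same totals. `count_documents` is a hypothetical name.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def count_documents(rows: Iterable[Tuple[int, str, int]]) -> List[Tuple[str, int, int]]:
    """Return (headword, total frequency, number of documents) records."""
    total = defaultdict(int)   # headword -> total frequency across documents
    docs = defaultdict(set)    # headword -> set of document numbers
    for doc_no, headword, freq in rows:
        total[headword] += freq
        docs[headword].add(doc_no)
    # Records come out in order of first appearance of each headword.
    return [(word, total[word], len(docs[word])) for word in total]
```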

[0046] FIG. 33 shows an example of the appearance frequency table with document counts, which consists of three items: headword s1, total frequency s2, and number of documents s3. FIG. 34 is a PAD diagram outlining the processing in the document output control unit 13, which refers to the appearance frequency table with document counts s and stores the result in the document information table t according to the document information acquisition command table q. Step 1301 reads the appearance frequency table with document counts, step 1302 acquires the sort key from the document information acquisition command table, and step 1303 performs the sort. If the document information acquisition instruction p specifies that words be displayed in order of number of documents, the table is sorted by number of documents; if it specifies display in order of total frequency, the table is sorted with total frequency as the key. Step 1304 obtains the number of words to display from the document information acquisition command table, and step 1305 stores that number of headwords, together with their total frequencies and numbers of documents, in the document information storage table t.
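Steps 1301 to 1305 reduce to a sort followed by a truncation. In the sketch below, the sort direction (descending) and the key names `"total_frequency"` and `"document_count"` are assumptions made for illustration; the section does not state either. `select_words` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

Record = Tuple[str, int, int]  # (headword, total frequency, number of documents)


def select_words(records: Sequence[Record], sort_key: str, limit: int) -> List[Record]:
    """Sort by the requested key (largest first, an assumption) and keep
    the number of words requested for display."""
    column = {"total_frequency": 1, "document_count": 2}[sort_key]
    ordered = sorted(records, key=lambda rec: rec[column], reverse=True)
    return ordered[:limit]
```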

[0047] FIG. 35 shows an example in which the contents of the document information table t are displayed on the input device 8. In this example, the user can select a displayed keyword, via a pointing device such as a touch panel or mouse, as a keyword to use for narrowing. The keyword selected by the user is sent to the interface control unit 20 as a narrowing instruction.

[0048] FIG. 36 is a PAD diagram outlining the processing of the command creation unit of the second embodiment. It shows the processing from referring to the search instruction character string h or the narrowing instruction u via the interface control unit 20, through creating a command, to storing that command in the search command table 6. Steps 510 to 511 differ from the processing of the search command creation unit of the first embodiment shown in FIG. 16; steps with the same numbers represent the same processing. That is, step 510 determines the input mode. If the input is a command, the process proceeds to step 502 and performs a grammar check. If it is natural language, the process proceeds to step 504 and performs semantic analysis. If the input mode in step 510 is a narrowing instruction, step 511 sets the narrowing flag (this processing is absent from the first embodiment) and step 505 creates the command. From step 506 onward, the processing is the same as in the first embodiment shown in FIG. 16.

[0049] FIG. 37 is a PAD diagram outlining the processing of the search execution unit; it is a modification of the processing of the search execution unit of the first embodiment shown in FIG. 18. The narrowing processing shown in steps 620 to 622 differs from the first embodiment; steps with the same numbers represent the same processing.

[0050] First, step 601 acquires the search command by referring to the search command table g, and steps 602 to 605 initialize the variables as in the first embodiment. Step 620 checks whether the narrowing flag is on or off. If it is off, steps 607 to 613 are performed on the documents of the document database, from the first to the last. If the narrowing flag is on in step 620, the hit documents are looked up in the search result table j and the processing from step 613 to step 620 is performed on them, from the first to the last; step 622 then sets the narrowing flag to off.
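The effect of the narrowing flag on the search scope can be sketched as below. This is a simplified reading that only models which documents are searched and the flag reset of step 622; the function name `target_documents` and the representation of the tables as plain lists are assumptions.

```python
from typing import List, Sequence, Tuple


def target_documents(all_doc_numbers: Sequence[int],
                     previous_hits: Sequence[int],
                     narrowing_flag: bool) -> Tuple[List[int], bool]:
    """Choose the documents to search and return the flag state afterwards."""
    if narrowing_flag:
        # Narrowing: search only the documents already in the search result table.
        targets = list(previous_hits)
    else:
        # Ordinary search: scan the whole document database.
        targets = list(all_doc_numbers)
    return targets, False  # the flag is off once the search has run
```

Restricting the scan to earlier hits is what lets a narrowing search avoid re-reading the whole database.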

【００５１】[0051]

[Effects of the Invention] According to the document search apparatus of the present invention, information on the appearance frequency, within each document, of the keyword specified by the user for the search can be reflected at high speed, and the higher the appearance frequency of the keyword specified by the searcher, the more the search time can be shortened. Furthermore, because the total appearance frequency of the words contained in the search result set and the number of documents in which they appear are calculated at high speed, the user can obtain keyword information for narrowing down the search results and can easily perform a narrowed search.

[Brief description of drawings]

FIG. 1 is a functional block diagram showing a first embodiment of a document search device according to the present invention.

FIG. 2 is a block diagram showing the hardware configuration of an embodiment of the document search device in FIG. 1.

FIG. 3 is a PAD diagram of the word division program in FIG. 1.

FIG. 4 is an example of the document data in FIG. 1.

FIG. 5 is an example of the word division table in FIG. 1.

FIG. 6 is a PAD diagram of the appearance frequency detection program in FIG. 1.

FIG. 7 is an example of the word appearance frequency table in FIG. 1.

FIG. 8 is a PAD diagram of the frequency header creation program in FIG. 1.

FIG. 9 is an example of the frequency distribution table in FIG. 1.

FIG. 10 is a PAD diagram of the registration program in FIG. 1.

FIG. 11 is an example of the management data file in FIG. 1.

FIG. 12 is an example of the compressed data file with frequency information in FIG. 1.

FIG. 13 is an example of the search instruction character string in FIG. 1, input via the input/output device in FIG. 2.

FIG. 14 is an example of the search instruction character string in FIG. 1, input via the input/output device in FIG. 2.

FIG. 15 is an example of the search instruction character string in FIG. 1, input via the input/output device in FIG. 2.

FIG. 16 is a PAD diagram of the search command creation program in FIG. 1.

FIG. 17 is an example of the search command table in FIG. 1.

FIG. 18 is a PAD diagram of the search execution program in FIG. 1.

FIG. 19 is a PAD diagram of the compressed data search program in FIG. 18.

FIG. 20 is a PAD diagram of the hit document confirmation program in FIG. 18.

FIG. 21 is a PAD diagram of the multiple keyword program in FIG. 18.

FIG. 22 is an example of the search result table in FIG. 1.

FIG. 23 is a PAD diagram of the search result organizing program in FIG. 1.

FIG. 24 is an example in which the sorted search results in FIG. 1 are output to the input/output device.

FIG. 25 is a functional block diagram showing a second embodiment of a document search device according to the present invention.

FIG. 26 is a block diagram showing the hardware configuration of an embodiment of the document search device in FIG. 25.

FIG. 27 is an example of document information output instruction input in FIG. 25, entered via the input/output device in FIG. 2.

FIG. 28 is an example of document information output instruction input in FIG. 25, entered via the input/output device in FIG. 2.

FIG. 29 is an example of the document information acquisition command table in FIG. 25.

FIG. 30 is a PAD diagram of the document information acquisition program in FIG. 25.

FIG. 31 is an example of the text data in FIG. 25.

FIG. 32 is a PAD diagram of the document number detection program in FIG. 25.

FIG. 33 is an example of the appearance frequency table with document counts in FIG. 25.

FIG. 34 is a PAD diagram of the document information output control program in FIG. 25.

FIG. 35 is an example in which the document information table in FIG. 25 is displayed on the input/output device.

FIG. 36 is a PAD diagram of the search command creation program in FIG. 25.

FIG. 37 is a PAD diagram of the search execution program in FIG. 25.

[Explanation of symbols]

1 ... word division unit, 2 ... appearance frequency detection unit, 3 ... frequency header creation unit, 4 ... registration unit, 5 ... command creation unit, 6 ... search execution unit, 7 ... search result organizing unit, 11 ... document information acquisition unit, 12 ... document number detection unit, 13 ... document information output control unit, 20 ... interface control unit, a ... document data, b ... word division table, c ... word appearance frequency table, d ... frequency distribution table, e ... management data file, f ... full-text file, g ... compressed data file with frequency information, p ... document information acquisition instruction, q ... document information acquisition command table, s ... appearance frequency table with document information, t ... document information table, u ... narrowing instruction.

Continuation of the front page: (72) Inventor Satoshi Asakawa, Software Development Division, Hitachi, Ltd., 5030 Totsuka-cho, Totsuka-ku, Yokohama-shi, Kanagawa Prefecture

Claims

[Claims]

1. A document retrieval device that accumulates a large number of documents for the purpose of retrieval, comprising: means for dividing each document data into words; means for calculating the appearance frequency of the divided words; means for rearranging the frequency information in descending order and calculating cumulative frequency distribution information of the words; means for accumulating the cumulative frequency distribution information in a document database; and means for executing a search with reference to the accumulated cumulative frequency distribution information.

2. The document retrieval apparatus according to claim 1, further comprising: document information acquisition means for obtaining, as document information, the total appearance frequency of the words contained in all document data matching the search condition and the number of documents in which each word appears; document information acquisition instruction means for obtaining a user's instruction regarding the total appearance frequency and the number of appearing documents; and a document information output control section for selecting words that match the user's instruction by referring to the document information.

3. A document retrieval apparatus, wherein the document information acquisition means according to claim 2 acquires information only in a sentence containing a keyword or a paragraph containing a keyword in document data.

4. A document retrieval apparatus, wherein the search execution means according to claim 1 sends the contents of a full-text file to be searched to a word division unit and refers to the contents of a word division table that is the result of the word division.

5. A document retrieval method using a computer, comprising: dividing each document data into words; calculating the appearance frequency of the divided words; rearranging the frequency information in descending order and calculating cumulative frequency distribution information of the words; accumulating the cumulative frequency distribution information in a document database; and executing a search by referring to the accumulated cumulative frequency distribution information.