JPH1074210A

JPH1074210A - Method and device for supporting document retrieval and document retrieving service using the method and device

Info

Publication number: JPH1074210A
Application number: JP9178500A
Authority: JP
Inventors: Yoshiki Niwa; 芳樹丹羽; Hirobumi Sakurai; 博文櫻井
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1996-07-05
Filing date: 1997-07-03
Publication date: 1998-03-17
Anticipated expiration: 2017-07-03
Also published as: JP3614618B2

Abstract

PROBLEM TO BE SOLVED: To provide a retrieval method that lets a user survey the overall picture of a retrieved document group, and to offer such retrieval as a service. SOLUTION: A feature word display means 22 is shown on a display means 2. Words that appear characteristically in the document group retrieved in response to the user's request are extracted, the mutual relations among these feature words are examined, a graph having the feature words as nodes is built, and the overall picture of the retrieval results is displayed on the means 22. By selecting words of interest or of no interest while viewing the displayed feature word graph, the user can plan the next retrieval strategy effectively.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a document search support method and apparatus having a user interface that provides an interactive guidance function for document search, and to a document search service using the same.

[0002]

2. Description of the Related Art

In document search, various interfaces between the search apparatus and the user have been devised and developed so that the user can reach the desired set of documents quickly and easily. The main ones are feedback and guidance. Feedback is a mechanism in which the user judges some items in a search result as "hit" or "miss" and then obtains a search result that reflects those judgments. Guidance is a function that, at each stage of the search, provides information considered relevant to that search, and hence useful to the user in devising or improving the search conditions.

[0003] For the guidance function, the conventional approach has generally been to present information related to the search conditions that were entered. For example, a database indicating relations between words, such as a thesaurus, is held, and words related to the words entered as search conditions are taken from the database and presented. A thesaurus is mainly tree-structured data indicating broader-narrower relations between words, but there are also methods that generate related-word data automatically from co-occurrence statistics and use that data (for example, B. R. Schatz et al., Interactive term suggestion for users of digital libraries: Using subject thesauri and co-occurrence lists for information retrieval. Proc. ACM DL96, p. 126-133). A method of displaying a search word and its related words as a network, based on co-occurrence statistics between words, has also been proposed (for example, R. H. Fowler, D. W. Dearholt, Information Retrieval Using Pathfinder Networks. In Pathfinder Associative Networks, Ablex, article 12, edited by R. W. Schvaneveldt (1990)).

[0004] However, the method of presenting information related to the search conditions has difficulty handling cases where there are several search words or where negation is used, and also has difficulty handling document searches that do not use keywords (such as associative search). One way to overcome this is to extract related information automatically from the search results and provide it to the user. For example, the Scatter/Gather method (D. Cutting et al. (1992). Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. Proc. ACM SIGIR'92, p. 318-329) automatically classifies (clusters) the retrieved documents and displays characteristic words for each class. However, the computational cost of clustering grows on the order of the square or cube of the number of documents, which makes real-time response difficult. Moreover, as the search proceeds, the differences between classes generally become subtle, so that it becomes hard to grasp the nature of a class from its characteristic words.

[0005]

SUMMARY OF THE INVENTION

The object of the present invention is to solve the above problems by displaying on the screen, in graph or list form, the characteristic words that appear characteristically in a retrieved document group, so that the topics contained in the document group can be surveyed in real time; further, to provide a document search support method and apparatus that can extract the words characteristic of a document group in a well-balanced manner, from low-frequency words to high-frequency words; and further, to allow a user who wishes to use this document search to do so from a remote location.

[0006]

MEANS FOR SOLVING THE PROBLEM

To this end, so that the topics contained in a retrieved document group can be surveyed in real time, a graph is built whose nodes are the words that appear characteristically in the document group. A link is drawn between a pair of characteristic words when they have a strong co-occurrence relation, that is, when they tend strongly to appear in the same document, and the resulting graph is displayed on the screen. In the graph display, the vertical axis represents the document frequency of the characteristic words, so that general words and highly specific words can be told apart at a glance. In the list format, the characteristic words are classified by frequency class and those with a high document frequency are arranged in the upper rows, so that highly specific words can likewise be identified at a glance. To extract characteristic words in a well-balanced manner, from low-frequency words to high-frequency words, when selecting them from the retrieved document group, the characteristic words are divided into classes by frequency of occurrence, and from each class words are extracted in descending order of frequency ratio, that is, the ratio of the document frequency in the retrieved document group to the document frequency in the entire search target.
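The class-wise extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented routine: the function name, the use of whole-collection document frequency to assign classes, and the boundary values passed in `class_bounds` are all hypothetical choices, since the text here does not fix them.

```python
from collections import defaultdict


def extract_characteristic_words(df_hits, df_all, class_bounds, per_class):
    """Pick up to `per_class` words from each frequency class.

    df_hits: word -> document frequency within the retrieved documents
    df_all:  word -> document frequency within the whole search target
    class_bounds: ascending thresholds on df_all (hypothetical values)
    """
    classes = defaultdict(list)
    for word, hit_df in df_hits.items():
        total_df = df_all.get(word, 0)
        if total_df == 0:
            continue  # no collection statistics for this word; skip it
        # Class index = number of thresholds that total_df meets or exceeds.
        cls = sum(total_df >= bound for bound in class_bounds)
        ratio = hit_df / total_df  # the "frequency ratio" of the text
        classes[cls].append((ratio, word))

    selected = {}
    for cls, items in classes.items():
        # Highest frequency ratio first; ties broken alphabetically.
        items.sort(key=lambda item: (-item[0], item[1]))
        selected[cls] = [word for _, word in items[:per_class]]
    return selected
```

Taking the top words per class, rather than the top words overall, is what keeps rare but highly specific words from being crowded out by common ones, or the reverse.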

[0007]

BEST MODE FOR CARRYING OUT THE INVENTION

Embodiment 1

A first embodiment of the present invention is described below with reference to FIGS. 1 to 20. This embodiment is a configuration example of a search apparatus built on a computer that is used standalone. The description mainly covers the case where search results are displayed as a graph. FIG. 1 shows the overall configuration of the document search apparatus of this embodiment. Reference numeral 1 denotes input means, 2 display means, 3 a CPU, 4 calculation program holding means, 5 a work area for running the calculation programs, and 6 database holding means. These means or devices are connected by a bus 100 over which they exchange signals with one another. The input means 1 comprises a keyboard 11, a mouse 12, pen input means 13, and the like. The display means 2 displays a search interface 21 and characteristic word display means 22 for guiding the search. The calculation program holding means 4 stores the routines needed to operate the document search apparatus of this embodiment: a search interface operation routine 41, a morphological analysis routine 42, a search routine 43, and a characteristic word display means operation routine 44. To extract characteristic words from the retrieved document group and display them on the characteristic word display means 22, the characteristic word display means operation routine 44 uses as subroutines a characteristic word extraction routine 441, a co-occurrence relation analysis routine 442, a graph layout routine 443, and a graph display routine 444. The work area 5 is described in detail later with reference to FIG. 2. The database holding means 6 comprises a document database 61 to be searched, an index database 62 used for searching, a word frequency database 63, and an excluded word database 64. In general, the user selects, from among databases prepared in advance, the one that suits the search purpose and uses it as the search target data. For example, to search newspaper articles, the user would purchase and use a database published by a newspaper company. The excluded word database 64, however, may in some cases be supplied by the apparatus vendor as accompanying data.

[0008] FIG. 2 shows the configuration of the work area 5 in detail. The work area 5 holds the parameters and temporary data that the routines in the calculation program holding means 4 need in order to run, and comprises a search work area 51, a characteristic word extraction work area 52, a co-occurrence relation analysis work area 53, and a graph layout work area 54. Each area contains further subdivided data areas, which are described when the associated routine comes into operation. To start a document search, the user first enters a command from the keyboard 11 to start the document search system. In response, the search interface operation routine 41 is started, and the search interface 21 for carrying out the search interactively is displayed on the display means 2. FIG. 3 shows an example of the initial screen of the search interface 21. The search interface 21 comprises a search request input section 211, a keyword display/operation section 212, a hit count display section 213, a title display section 214, a document display section 215, a search execution button 216, a characteristic word display button 217, and the like.

[0009] In this embodiment, three types of keywords are used for document search: required keywords, point-adding keywords, and point-deducting keywords. The search takes the AND of the required keywords; if no required keyword is specified, it takes the OR of the point-adding keywords. A document retrieved by the required keywords gains one point if it contains a point-adding keyword and loses one point if it contains a point-deducting keyword. However many times the same keyword appears in a document, it counts for only one point added or deducted. The keyword display/operation section 212 consists of three parts corresponding to these three keyword types. All three parts have the same structure, so the description here centers on the leftmost one, for required keywords. The required-keyword part of the keyword display/operation section 212 comprises a keyword display section 2121, a move button 21211, and a clear button 21212. The move button 21211 is used to move a keyword of another type into the required keywords, and the clear button 21212 is used to remove a keyword from the required keywords. That is, selecting a keyword shown among the point-adding or point-deducting keywords and pressing the required-keyword move button 21211 moves the selected keyword to the required keywords. Selecting a keyword shown among the required keywords and pressing the clear button 21212 removes it from the required keywords. Likewise, selecting a keyword shown among the required keywords and pressing, for example, the point-adding-keyword move button 21221 moves it to the point-adding keywords, and pressing the point-deducting-keyword move button 21231 moves it to the point-deducting keywords.

[0010] As described later, these move buttons also serve as copy buttons for copying a displayed characteristic word into the keywords. Whether a button moves or copies depends on the area in which the target word lies. To enter a search request, the user puts the search request input window 2111 of the search request input section 211 into the input-waiting state, for example by clicking it with the mouse 12, and then types the search request (required keywords, point-adding keywords, point-deducting keywords, and so on) on the keyboard 11. When the input completion button 2112 is then pressed, the character string entered in the input window 2111 is passed to the morphological analysis routine 42 and split into a sequence of words. Words registered in the excluded word database 64 are removed, and the result is stored in the default keyword storage area 5111 or 5112 (FIG. 2) within the keyword storage area 511. Here the default keyword type is the required keyword. The contents of each area are shown as a list in the keyword display section 2121 or 2122. As the example described later shows, how the string is split into words depends on the dictionary held by the morphological analysis routine 42.
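The splitting and filtering step can be illustrated with a toy stand-in. The sketch below uses greedy longest-match segmentation against a word list, which is an assumption made purely for illustration; the text does not say how the morphological analysis routine 42 works internally. The names `segment` and `to_keywords` are hypothetical.

```python
def segment(text, dictionary):
    """Split `text` by greedy longest match against `dictionary`.

    Characters that start no dictionary word become one-character words.
    """
    words = []
    max_len = max(map(len, dictionary), default=1)
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in dictionary:
                words.append(piece)
                i += n
                break
    return words


def to_keywords(text, dictionary, excluded):
    """Segment `text`, then drop words found in the excluded-word set."""
    return [word for word in segment(text, dictionary) if word not in excluded]
```

With a dictionary holding only 電子 and 出版, the string 電子出版 splits into two keywords; adding 電子出版 itself to the dictionary would keep it as one, which is the dictionary dependence the paragraph points out.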

[0011] When the search execution button 216 is pressed at this point, the search routine 43 is started. Referring to the search index database 62 (that is, data indicating which documents contain each word), it retrieves the documents that contain all the required keywords (AND), and the resulting list of document identification numbers is stored in the search result storage area 512. For each document retrieved by the required keywords, the search routine 43 also adds one point for each point-adding keyword the document contains and deducts one point for each point-deducting keyword it contains, and stores this score in the search result storage area 512 together with the document identification number. If no required keyword is specified, the search routine 43 performs the search as the OR of the point-adding keywords and then computes scores in the same way. If there is neither a required keyword nor a point-adding keyword, no search is performed even when the search execution button 216 is pressed.
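The retrieval and scoring rules of this paragraph can be sketched over a simple inverted index. This is an illustrative sketch only: the in-memory `dict` of sets standing in for the index database 62, and the names `search`, `required`, `plus`, and `minus`, are assumptions rather than anything specified in the text.

```python
def search(index, required, plus, minus):
    """Return {document id: score} following the AND/OR and scoring rules.

    index: word -> set of ids of the documents that contain the word
    """
    if required:
        # AND of the required keywords.
        hits = set.intersection(*(index.get(w, set()) for w in required))
    elif plus:
        # No required keyword: OR of the point-adding keywords.
        hits = set.union(*(index.get(w, set()) for w in plus))
    else:
        return {}  # neither kind of keyword: no search is performed

    scores = {}
    for doc in hits:
        # One point per matching keyword, however often it occurs in the document.
        gained = sum(doc in index.get(w, set()) for w in plus)
        lost = sum(doc in index.get(w, set()) for w in minus)
        scores[doc] = gained - lost
    return scores
```

Because the index records only whether a word occurs in a document, the rule that repeated occurrences still count for a single point falls out of the data structure without extra code.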

[0012] Because required keywords are combined with AND in a search, they are indispensable when the user wants to narrow the results strictly. When the aim is instead a search that misses nothing, it is better to use only point-adding keywords and search with their OR, and, where the user can anticipate terms likely to bring in unwanted matter, to set those terms as point-deducting keywords. The search routine 43 further computes a score distribution from the search results stored in the search result storage area 512 and stores it in the search result score distribution storage area 513. The score distribution is data showing how many documents received each added or deducted score.
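The score distribution is a plain tally, which might look like the following minimal sketch. The function name and the choice to list the highest score first are assumptions for illustration.

```python
from collections import Counter


def score_distribution(scores):
    """Turn {document id: score} into [(score, document count)], highest score first."""
    counts = Counter(scores.values())
    return sorted(counts.items(), key=lambda item: -item[0])
```

Listing the highest score first is convenient here because later processing starts from the best-scoring documents.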

[0013] The following takes as an example a search request with "electronic publishing" (電子出版) as a required keyword. The string "electronic publishing" is entered in the search request input window 2111 as a required keyword, and the input completion button 2112 is pressed. The morphological analysis routine 42 splits it into "electronic / publishing" (電子／出版), which is stored in the required keyword storage area 5111 and displayed on the first and second lines of the required keyword display section 2121. FIG. 4 shows the state of the search work area 51 when the search execution button 216 is pressed at this stage. In this example the required keywords are "electronic" and "publishing," so these are stored in the required keyword storage area 5111. The searcher entered no point-adding or point-deducting keywords in the search request input window 2111, so the point-adding keyword storage area 5112 and the point-deducting keyword storage area 5113 remain empty. The retrieved document numbers and their scores are stored in the search result document number storage area 512. Since there are no point-adding or point-deducting keywords in this case, all scores are 0. The data obtained by counting the documents for each score is stored in the search result score distribution storage area 513. In this case the only score is 0, and the data shows that there were 77 such documents.

[0014] FIG. 5 shows the state of the search interface 21 displaying this search result. The required keywords "electronic" and "publishing" appear in the required keyword display section 2121, the contents of the search result score distribution storage area 513 appear in the hit count display section 213, and the title display section 214 shows a suitable number of retrieved document identification numbers with their titles, one document per line. Document identification numbers and titles not currently shown can be brought into view by moving the displayed portion with the scroll bar. If a displayed title looks worth reading, pointing at it with the mouse or the like displays part of the body text in the document display section 215. The parts of the document not currently shown can likewise be brought into view with the scroll bar.

[0015] Thus 77 documents on "electronic publishing" have been retrieved. As a next step, the user may wish to narrow the search to a particular target, or may simply wish to get an overview of the topics these 77 documents contain. In such cases, pressing the characteristic word display button 217 on the search interface 21 (FIG. 3) starts the characteristic word display means operation routine 44, and the characteristic word display means 22 is displayed on the display screen 2.

[0016] FIG. 6 shows an example of the characteristic word display means 22 in detail. The characteristic word display means 22 comprises an operation section 221, a keyword display/operation section 222, a hit count display section 223, a characteristic word display section 224, and a parameter setting section 225. The keyword display/operation section 222 and the hit count display section 223 are linked to the keyword display/operation section 212 and the hit count display section 213 of the search interface 21, respectively: when an operation on the characteristic word display means 22 changes their contents, the corresponding displays on the search interface 21 change automatically. The reverse does not hold: when an operation on the search interface 21 changes the keywords or the hit counts, the change is not automatically reflected on the characteristic word display means 22. To bring it in, the user presses the reset button 2214 of the operation section 221, which copies the contents of the search interface 21 to the characteristic word display means 22. On the initial screen of the characteristic word display means 22, which appears on the display screen 2 when the characteristic word display button 217 on the search interface 21 is pressed, the keywords and hit counts on the search interface 21 are copied automatically. In the present example, "electronic" and "publishing" are shown in the required keyword display section 2221, and "score 0: 77 documents" is shown in the hit count display section 223.

[0017] When the characteristic word display button 2212 of the operation section 221 is pressed, the characteristic word extraction routine 45 is started. It reads from the data in the search result storage area 512 the document identification numbers with the highest score, analyzes the contents of the corresponding documents, and displays in the characteristic word display section 224 a graph of the words that appear characteristically in them (characteristic words) and of the relations among those words. The process is detailed in the description that follows. FIG. 7 shows, for the "electronic publishing" example, the data stored in the graph storage area 543 (FIG. 2). The graph consists of nodes and links, which are stored in a node storage area 5431 and a link storage area 5432, respectively. Each stored node record consists of the characteristic word (character string) shown at the node, the center coordinates of the position in the characteristic word display section 224 where it is to be shown, the number of characters across and down in the text display area, and the size of the display area. (For ease of use, the size is stored as half-values, that is, the distances from the center to the edges.) Each stored link record consists of the start and end coordinates of a line segment to be drawn on the graph. In the figure, the character strings corresponding to the start and end coordinates in the link storage area 5432 are added for reference, but an actual apparatus does not need this data.

FIG. 8 shows the characteristic word display means 22 after the characteristic word display button 2212 of the operation section 221 has been pressed and the characteristic word graph is displayed. The graph display routine 444 draws in the characteristic word display section 224, according to the data in the graph storage area 543, a graph made up of characteristic words and the links joining them. For example, from the data of FIG. 7, "compact" is displayed centered at coordinates (149, 131), with 3 characters across and 2 lines, in a rectangular display area extending 27 to each side horizontally and 18 above and below vertically. In this embodiment, coordinates are measured from the upper left of the characteristic word display section 224, rightward in the horizontal direction and downward in the vertical direction. Link data is defined by start and end coordinates. The first link record means that the center coordinates of the characteristic words "publishing" and "electronic" are joined, and the second means a line segment from (203, 131) to (308, 40).

When these words are displayed, drawing an opaque white rectangle as a text background in each node's display area, so that line segments are hidden inside it, makes the graph easy to read. On the other hand, where a link line overlaps a node's display area the line would not appear, which can be misleading. For example, in the data of FIG. 7 the line joining "desktop publishing" and "publication" passes through the display area of "NIFTY-Serve"; if "NIFTY-Serve" were given an opaque white rectangle, the line would not appear as a line in that part. As a result, it would look as if "NIFTY-Serve" were linked to "publication" and also to "desktop publishing." In FIG. 8, as a countermeasure, no opaque white background rectangle is used. Instead, each link line is left undrawn only near its own start and end points, so that it does not run into the display areas of the nodes it joins, while in any other display area it is drawn so that it can be seen to pass through. Laying out the graph so that no line is hidden even with opaque white rectangles is very difficult and, especially when many characteristic words are to be graphed, could make display at a legible size impossible.
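One way to realize the node records and the "undrawn near its own endpoints" rule is sketched below. It is a minimal illustration under assumptions: the `Node` fields mirror the center coordinates and half-sizes described above, but the clipping rule (cut each link where it leaves the rectangles of its two end nodes) is a hypothetical choice, since the text does not say exactly how much of the line is suppressed.

```python
from dataclasses import dataclass


@dataclass
class Node:
    word: str
    cx: float      # center x, measured rightward from the upper left
    cy: float      # center y, measured downward from the upper left
    half_w: float  # distance from the center to the left or right edge
    half_h: float  # distance from the center to the top or bottom edge


def exit_fraction(dx, dy, half_w, half_h):
    """Fraction of the vector (dx, dy) that lies inside a rectangle centered at its tail."""
    tx = half_w / abs(dx) if dx else float("inf")
    ty = half_h / abs(dy) if dy else float("inf")
    return min(tx, ty, 1.0)


def visible_link(a, b):
    """Return the drawn part of the link a-b as (start, end), or None if nothing remains."""
    dx, dy = b.cx - a.cx, b.cy - a.cy
    ta = exit_fraction(dx, dy, a.half_w, a.half_h)
    tb = exit_fraction(dx, dy, b.half_w, b.half_h)
    if ta + tb >= 1.0:
        return None  # the two display areas meet along the line
    start = (a.cx + ta * dx, a.cy + ta * dy)
    end = (b.cx - tb * dx, b.cy - tb * dy)
    return start, end
```

Only the two end nodes are consulted, so a link that crosses a third node's display area is still drawn across it, which is the behavior FIG. 8 is described as showing.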

【００１８】The feature word display setting means 2251 of the parameter setting section 225 adjusts the number of words shown in the feature word display section 224; the user moves the setting knob 22511 left or right to set the desired value. The set value is shown in the display section 22512 and stored in the extracted word count storage area 5213 of the feature word extraction parameter storage area 521. This value is used by the feature word extraction routine 441. The following describes the process from the pressing of the feature word display button 2212 of the feature word display means 22 to the creation of graph data such as that shown in FIG. 7. When the feature word display button 2212 is pressed, the feature word extraction routine 441, the co-occurrence relation analysis routine 442, and the graph arrangement routine 443, all stored in the calculation program holding means 4, are started in that order. The feature word extraction routine 441 reads the highest score and the number of documents with that score from the search result score distribution storage area 513 of the search work area 51. In the example of "electronic" and "publishing" shown in FIG. 4, the highest score (S) is 0 points and the number of such documents (K) is 77. The routine also reads the upper limit on the number of scanned documents (M) 5211 from the feature word extraction parameter storage area 521 (here, M = 300). Because analyzing every document takes time when the number of retrieved documents K is large, this parameter causes a sample of M documents to be taken when K exceeds the limit M.

【００１９】Next, the feature word extraction routine 441 refers to the search result storage area 512 and, for every document identification number whose score equals the highest score S, reads the document contents from the search target document database 61, segments them into words using the morphological analysis routine 42, and counts, for every distinct word that appears, the number of documents in which it appears (hereinafter called the document frequency). In this example the number K of documents with the highest score is 77, which does not exceed the scan limit M = 300, so all of the documents are read. If the database holding means has capacity to spare, the results of morphologically analyzing all documents can be stored in advance and simply read in. This removes the need to run morphological analysis on every search and greatly shortens the analysis time. The words and document frequencies obtained in this way are stored in the frequency data storage area 523 in the feature word extraction work area 52. Because the morphological analysis results for these documents are used again later, they are stored in the word-segmented document storage area 522.
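
The counting step described above can be sketched as follows. This is only an illustration: the function name, the already segmented toy documents, and the use of uniform random sampling when K exceeds M are assumptions, since the text says only that M samples are taken and does not describe the morphological analyzer's interface.

```python
from collections import Counter
import random


def document_frequencies(tokenized_docs, max_docs=300, seed=0):
    """Count, for each word, the number of documents in which it appears.

    tokenized_docs: list of word lists (one per retrieved document).
    max_docs: scan limit M; larger result sets are reduced to M documents.
    Returns (Counter of document frequencies, list of documents actually scanned).
    """
    docs = list(tokenized_docs)
    if len(docs) > max_docs:
        # Assumption: uniform random sampling. The text only says M samples are taken.
        docs = random.Random(seed).sample(docs, max_docs)
    freq = Counter()
    for words in docs:
        freq.update(set(words))  # a word counts once per document
    return freq, docs


# Hypothetical, already segmented documents.
docs = [["electronic", "publishing", "ROM", "ROM"],
        ["electronic", "publishing"],
        ["electronic", "publishing", "ROM"]]
freq, scanned = document_frequencies(docs)
```

Returning the scanned documents alongside the counts mirrors the remark that the segmented documents are kept for the later co-occurrence analysis.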

【００２０】FIG. 9 shows part of the data stored in the frequency data storage area 523 for the "electronic publishing" example. The data for each word consists of five items: word, document frequency, overall document frequency, frequency ratio, and frequency class. The document frequency is the number of retrieved documents (77 in this case) in which the word appears. The overall document frequency is the number of documents, among all documents subject to search, in which the word is used, regardless of the keyword search result. This information is stored in the word frequency database 63, from which the frequency of each relevant word is retrieved. The word frequency database 63 is assumed to have been built in advance by scanning all documents subject to search and counting the document frequency of every word that appears. The frequency ratio is the document frequency divided by the overall document frequency. For example, for the first entry, "ROM", the document frequency is 21 and the overall document frequency is 1183, so the frequency ratio is 21 ÷ 1183 ≈ 0.017.

【００２１】Next, the frequency class is described. In general, words characteristic of a document group can be identified by the size of their frequency ratio: the larger the ratio, the more characteristic the word. However, comparing the frequency ratios of two words whose document frequencies differ greatly is risky. A low-frequency word has a low overall frequency, so its frequency ratio is likely to become large by chance. For example, in FIG. 9 the frequency ratio of "desktop publishing" is 0.75, but that does not mean the word is highly characteristic; the ratio is large only because the overall document frequency is just 4 while the document frequency is just 3. Therefore, to avoid comparing words whose document frequencies differ greatly, the document frequencies are divided into classes of suitable width in advance, and the words with the largest frequency ratios within each class are taken as feature words. This makes it possible to extract feature words in a balanced way, from low-frequency words to high-frequency words. One way of determining the frequency classes is as follows. The feature word extraction routine 441 reads the number of frequency classes (C) 5212, a user-set parameter that specifies how many frequency classes to use. Here C = 5 (in general, C is an integer of 1 or more). Let C[i] denote the i-th frequency class; a word belongs to C[i] if its document frequency is at least f[i] and less than f[i+1]. For the largest class, "less than f[i+1]" is replaced by "at most f[i+1]". As one example of how to set the frequency thresholds, let K' be the number of documents concerned and set f[i] = K'^(i/(C+1)). (K' = K when the number of retrieved documents K does not exceed the scan limit M, and K' = M when K > M.) In the present example K' = 77 and C = 5, so f[1] = 77^(1/6) = 2.06, f[2] = 4.25, f[3] = 8.77, f[4] = 18.10, and f[5] = 37.33. Accordingly, class 1 covers document frequencies 3 to 4, class 2 covers 5 to 8, class 3 covers 9 to 18, class 4 covers 19 to 37, and class 5 covers 38 to 77.
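
As a rough check of the thresholds and class ranges above, the following sketch computes f[i] = K'^(i/(C+1)) and assigns a class by counting how many thresholds a document frequency reaches. The function names are illustrative, and returning 0 for frequencies below f[1] is a convention chosen here rather than something the text prescribes.

```python
def frequency_thresholds(k_prime, num_classes):
    """Return [f[1], ..., f[C]] with f[i] = K' ** (i / (C + 1))."""
    return [k_prime ** (i / (num_classes + 1)) for i in range(1, num_classes + 1)]


def frequency_class(doc_freq, thresholds):
    """Class i such that f[i] <= doc_freq < f[i+1]; 0 means below class 1."""
    return sum(1 for f in thresholds if f <= doc_freq)


thresholds = frequency_thresholds(77, 5)   # K' = 77, C = 5 as in the example
rom_class = frequency_class(21, thresholds)        # document frequency 21
top_class = frequency_class(77, thresholds)        # the maximum, K' itself
```

Counting thresholds avoids a separate special case for the largest class, because a document frequency can never exceed K'.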

【００２２】Following this classification, the frequency class of each word is determined from its document frequency. "ROM" has a document frequency of 21 and therefore falls in class 4, and "interactive" has a document frequency of 5 and falls in class 2. Words whose document frequency is below the range of class 1 (here, a document frequency of 2 or less) are excluded from feature word extraction. The frequency class can also be computed directly with the following formula, where 1 is subtracted if the value equals C: (frequency class) = (largest integer not exceeding {log(document frequency) ÷ log K' × (C + 1)}) − 1. Next, the feature word extraction routine reads the number of words to extract (p) 5213 and extracts the words with the highest frequency ratios from each frequency class so that the total equals this number. One way to do this is to divide the number of words to extract p by the number of frequency classes C, let the quotient be n and the remainder r, and take n + 1 words from each class numbered from 1 to r and n words from each class numbered above r.
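
The quotient-and-remainder allocation described above can be illustrated with a few lines of Python. The function name and the dictionary return type are choices made for this sketch only.

```python
def class_quotas(num_words, num_classes):
    """Split p words over C classes: classes 1..r get n + 1 words, the rest get n."""
    n, r = divmod(num_words, num_classes)  # quotient n, remainder r
    return {c: n + 1 if c <= r else n for c in range(1, num_classes + 1)}


even = class_quotas(10, 5)    # p = 10, C = 5: remainder 0
uneven = class_quotas(12, 5)  # p = 12, C = 5: remainder 2
```

With p = 12 the two extra words go to classes 1 and 2, so the quotas always sum to p.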

【００２３】The following uses the example of FIG. 9 with the number of words to extract p set to 10. Since the number of classes C is 5, the quotient n of p ÷ C is 2 and the remainder r is 0. Two words are therefore taken from each of classes 1 to 5. From the data in the frequency data storage area 523, two words are taken from each frequency class in descending order of frequency ratio. According to the data of FIG. 9, the class 5 words in descending order of frequency ratio are "publishing" (0.027), "electronic" (0.015), "media" (0.006), and "information" (0.001). The top two, "publishing" and "electronic", are therefore taken as feature words. In the same way, "ROM" and "compact" are extracted from class 4, "mail" and "publication" from class 3, "interactive" and "Nifty Serve" from class 2, and "desktop publishing" and "publishing" (the loanword パブリッシング) from class 1. These are stored in the feature word list storage area 524.
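
A minimal sketch of the per-class selection follows, using only the class 5 frequency ratios quoted above. The tuple layout and function name are assumptions made for illustration.

```python
def extract_feature_words(entries, quotas):
    """Pick the highest-ratio words in each class.

    entries: (word, frequency_class, frequency_ratio) tuples.
    quotas: dict mapping frequency class -> number of words to take.
    """
    selected = []
    for cls, quota in quotas.items():
        in_class = [e for e in entries if e[1] == cls]
        in_class.sort(key=lambda e: e[2], reverse=True)  # largest ratio first
        selected.extend(word for word, _, _ in in_class[:quota])
    return selected


# Class 5 words and frequency ratios as quoted from FIG. 9.
class5 = [("information", 5, 0.001), ("electronic", 5, 0.015),
          ("media", 5, 0.006), ("publishing", 5, 0.027)]
top = extract_feature_words(class5, {5: 2})
```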

【００２４】FIG. 10 shows an example of the data stored in the feature word list storage area 524: the feature words extracted by the above process and their document frequencies. The figure also shows the frequency classes for reference, but these may be omitted. This completes the feature word extraction routine 441. Next, the co-occurrence relation analysis routine 442 analyzes the co-occurrence relations among the feature words and stores the results in the co-occurrence data storage area 531.

【００２５】The co-occurrence data storage area 531 is a two-dimensional array whose rows and columns are both indexed by the set of feature words stored in the feature word list storage area 524. Each element holds the number of documents in which the corresponding pair of words appears together. The co-occurrence relation analysis routine 442 reads the word-segmented retrieved documents from the word-segmented document storage area 522 and, for each document, increments the corresponding element of the co-occurrence data storage area 531 for every pair of feature words that co-occur in that document.

【００２６】Next, the co-occurrence relation analysis routine 442 computes the co-occurrence strength of each pair of feature words. The co-occurrence strength is the co-occurrence frequency counted in the step above divided by the document frequency of the second word of the pair (the word on the column side of the table). The document frequencies used are the values stored in the feature word list storage area 524 (FIG. 10). FIG. 11 shows the data stored in the co-occurrence data storage area 531 at this stage. Each cell holds two numbers: the upper one is the co-occurrence frequency of the corresponding word pair, and the lower one is its co-occurrence strength (co-occurrence frequency ÷ document frequency of the column-side word). For example, the upper value 6 in row 6, column 3 means that the feature word "publication" of row 6 and the feature word "ROM" of column 3 co-occurred in 6 documents. Since the document frequency of the column-side word "ROM" is 21, the co-occurrence strength in the lower part of the cell is 6 ÷ 21 ≈ 0.29. In the co-occurrence data storage area 531 the feature words are arranged in descending order of document frequency. Only the half of the table below the diagonal is used in later steps, so the rest is omitted.
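
The counting and normalization described in the two paragraphs above might look like the sketch below. It stores only the cells below the diagonal in a dictionary keyed by (row, column) indices; that data layout, the function name, and the toy documents are assumptions for illustration.

```python
from itertools import combinations


def cooccurrence(tokenized_docs, feature_words, doc_freq):
    """Count co-occurrences and strengths for pairs of feature words.

    feature_words must be ordered by descending document frequency.
    Returns (count, strength), both keyed by (row, col) with row > col.
    """
    index = {w: i for i, w in enumerate(feature_words)}
    count = {}
    for words in tokenized_docs:
        present = sorted({index[w] for w in words if w in index})
        for col, row in combinations(present, 2):  # col < row by construction
            count[(row, col)] = count.get((row, col), 0) + 1
    # Strength = co-occurrence frequency / document frequency of the column-side word.
    strength = {(row, col): c / doc_freq[feature_words[col]]
                for (row, col), c in count.items()}
    return count, strength


# Hypothetical toy data, not taken from the figures.
docs = [["electronic", "publishing", "ROM"], ["electronic", "publishing"],
        ["electronic", "publishing", "ROM"], ["electronic", "publishing"]]
words = ["electronic", "publishing", "ROM"]
count, strength = cooccurrence(docs, words,
                               {"electronic": 4, "publishing": 4, "ROM": 2})
```

The worked value in the text follows the same rule: 6 co-occurrences divided by the document frequency 21 of "ROM" gives about 0.29.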

【００２７】Next, the co-occurrence relation analysis routine 442 extracts from this co-occurrence data the word pairs with a high degree of co-occurrence (the pairs to be linked in the feature word graph). In this embodiment, a link indicating the relation between feature words is made from each word to the word that, among the words with a higher document frequency than it, has the largest co-occurrence strength with it. The co-occurrence relation analysis routine 442 collects the word pairs to be linked according to this criterion and stores them in the co-occurrence link storage area 532. Another promising approach is to also link the words with the second or third largest co-occurrence strength when their strength is not much smaller than the largest (for example, at least 0.9 times the largest). FIG. 12 shows the contents of the co-occurrence link storage area 532 at this stage. The process by which these links were extracted is explained using the example of FIG. 11. For the second word in FIG. 12, "publishing", the only word whose document frequency is at least that of "publishing" is "electronic", so a link is made from "publishing" to "electronic". For the third word, "ROM", the words with a higher frequency are "publishing" and "electronic", and the co-occurrence strength with each of them is 0.27. In this case the link is made to "publishing", the one with the smaller number in the co-occurrence data storage area 531. For the fourth word, "compact", the co-occurrence strength with the third word, "ROM", is the largest at 0.81, so a link is made from "compact" to "ROM". Continuing in the same way yields link data such as that in FIG. 12.
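
The link rule can be sketched as follows, with words identified by their rank in descending document frequency (0 = highest). The strengths 0.27 for "ROM" and 0.81 for "compact" and "ROM" come from the text; the other values are placeholders. The tie-breaking rule is an assumption: this sketch prefers the later-listed candidate, which reproduces the "ROM" to "publishing" link under the ordering electronic, publishing, ROM, compact.

```python
def select_links(strength, num_words):
    """For each word i >= 1, link to the higher-ranked word with the largest strength.

    strength: dict keyed by (i, j) with i > j; missing pairs count as 0.
    Ties go to the later-listed candidate (an assumption of this sketch).
    """
    links = {}
    for i in range(1, num_words):
        links[i] = max(range(i), key=lambda j: (strength.get((i, j), 0.0), j))
    return links


words = ["electronic", "publishing", "ROM", "compact"]
strength = {
    (1, 0): 1.0,                 # placeholder value
    (2, 0): 0.27, (2, 1): 0.27,  # "ROM" vs "electronic" and "publishing" (from the text)
    (3, 0): 0.27, (3, 1): 0.27,  # placeholder values
    (3, 2): 0.81,                # "compact" vs "ROM" (from the text)
}
links = select_links(strength, len(words))
named = {words[i]: words[j] for i, j in links.items()}
```

The variant that also links the second or third strongest candidates when they reach 0.9 times the maximum is not modeled here.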

【００２８】This completes the co-occurrence relation analysis routine 442, and the graph arrangement routine 443 is started next. Based on the data in the feature word list storage area 524 (FIG. 10) and the data in the co-occurrence link storage area 532 (FIG. 12), it actually arranges the graph, whose nodes are the feature words, on a two-dimensional plane. FIG. 13 shows the details of the graph arrangement routine 443. The graph arrangement routine 443 consists of the y-coordinate calculation routine 4431, the x-coordinate calculation routine 4432, the display coordinate conversion routine 4433, the overlap avoidance routine 4434, and the link arrangement routine 4435, which are started in that order. The y-coordinate calculation routine 4431 and the x-coordinate calculation routine 4432 compute the coordinates at which each node should be placed, assuming that the display area is the square [−1, 1] × [−1, 1]. These are called normalized coordinates. The computed coordinate data is stored in the normalized coordinate storage area 541.

【００２９】First, the y-coordinate calculation routine 4431 is started and computes, from the document frequency f of each feature word, the normalized y-coordinate at which the word should be displayed, using the formula y = (6/π) × arctan(0.2 × log(f/fm)). That is, the higher a word's document frequency, the higher it is placed along the y-axis. Here fm is the frequency of the feature word that falls exactly in the middle when the feature words are ordered by document frequency (when the number of words is even, the (number ÷ 2 + 1)-th word is used). In this embodiment, "electronic" and "publishing", with a document frequency of 77, are at the top, and "publication", with a document frequency of 9, is at the middle position. π is the ratio of a circle's circumference to its diameter, log is the natural logarithm, and arctan is the inverse of the tangent function, with angles in radians. For example, the frequency of "compact" is 21, so its normalized y-coordinate is (6/π) × arctan(0.2 × log(21 ÷ 9)) ≈ 0.32. The normalized y-coordinates of the other feature words are computed in the same way. Next, the x-coordinate calculation routine 4432 is started and computes the normalized x-coordinate of each feature word's display position. FIG. 14 shows the details of the x-coordinate calculation routine 4432. First, step 44321 collects the nodes that have no parent node (no link destination). In this case only "electronic" qualifies. Its x-coordinate is therefore computed by substituting i = 1 into the formula of step 44321, xi = −1 + 2i/(r + 1), giving −1 + (2 × 1)/(1 + 1) = 0.
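
The two formulas quoted above are easy to check numerically. The sketch below is illustrative only: the function names are not from the patent, and it covers just the y-coordinate formula, the choice of the middle frequency fm, and the x-coordinate of parentless nodes from step 44321.

```python
import math


def median_frequency(doc_freqs):
    """fm: the middle value in descending order; for an even count, the (n/2 + 1)-th."""
    ordered = sorted(doc_freqs, reverse=True)
    return ordered[len(ordered) // 2]


def normalized_y(f, f_m):
    """y = (6 / pi) * arctan(0.2 * ln(f / fm))."""
    return (6 / math.pi) * math.atan(0.2 * math.log(f / f_m))


def root_x(i, r):
    """x-coordinate of the i-th of r parentless nodes (i = 1..r)."""
    return -1 + 2 * i / (r + 1)


y_compact = normalized_y(21, 9)  # document frequency 21, fm = 9
y_top = normalized_y(77, 9)      # document frequency 77
x_root = root_x(1, 1)            # single parentless node
```

Because arctan is bounded by ±π/2, the factor 6/π keeps y within (−3, 3) in principle; with the damping factor 0.2, the example's largest value works out to about 0.774.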

【００３０】Next, loop 44322 is entered. Step 44323 takes one node that is linked to a node whose x-coordinate has been determined (in this case, only "electronic"). The co-occurrence link data (FIG. 12) shows that "publishing" satisfies this condition. Step 44324 then obtains the set of parent nodes of the node selected in step 44323 and computes the average of their x-coordinates. The set of parent nodes of "publishing" is {"electronic"}, and the average of its x-coordinates is 0. Next, step 44325 collects the nodes whose set of parent nodes equals {"electronic"}. Here that is only "publishing".

【００３１】Branch step 44326 is entered next. Because the average x-coordinate of the parent nodes is 0, step 44327 is selected and the x-coordinate of "publishing" is computed. Substituting s = 1, xp = 0, and i = 1 into the formula of step 44327 gives an x-coordinate of 0 for "publishing". The normalized x-coordinates of "electronic" and "publishing" are now determined. Since the x-coordinates of some nodes remain undetermined, loop 44322 is repeated. In step 44323, one node is selected from among the nodes whose x-coordinate is not yet determined and whose links go only to "electronic" or "publishing". Here "ROM" satisfies this condition. Step 44324 obtains the set of link destinations of "ROM", which is {"publishing"}, and computes the average x-coordinate xp of the parent nodes {"publishing"} as 0.

【００３２】Step 44325 collects the nodes whose set of link destinations equals {"publishing"}. Besides "ROM", "mail" qualifies.

【００３３】Because the average x-coordinate xp of the parent nodes is 0, the upper branch is selected at branch 44326, and step 44327 computes the x-coordinates of "ROM" and "mail" as −0.33 and 0.33 respectively, dividing [−1, 1] into three equal parts. In the same way, for nodes whose links go only to nodes with already determined x-coordinates, those sharing the same link destinations are collected and their x-coordinates are set so that they are spaced evenly, centered on the average x-coordinate of their parents and kept within the interval [−1, 1].

【００３４】FIG. 15 shows the coordinate data stored in the normalized coordinate storage area 541 at this stage for the "electronic publishing" example. Next, the graph arrangement routine 443 starts the display coordinate conversion routine 4433, which converts the coordinates normalized to the [−1, 1] × [−1, 1] region into coordinates representing actual positions in the feature word display section 224 and stores them in the center coordinate column of the node storage area 5431 (FIG. 16). The conversion uses the following linear expressions: X = R_x × (1 + x) + O_x, Y = R_y × (ym − y) + O_y. Here lowercase x and y are the normalized coordinates and uppercase X and Y are the coordinates in the feature word display section 224. ym is the maximum value of y; in the example of FIG. 15, ym = 0.774. The coefficients R_x, R_y, O_x, and O_y take the values stored in the corresponding areas of the graph arrangement parameter storage area 542 (FIG. 2). In this example R_x = 200, R_y = 200, O_x = 60, and O_y = 40. For "compact", for example, the normalized coordinates are (−0.555, 0.320), so the conversion gives X = 200 × (1 − 0.555) + 60 = 149 and Y = 200 × (0.774 − 0.320) + 40 ≈ 131. In this way the actual coordinates of all nodes in the feature word display section 224 are computed and stored in the node storage area 5431 (FIG. 16). In preparation for the next step, the words are arranged in ascending order of x-coordinate. In addition, the size of each character display area is computed, namely the number of characters per line h, the number of lines v, the horizontal size H, and the vertical size V, and stored in the node storage area 5431.
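
The linear conversion can be reproduced directly. In the sketch below the default parameter values are those given for this example, while the function name and signature are illustrative.

```python
def to_display(x, y, y_max, r_x=200, r_y=200, o_x=60, o_y=40):
    """Map normalized (x, y) to display coordinates.

    The display origin is the upper left corner and Y grows downward,
    so the largest normalized y maps to the smallest Y.
    """
    return r_x * (1 + x) + o_x, r_y * (y_max - y) + o_y


# "compact": normalized coordinates (-0.555, 0.320), with ym = 0.774.
X, Y = to_display(-0.555, 0.320, y_max=0.774)
```

Subtracting y from ym flips the vertical axis, which is why the word with the highest document frequency ends up at Y = O_y, the top margin.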

【００３５】The size of the character display area is computed as follows. Text is written horizontally, and the horizontal limit is W characters. The value of W is taken from the upper limit on the number of characters per line of the character display section, 5426. Here W = 3. Let M be the number of characters to display. If M ≤ W, the number of characters per line h is M and the number of lines v is 1. If M > W, h is W and v is the smallest integer not less than M ÷ W. For example, "electronic" (電子) has 2 characters, which is less than the width limit W = 3, so the number of lines v is 1 and the width h is 2. "Interactive" (インタラクティブ) has 8 characters, which exceeds the width limit W = 3, so the number of lines v is the smallest integer not less than 8/3, namely 3, and the width h is W = 3. The value H, half the horizontal size of the character display area, and the value V, half its vertical size, are computed from h and v with the following formulas; half values are used because the later processing mainly works with them: H = h × F/2 + m_x, V = v × F/2 + m_y. Here F is the font size, m_x is the margin in the x direction, and m_y is the margin in the y direction. m_x and m_y represent the minimum spacing to be kept so that two nodes do not come too close together. F, m_x, and m_y take the values stored in the character size 5425, the horizontal margin 5427 of the character display section, and the vertical margin 5428 of the character display section, respectively (FIG. 2). In this example F = 16, m_x = 3, and m_y = 2. For "compact" (コンパクト), for example, h = 3 and v = 2, so H = 3 × 16/2 + 3 = 27 and V = 2 × 16/2 + 2 = 18. The character counts and display area sizes given as the character display size in the node storage area 5431 of FIG. 16 were computed in this way.
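
A short sketch of the size calculation follows, using the Japanese feature words because the character counts in the text refer to them. The function name and default arguments are illustrative; the defaults are the example's values W = 3, F = 16, m_x = 3, m_y = 2.

```python
def text_box(num_chars, max_width=3, font=16, margin_x=3, margin_y=2):
    """Return (h, v, H, V): characters per line, line count, half-width, half-height."""
    h = min(num_chars, max_width)
    v = -(-num_chars // max_width)  # ceiling of num_chars / max_width
    return h, v, h * font / 2 + margin_x, v * font / 2 + margin_y


compact = text_box(len("コンパクト"))            # 5 characters
interactive = text_box(len("インタラクティブ"))  # 8 characters
```

The half-height 26 obtained for the three-line word is the value V[2] = 26 used later in the overlap check.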

【００３６】The coordinates in the feature word display section have now been obtained, but at this stage nodes may overlap. In the example of FIG. 16, for instance, "electronic" and "publishing" have the same coordinates and would overlap. The overlap avoidance routine 4434 is therefore started and shifts the coordinates so that no overlap occurs.

【００３７】[0037]

FIG. 17 shows the details of the overlap avoidance routine 4434. Let N[1], ..., N[r] be all nodes sorted in ascending order of x coordinate. Let the coordinates of N[i] be (X[i], Y[i]) and its character display area size be (H[i], V[i]). The following operation is performed for i = 2, ..., r. Among j = 1, ..., i − 1, consider those j for which |Y[j] − Y[i]| < V[i] + V[j], and let ξ be the maximum value of X[j] + H[j] over those j. If there is no such j, no coordinate shift is needed for this i. Let δ = ξ − (X[i] − H[i]). If δ ≤ 0, no coordinate shift is needed for this i. If δ > 0, an overlap would occur, so the x coordinates of N[i], ..., N[r] are all shifted to the right by δ. That is, X[k] = X[k] + δ for k = i, ..., r.
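
The routine described above can be sketched as follows. This is a minimal sketch, not the patented implementation: the `Node` class and the function name `avoid_overlap` are hypothetical, and the x coordinate and H value used for "interactive" in the sample data are placeholders, since the text quotes only its y coordinate and V value.

```python
from dataclasses import dataclass


@dataclass
class Node:
    word: str
    x: int  # X[i]
    y: int  # Y[i]
    h: int  # H[i]
    v: int  # V[i]


def avoid_overlap(nodes):
    """Shift nodes to the right so that label areas no longer overlap."""
    ordered = sorted(nodes, key=lambda n: n.x)  # N[1], ..., N[r]
    for i in range(1, len(ordered)):
        cur = ordered[i]
        # Right edges X[j] + H[j] of earlier nodes whose rows collide with cur.
        right_edges = [p.x + p.h for p in ordered[:i]
                       if abs(p.y - cur.y) < p.v + cur.v]
        if not right_edges:
            continue  # no such j: nothing to do for this i
        delta = max(right_edges) - (cur.x - cur.h)
        if delta > 0:
            for n in ordered[i:]:  # shift N[i], ..., N[r] by delta
                n.x += delta
    return ordered


sample = [
    Node("compact", 149, 131, 27, 18),
    Node("interactive", 170, 240, 27, 26),  # x and h are placeholders
    Node("ROM", 193, 131, 27, 10),
]
laid_out = avoid_overlap(sample)
print([(n.word, n.x) for n in laid_out])
```

Shifting every node from N[i] onward by the same δ keeps the list sorted by x, which is why a single left-to-right pass is enough in this sketch. Note that the function updates the `Node` objects in place.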

【００３８】[0038]

As described above, coordinates are obtained with which all nodes can be displayed without overlapping. Consider, for example, i = 2, "interactive". From the data of FIG. 16, |Y[2] − Y[1]| = |240 − 131| = 109 and V[2] + V[1] = 26 + 18 = 44, so |Y[2] − Y[1]| < V[2] + V[1] does not hold. Therefore "interactive" is not shifted sideways. Next consider i = 3, "ROM". For j = 1, |Y[3] − Y[1]| = |131 − 131| = 0 while V[3] + V[1] = 10 + 18 = 28, so |Y[3] − Y[1]| < V[1] + V[3] holds. That is, "ROM" would overlap with "compact" (j = 1). For j = 2, "interactive", |Y[3] − Y[2]| = |131 − 240| = 109 and V[3] + V[2] = 10 + 26 = 36, so |Y[2] − Y[3]| < V[2] + V[3] does not hold and there is no risk of overlapping with "interactive". Therefore the x coordinate needs to be considered only for j = 1. ξ = X[1] + H[1] = 149 + 27 = 176, and the shift width is δ = ξ − (X[i] − H[i]) = 176 − (193 − 27) = 10. Therefore X[j] is increased by 10 for all j = 3, ..., 10. This gives (X[3], Y[3]) = (203, 131), the coordinates of "ROM" in FIG. 7. Repeating this step yields the same data as in the node storage area 5431 of FIG. 7. This overlap avoidance operation for character display areas cannot check the overlap, mentioned earlier, between a character display area and a line of the graph. As a practical matter, strictly avoiding such overlaps within a limited display area could make it impossible to display the graph at a suitable size, so the embodiment does not check for them.

【００３９】[0039]

Finally, the graph arrangement routine 443 starts the link arrangement routine 4435. From the information on the word pairs to be joined by co-occurrence links, stored in the co-occurrence link storage area 532 of the co-occurrence relation analysis work area 53, and the coordinate data of each node stored in the node data storage area 5431, the link arrangement routine 4435 creates the data of the line segments to be displayed on the characteristic word display section 224, that is, the coordinates of the start point and end point of each segment, and stores them in the link data storage area 5432. For example, the co-occurrence link storage area 532 in FIG. 12 contains a link from "ROM" to "publishing". From the data stored in the node data storage area 5431 of FIG. 7, the coordinates of "ROM" are (203, 131) and those of "publishing" are (308, 40), so the data of a line segment starting at (203, 131) and ending at (308, 40) is stored in the link data storage area 5432. The data of the graph to be displayed (FIG. 7) has thus been created. The following shows examples of how a search can be advanced by referring to the graph of characteristic words displayed on the characteristic word display section 224 of the characteristic word display means 22.
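
As an illustration of the link arrangement step, the sketch below turns word pairs and node coordinates into line segments. The function name `place_links` and the dictionary-based data layout are assumptions made for this example; the coordinates are the "ROM" and "publishing" values quoted in the text.

```python
def place_links(cooccurrence_links, node_coords):
    """Return one (start_point, end_point) pair per co-occurrence link."""
    segments = []
    for source_word, target_word in cooccurrence_links:
        start = node_coords[source_word]
        end = node_coords[target_word]
        segments.append((start, end))
    return segments


node_coords = {"ROM": (203, 131), "publishing": (308, 40)}
links = [("ROM", "publishing")]
segments = place_links(links, node_coords)
print(segments)
```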

【００４０】[0040]

FIG. 8 is an example of a characteristic word display for "electronic publishing". Suppose the user is interested in "desktop publishing", one of the displayed words. In this case, the user points at that word on the screen with the mouse 12 or the like and then presses the move-to-additional-keyword button 22222. "Desktop publishing" is then stored in the additional keyword storage area 5112 and displayed on the additional keyword display section 2122 of the search interface 21 and on the additional keyword display section 2222 of the characteristic word display means 22. When the search execution button 216 of the search interface 21 or the search execution button 2211 of the characteristic word display means 22 is then pressed, a search is executed with "desktop publishing" added to the additional keywords, and the search can be narrowed down. If no interesting word is found among the characteristic words displayed on the characteristic word display section 224 of FIG. 8, the number of displayed words can be increased using the characteristic word display number setting means 2251. FIG. 18 is an example in which the number of displayed characteristic words is increased to 20. In this case, with the example data of FIG. 9, the characteristic word extraction routine 441 selects 20 words from this data, and they are displayed in the same way as described for FIG. 8. Suppose now that the user is interested in "information search" within "electronic publishing". The displayed graph contains the words "search" and "information search", so these can be used. Clicking "search" and "information search" on the characteristic word display section with the mouse or the like and then pressing the move-to-additional-keyword button 22222 adds these words as additional keywords. Pressing the search execution button 2211 then narrows the search. To view the graph of characteristic words after narrowing the search, the user presses the characteristic word display button 2212. To run the search and the characteristic word graph display in succession, the user presses the search execution + characteristic word display button 2213, which performs the above steps consecutively.

【００４１】[0041]

Next, if the user is not interested in "information search", or has already read the documents on "information search" and wants to focus on other topics, deducted keywords are used. If "search" and "information search" have already been added to the additional keywords, the user points at these words on the additional keyword display section 2222 with the mouse or the like and then presses the move-to-deducted-keyword button 22232, which moves them from the additional keywords to the deducted keywords. To use a word displayed on the characteristic word display section 224 directly as a deducted keyword, the user clicks the word with the mouse or the like and then presses the move-to-deducted-keyword button 22232, just as for additional keywords. In other words, in this embodiment the move button performs a move operation between search keyword types, and a copy operation between the displayed characteristic words and the keywords.
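
The difference between the move and copy operations can be sketched as below. The class `KeywordSets` and its method names are hypothetical and chosen only for illustration; the three keyword types follow the essential, additional, and deducted keywords of the embodiment.

```python
class KeywordSets:
    """Holds the three keyword lists manipulated by the move buttons."""

    def __init__(self):
        self.groups = {"essential": [], "additional": [], "deducted": []}

    def copy_from_feature_words(self, word, target):
        # A displayed characteristic word is copied: the display keeps it.
        if word not in self.groups[target]:
            self.groups[target].append(word)

    def move(self, word, source, target):
        # Between keyword types the word is moved: it leaves the source list.
        self.groups[source].remove(word)
        if word not in self.groups[target]:
            self.groups[target].append(word)


displayed_words = ["search", "information search", "desktop publishing"]
keywords = KeywordSets()
for word in ("search", "information search"):
    keywords.copy_from_feature_words(word, "additional")
for word in ("search", "information search"):
    keywords.move(word, "additional", "deducted")
print(keywords.groups)
```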

【００４２】[0042]

When "search" and "information search" are moved to the deducted keywords and the search is executed, the scores of documents containing these words fall and the scores of documents not containing them rise in relative terms, so the user can focus on those documents about "electronic publishing" that are unrelated to "information search". FIG. 19 is an example of the search interface 21 provided with a characteristic word display style selection means 2171, which lets the user choose between displaying the characteristic words as a graph and displaying them as a list. Because the list display shows many characteristic words, it cannot show the relationships among them, so the result cannot be evaluated in terms of those relationships, which is a drawback compared with the graph display. On the other hand, with a scroll bar the user can look through the many characteristic words that appear in the search result, which makes it more likely that the user will find a related word matching his or her interest.
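
The effect of deducted keywords on document scores, described at the start of this paragraph, can be sketched as follows. The scoring rule follows the outline in claims 6 and 7 (AND over essential keywords, OR over additional keywords when no essential keyword is set, points added or deducted per matching keyword). The unit weights of +1 and −1, the function name `score_documents`, and the two sample documents are assumptions made here.

```python
def score_documents(documents, essential, additional, deducted):
    """Return (doc_id, score) pairs sorted by descending score."""
    results = []
    for doc_id, words in documents.items():
        if essential:
            # AND condition over the essential keywords.
            if not all(k in words for k in essential):
                continue
        elif not any(k in words for k in additional):
            # No essential keyword: OR condition over additional keywords.
            continue
        score = (sum(1 for k in additional if k in words)
                 - sum(1 for k in deducted if k in words))
        results.append((doc_id, score))
    results.sort(key=lambda item: item[1], reverse=True)
    return results


documents = {
    "doc1": {"electronic publishing", "information search"},
    "doc2": {"electronic publishing", "desktop publishing"},
}
ranking = score_documents(
    documents,
    essential=["electronic publishing"],
    additional=[],
    deducted=["search", "information search"],
)
print(ranking)
```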

【００４３】[0043]

Therefore, using the characteristic word display style selection means 2171 shown in FIG. 19, a two-stage evaluation of the result is possible: the user first displays the search result as a graph, surveys the overall picture of the characteristic words including their mutual relationships, and evaluates the result; if related words matching the user's interest do not appear sufficiently, the user then looks in more detail using the list display. Furthermore, when a word of interest is found in the list display, it can be used as a keyword to start the search over. When "graph" is selected with the characteristic word display style selection means 2171 of FIG. 19, the characteristic words are displayed as a graph, as described for FIG. 8 or FIG. 18. When "list" is selected, as shown in FIG. 19, the characteristic words are displayed as a list on the characteristic word display section 224, an example of which is shown in FIG. 20. Even when "list" is selected with the characteristic word display style selection means 2171, the method of extracting characteristic words from the retrieved document group is the same as for the graph display described above. For the list display, however, about three classes (high, medium, and low) are considered easier to read than the five frequency classes shown in FIG. 9, so the number of frequency classes is set to three in the display example of FIG. 20. In FIG. 20, in response to the selection of "list", the characteristic word display section 224 is provided with a high-frequency characteristic word display section 2241, a medium-frequency characteristic word display section 2242, and a low-frequency characteristic word display section 2243, each a display frame with a scroll bar, and the characteristic words are displayed in the frame corresponding to their frequency class data in the frequency data storage area 523. Within each frame the words are preferably ordered, for example, by descending frequency ratio. This allows the user to look through characteristic words ranging from highly general ones to highly specific ones such as proper names, and to find words matching his or her interest from a wide range of choices.
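
A minimal sketch of how the three list frames could be filled is shown below. The function name `build_list_view`, the tuple layout, and the sample words with their class labels and frequency ratios are all hypothetical; only the three-class split and the ordering by descending frequency ratio come from the text.

```python
def build_list_view(feature_words):
    """Group (word, frequency_class, frequency_ratio) tuples into frames."""
    frames = {"high": [], "medium": [], "low": []}
    for word, frequency_class, ratio in feature_words:
        frames[frequency_class].append((word, ratio))
    view = {}
    for frequency_class, entries in frames.items():
        # Within each frame, larger frequency ratios come first.
        entries.sort(key=lambda entry: entry[1], reverse=True)
        view[frequency_class] = [word for word, _ in entries]
    return view


sample_words = [
    ("CD-ROM", "medium", 12.0),
    ("publishing", "high", 4.0),
    ("electronic", "high", 6.5),
    ("DTP", "low", 30.0),
]
list_view = build_list_view(sample_words)
print(list_view)
```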

【００４４】[0044]

Embodiment 2. A second embodiment of the present invention is described below with reference to FIG. 21. Whereas the first embodiment is a configuration example of a search device on a stand-alone computer, this embodiment realizes a search method that can respond to search requests from a plurality of users. FIG. 21 shows the overall configuration of this other embodiment of the document search method. In this embodiment, a plurality of clients access one server via signal transmission lines, and each client can receive the search service. In general, a server is not itself also used as a client. In this embodiment, however, considering that the server may also need to be used as a client in response to problems reported by clients, the server is a search device with substantially the same configuration as described in Embodiment 1 plus the communication means 7. The client comprises the means, shown with a dash attached, corresponding to the input means 1, the display means 2, the CPU 3, the calculation program holding means 4, the work area 5 for running the calculation program, and the bus 100 of the configuration described in Embodiment 1, together with the communication means 7 for cooperating with the server and a printer 81 as the output means 8. An interface IF1 is provided on the bus 100 of the server, and interfaces IF2 and IF3 are provided on the buses 100 of the clients, and the server and the clients are connected by the lines NET1 and NET2. For client 2, only the bus 100 and the interface IF2 are shown, and the rest is omitted.

【００４５】[0045]

When client 1 wants to search for documents, a command to start the document search system is first entered from the keyboard 11 of the input means 1. In response, the client-side communication means 7 and the server-side communication means 7 make contact over the communication path NET1, and the search interface operation routine 41 in the server-side calculation program holding means 4 is sent to client 1 and started there. As a result, the search interface 21 for carrying out the search interactively is displayed on the display means 2. Once the search interface 21 is displayed, client 1 uses this screen to enter search key words by the same procedure as described in Embodiment 1. Alternatively, the client may keep a copy of the search interface operation routine 41 in its calculation program holding means 4 and start that copy. It is also convenient to make this search support service available through a hypertext browsing interface such as a WWW browser. In that case, the server prepares a hypertext (HT) for sending the search interface operation routine 41 to the client. This assumes that the client has an environment in which a general-purpose hypertext browsing interface can be used.

【００４６】[0046]

When the address designated by the search support service (that is, the network address of the server and, for example, the name of the file containing the hypertext HT for sending the search interface operation routine 41) is specified in the address input section of the hypertext browsing interface displayed on the display means 2, the specified hypertext HT is sent to the client together with the search interface operation routine 41 via the communication means on both sides. The search interface operation routine 41 is then started on the client computer, and the search interface 21 is displayed on the display means 2 and becomes usable. Although the address of the hypertext HT is specified directly above, if the address of the hypertext HT is embedded as an anchor in a hypertext displayed in the browsing section of the hypertext browsing interface, the same operation can be achieved by clicking that anchor with the mouse or the like.

【００４７】[0047]

A search request entered by client 1 is transmitted to the server via the communication means 7, 7 and the communication path NET1. The server executes the necessary search, characteristic word extraction, and graph arrangement calculation, and the result is returned to client 1 through the communication means 7, 7 and handed to the search interface operation routine 41 of client 1, which displays the characteristic word graph on the characteristic word display means 22 based on that data. Depending on the search result, if a further search operation is needed, client 1 enters the corresponding data in the same way as described in Embodiment 1. This data is again transmitted to the server, the necessary search is executed there, and the result is displayed on the characteristic word display means 22. If necessary, client 1 can use output printed by the printer 81. In this way client 1 can use just the results of processing executed on the server without having the substantive search program itself. The work area 5 of client 1 therefore only needs to be able to hold the initial input data and the search results, characteristic words, and graph arrangement data transmitted from the server, so a full search service can be received with a simple device.
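
The division of work between client and server can be sketched as below. Everything in this sketch is an assumption made for illustration: the patent does not specify a message format, so JSON, the field names, the function names, and the stub `fake_search` (which stands in for the server-side search, characteristic word extraction, and graph arrangement) are all hypothetical.

```python
import json


def server_handle(request_json, search_fn):
    """Server side: decode the request, run the search, encode the result."""
    request = json.loads(request_json)
    result = search_fn(request["keywords"])
    return json.dumps(result)


def client_search(keywords, send):
    """Client side: send keywords and keep only the returned display data."""
    reply_json = send(json.dumps({"keywords": keywords}))
    return json.loads(reply_json)


def fake_search(keywords):
    # Placeholder result; a real server would compute these values.
    return {
        "hits": 2,
        "nodes": [{"word": "ROM", "x": 203, "y": 131}],
        "links": [],
    }


reply = client_search(
    ["electronic publishing"],
    send=lambda message: server_handle(message, fake_search),
)
print(reply["nodes"])
```

In this sketch the client never calls the search function directly, which mirrors the point made in the text that the client needs only enough work area to hold its input and the returned results.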

【００４８】[0048]

[Effects of the Invention]

As described above for the two types of embodiment, according to the present invention the user can look through characteristic words ranging from highly general ones to highly specific ones such as proper names, and can find words matching his or her interest from a wide range of choices.

[Brief description of the drawings]

FIG. 1 is a block diagram showing a configuration example of a search device on a stand-alone computer as an embodiment of the present invention.

FIG. 2 is a diagram showing an example of the allocation of data in the work area.

FIG. 3 is a diagram showing an example of the search interface display screen between the user and the computer.

FIG. 4 is a diagram showing an example of data stored in the search work area when a search is executed.

FIG. 5 is a diagram showing an example in which the search interface display screen of FIG. 3 displays the search results after a search has been executed.

FIG. 6 is a diagram showing an example of the display screen when the characteristic word display means, through which the user supplies characteristic words as search keys, is started.

FIG. 7 is a diagram showing an example of data stored in the characteristic word graph storage area when the user requests a characteristic word display.

FIG. 8 is a diagram showing an example of a graph display of characteristic words in a retrieved document group.

FIG. 9 is a diagram showing an example of word frequency data in a retrieved document group.

FIG. 10 is a diagram showing an example of a characteristic word list in a retrieved document group.

FIG. 11 is a diagram showing an example of data representing the co-occurrence relationships between characteristic words in a retrieved document group.

FIG. 12 is a diagram showing an example of a list of characteristic word pairs having a particularly strong co-occurrence relationship in a retrieved document group.

FIG. 13 is a PAD (Problem Analysis Diagram) showing an example of the structure of the calculation routine that computes the graph arrangement of characteristic words.

FIG. 14 is a PAD showing an example of the x coordinate calculation method in the graph arrangement.

FIG. 15 is a diagram showing an example of coordinate data when the display data is virtually placed in a normalized region for the graph display of search results.

FIG. 16 is a diagram showing an example of graph coordinates before overlap avoidance is applied to the display data in the graph display of search results.

FIG. 17 is a PAD showing an example of the details of the routine for avoiding overlap between display nodes of the graph.

FIG. 18 is a diagram showing an example of a graph display of characteristic words when the number of displayed characteristic words is set to 20.

FIG. 19 is a diagram showing an example of a search interface display screen provided with the characteristic word display style selection means.

FIG. 20 is a diagram showing an example of a display screen for the list display of characteristic words.

FIG. 21 is a block diagram showing a configuration example in which the main part of the search device is provided on the server side and a plurality of clients access it to perform searches.

[Description of reference numerals]

1, 1': input means; 11, 11': keyboard; 12, 12': mouse; 13, 13': pen input means; 2, 2': display means; 21, 21': search interface; 7, 7': communication means; 8: output means; 81: printer; IF1, IF2, IF3: interfaces; NET1, NET2: lines; 211: search request input section; 212: keyword display and operation section; 2121: essential keyword display section; 21211: add-to-essential-keywords button; 21212: essential keyword erase button; 2122: additional keyword display section; 2123: deducted keyword display section; 213: search hit count display section; 214: title display section; 215: document display section; 216: search execution button; 216: characteristic word display button; 2171: characteristic word display style selection means; 22: characteristic word display means; 221: operation section of the characteristic word display means; 222: keyword display and operation section of the characteristic word display means; 223: search hit count display section of the characteristic word display means; 224: characteristic word display section; 2241: high-frequency characteristic word display section; 2242: medium-frequency characteristic word display section; 2243: low-frequency characteristic word display section; 225: parameter setting section of the characteristic word display means; 2251: characteristic word display number setting means; 3: calculation program execution means (CPU); 4: calculation program holding means; 41: search interface operation routine; 42: morphological analysis routine; 43: search routine; 44: characteristic word display means operation routine; 441: characteristic word extraction routine; 442: co-occurrence relation analysis routine; 443: graph arrangement routine; 4431: y coordinate calculation routine; 4432: x coordinate calculation routine; 4433: routine for conversion to display coordinates; 4434: overlap avoidance routine; 4435: link arrangement routine; 444: graph display routine; 5: work area; 51: search work area; 511: keyword storage area; 5111: essential keyword storage area; 5112: additional keyword storage area; 5113: deducted keyword storage area; 512: search result storage area; 513: search result score distribution storage area; 52: characteristic word extraction work area; 521: characteristic word extraction parameter storage area; 5211: scanned document count upper limit storage area; 5212: frequency class division number storage area; 5213: extracted word count storage area; 522: word-segmented document storage area; 523: frequency database storage area; 524: characteristic word list storage area; 53: co-occurrence relation analysis work area; 531: co-occurrence data storage area; 532: co-occurrence link storage area; 54: graph arrangement work area; 541: normalized coordinate storage area; 542: graph arrangement parameter storage area; 543: graph storage area; 5431: node storage area; 5432: link storage area; 6: database holding means; 61: search target document database; 62: search index database; 63: word frequency database; 64: excluded word database.

Claims

[Claims]

1. A document search support method characterized by: detecting, from a search target document group, documents having a set keyword as search result documents; detecting the document frequency of a word, meaning the number of documents in the search result document group in which the word appears; detecting the total document frequency of the word, meaning the number of documents in the entire search target document group in which the word appears; deriving a frequency ratio, meaning the ratio between the document frequency of the word and the total document frequency of the word; dividing the document frequency into frequency classes according to a predetermined relationship and assigning each word to a frequency class according to its document frequency; extracting, as characteristic words, an appropriate number of words from each frequency class in descending order of frequency ratio; and displaying the extracted characteristic words in a graph format or a list format.

2. The document search support method according to claim 1, wherein the extracted characteristic words are displayed in either a list format for each frequency class or a graph format showing the relationship between the characteristic words.

3. A document search device comprising: means for detecting, from a search target document group, documents having a set keyword as search result documents; means for detecting the document frequency of a word, meaning the number of documents in the search result document group in which the word appears; means for detecting the total document frequency of the word, meaning the number of documents in the entire search target document group in which the word appears; means for deriving a frequency ratio, meaning the ratio between the document frequency of the word and the total document frequency of the word; means for dividing the frequency ratio into frequency classes according to a predetermined relationship; means for assigning each word to a frequency class according to the frequency ratio of the word; means for extracting a number of words as characteristic words in descending order of frequency ratio; and means for displaying the extracted characteristic words in a graph format or a list format.

4. The document search apparatus according to claim 3, further comprising: means for displaying the extracted characteristic words in either a list format for each frequency class or a graph format showing the relationships between the characteristic words; and means for selecting and specifying the display format of the characteristic words.

5. The document search apparatus according to claim 3 or 4, wherein the relationship between characteristic words is determined on the basis of the co-occurrence relation between them, and the graph format is a graph in which the characteristic words are nodes and related pairs of characteristic words are connected by links.
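
As an illustration of claim 5, the following Python sketch links characteristic word pairs by co-occurrence. The claims do not fix a relevance formula, so the Dice coefficient used here, the `threshold` parameter, and the function name `build_links` are assumptions chosen for the example only.

```python
from itertools import combinations


def build_links(result_docs, characteristic_words, threshold=0.5):
    """Return (word_a, word_b, relevance) links between characteristic words."""
    words = sorted(characteristic_words)
    # Document frequency of each characteristic word in the search result.
    df = {w: sum(1 for doc in result_docs if w in doc) for w in words}
    links = []
    for a, b in combinations(words, 2):
        # Co-occurrence frequency: number of documents containing both words.
        co = sum(1 for doc in result_docs if a in doc and b in doc)
        if co == 0:
            continue
        relevance = 2 * co / (df[a] + df[b])   # assumed measure: Dice coefficient
        if relevance >= threshold:
            links.append((a, b, relevance))
    return links
```

The returned triples can serve as the edges of a graph whose nodes are the characteristic words.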

6. The document search apparatus according to claim 3, wherein the set keywords are of three types, namely essential keywords, additional keywords, and deduction keywords; the search by essential keywords is performed under an AND condition over the essential keywords; each document in the retrieved search result document group is given a higher score according to the number of additional keywords it contains and a lower score according to the number of deduction keywords it contains; and characteristic words are extracted from the group of documents that obtained the higher scores.

7. The document search apparatus according to claim 6, wherein a search is performed by additional keywords when no essential keyword is set, and the search by each additional keyword is performed under an OR condition.
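
The keyword handling of claims 6 and 7 can be sketched as below. The claims state only that points are added or deducted "according to the number" of keywords, so the weights of plus one and minus one per keyword, the function name `score_documents`, and the representation of documents as sets of words are assumptions.

```python
def score_documents(docs, essential=(), additional=(), deduction=()):
    """Return (score, doc_index) pairs for the matching documents, best first."""
    scored = []
    for i, doc in enumerate(docs):
        if essential:
            # Essential keywords are combined under an AND condition.
            if not all(k in doc for k in essential):
                continue
        elif not any(k in doc for k in additional):
            # No essential keyword set: OR condition over additional keywords.
            continue
        plus = sum(1 for k in additional if k in doc)
        minus = sum(1 for k in deduction if k in doc)
        scored.append((plus - minus, i))
    # Higher scores first; ties keep the original document order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return scored
```

Characteristic words would then be extracted from the leading entries of the returned list, that is, from the documents with the higher scores.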

8. The document search apparatus according to claim 6, wherein the type of a set keyword can be changed among the three types of essential keyword, additional keyword, and deduction keyword, and a displayed characteristic word can be copied as any one of an essential keyword, an additional keyword, or a deduction keyword.

9. The document retrieval apparatus according to claim 4, wherein, in the characteristic word graph display, the vertical axis represents the document frequency of the characteristic word in the retrieved document group.

10. A document search service method comprising: retrieving, from a search target document group, the documents containing a keyword transmitted from a search source as a search result document group; detecting the document frequency of a word, meaning the number of documents in the search result document group in which the word appears; detecting the total document frequency of the word, meaning the number of documents in the entire search target document group in which the word appears; deriving a frequency ratio, meaning the ratio of the document frequency of the word to its total document frequency; classifying the frequency ratio into frequency classes according to a predetermined relationship and associating each word with a frequency class according to its frequency ratio; extracting, as characteristic words, an appropriate number of words from each frequency class in order of the magnitude of their frequency ratio; configuring the extracted characteristic words as data that can be displayed in a graph format showing the relationships between the characteristic words; configuring the extracted characteristic words as data that can be displayed in a list format for each frequency class; and transmitting the characteristic words to the search source as data that can be displayed in the graph format or the list format.

11. The document search service method according to claim 10, wherein the search source comprises at least means for transmitting a keyword that specifies the documents to be retrieved, and means for receiving and displaying the transmitted characteristic words as data that can be displayed in a graph format showing the relationships between the characteristic words or in a list format, and thereby receives the search service.

12. The document search service method according to claim 10, wherein the data that can be displayed in a graph format showing the relationships between the characteristic words or in a list format is transmitted to the search source together with display software, whereby the search source receives the search service.

13. The document search service method according to claim 10, wherein user interface driving software for receiving the search service is transmitted from the search service provider to the search source at the start of a search operation or in advance, and the search source receives the search service by running the user interface driving software.

14. A computer-readable recording medium on which frequency data of words appearing in a search result is recorded for calculating the characteristic degree of each word, wherein the data on each word comprises:
(a) a character string;
(b) a document frequency in the search result, indicating in how many of the retrieved documents the word appears;
(c) a total document frequency, indicating in how many documents of the entire search target document group the word is used, regardless of the search result;
(d) a characteristic degree of the word in the search result, calculated from the document frequency in the search result and the total document frequency in the entire database; and
(e) a frequency class obtained by classifying the word according to the magnitude of its document frequency in the search result;
and wherein, from each frequency class, words having a higher characteristic degree are set as characteristic words in the search target document group.
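
The per-word record of claim 14 might be laid out as in the following Python sketch. The field names, the helper `make_record`, the use of the plain frequency ratio as the characteristic degree, and the decimal order-of-magnitude class rule are assumptions made for illustration; the claim does not specify them.

```python
from dataclasses import dataclass


@dataclass
class WordRecord:
    string: str            # (a) character string
    result_df: int         # (b) document frequency in the search result
    total_df: int          # (c) document frequency in the whole database
    characteristic: float  # (d) characteristic degree
    freq_class: int        # (e) frequency class


def make_record(string, result_df, total_df):
    # Assumptions: the characteristic degree is result_df / total_df, and the
    # frequency class is the number of decimal digits of result_df minus one.
    return WordRecord(string, result_df, total_df,
                      result_df / total_df, len(str(result_df)) - 1)
```

Under these assumptions, `make_record("graph", 12, 48)` yields a characteristic degree of 0.25 and frequency class 1.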

15. A computer-readable recording medium on which co-occurrence data between characteristic words in a search result is recorded for calculating the degree of relevance between characteristic words appearing in the search result, wherein the data on each characteristic word pair comprises (a) a co-occurrence frequency, indicating how often the two characteristic words co-occur in the search result document group, and (b) a degree of relevance between the two characteristic words, calculated from the co-occurrence frequency and the frequency data of each of the two characteristic words in the search result; and wherein a link indicating a strong relationship is provided for each characteristic word pair having a high degree of relevance.

16. A computer-readable recording medium on which data is recorded for displaying on a screen a graph of characteristic word pairs appearing in a search result, wherein the data for displaying the graph on the screen comprises (a) data for displaying a characteristic word at each node of the graph and (b) data for displaying each link indicating the relevance between characteristic words; the data of each node consists of center coordinates, the character string to be displayed, and the number of characters in the vertical and horizontal directions and the size of the area in which the character string is displayed; the data of each link consists of start point coordinates and end point coordinates; and the characteristic word graph can thereby be displayed two-dimensionally by links and character strings.
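
The node and link data of claim 16 could be represented as in the sketch below. The class and field names, the reading of "size" as a character size in pixels with square character cells, and links running from node center to node center are assumptions for the example only.

```python
from dataclasses import dataclass


@dataclass
class Node:
    center: tuple   # (x, y) center coordinates
    text: str       # character string to display
    cols: int       # characters per row in the display area
    rows: int       # number of rows in the display area
    char_size: int  # assumed: width and height of one character cell


@dataclass
class Link:
    start: tuple    # start point coordinates
    end: tuple      # end point coordinates


def node_box(node):
    """Bounding box (left, top, right, bottom) of a node's text area."""
    w = node.cols * node.char_size
    h = node.rows * node.char_size
    x, y = node.center
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def link_between(a, b):
    """Link drawn from the center of node a to the center of node b."""
    return Link(start=a.center, end=b.center)
```

A renderer needs only these two record types to draw the graph: one text box per node and one line segment per link.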

17. A computer-readable recording medium on which data is recorded for displaying on a screen a characteristic word graph obtained by calculating the characteristic degree of each word appearing in a search result, deriving characteristic words, and linking characteristic word pairs determined to be highly relevant on the basis of their co-occurrence frequency, wherein the data on each word appearing in the search result comprises:
(a) a character string;
(b) a document frequency in the search result, indicating in how many of the retrieved documents the word appears;
(c) a total document frequency throughout the database, indicating in how many documents of the entire search target document group the word is used, regardless of the search result;
(d) a characteristic degree of the word in the search result, calculated from the document frequency in the search result and the total document frequency in the entire database; and
(e) a frequency class obtained by classifying the word according to the magnitude of its document frequency in the search result;
from each frequency class, words having a higher characteristic degree are set as characteristic words in the search target document group; in order to calculate the degree of relevance between the characteristic words, the data on each characteristic word pair comprises:
(f) a co-occurrence frequency, indicating how often the two characteristic words co-occur in the search result document group; and
(g) a degree of relevance between the two characteristic words, calculated from the co-occurrence frequency and the frequency data of each of the two characteristic words in the search result;
characteristic word pairs having a high degree of relevance are linked; in order to display the linked characteristic word graph on the screen, the data for displaying the characteristic word graph comprises:
(h) data for displaying a characteristic word at each node of the graph; and
(i) data for displaying each link indicating the relevance between characteristic words;
the data of each node consists of center coordinates, the character string to be displayed, and the number of characters in the vertical and horizontal directions and the size of the area in which the character string is displayed; the data of each link consists of start point coordinates and end point coordinates; and the characteristic word graph can thereby be displayed two-dimensionally by links and character strings.