JP2001014330A

JP2001014330A - Storage medium stored with term evaluation program

Info

Publication number: JP2001014330A
Application number: JP11185396A
Authority: JP
Inventors: Atsushi Takato; 淳高藤; Katsuhiko Mitobe; 勝彦水戸部; Katsuyuki Doi; 功志土居; Hiroyuki Mitsuya; 浩之三ツ矢
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 1999-06-30
Filing date: 1999-06-30
Publication date: 2001-01-19
Anticipated expiration: 2019-06-30
Also published as: JP4010711B2

Abstract

PROBLEM TO BE SOLVED: To analyze terms across multiple documents in a way that matches the operator's intention. SOLUTION: For each given retrieval condition, a term statistic arithmetic means 5 specifies one or more documents as term extraction targets among the document data stored in a document storage means 3, and calculates the statistics of the terms present in the specified documents as the term statistics for that retrieval condition. When supplied with the term statistics for several different retrieval conditions from the term statistic arithmetic means 5, an evaluated value determining means 7 determines the evaluated value of each term as a numeric vector with as many dimensions as there are retrieval conditions. Because the term extraction target documents are narrowed down by the retrieval conditions, terms unrelated to those conditions are not extracted. Because the evaluated value of each term is represented as a numeric vector, the terms can be analyzed.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a term evaluation apparatus that evaluates terms extracted from document data, and more particularly to a method of evaluating terms.

[0002]

[Prior Art and Problems to Be Solved by the Invention] Today, large amounts of document information are stored as data, and attempts are being made to discover desired information and knowledge from this document information (hereinafter referred to as "mining"). As one such technique, it has been proposed to extract terms from documents and to perform the mining on the basis of the extracted terms.

[0003] However, documents contain a wide variety of terms, and no established method has existed for deciding which terms to extract. Consequently, the extracted terms could not be evaluated.

[0004] An object of the present invention is to solve the above problem and to provide a term evaluation apparatus, or a method therefor, capable of evaluating terms extracted from a plurality of documents. A further object is to determine, from a plurality of documents, the documents to be subjected to term extraction.

[0005]

[Means for Solving the Problems and Effects of the Invention] 1) In a recording medium storing a program according to the present invention, the program causes a computer to execute the following processing:
A) when a plurality of search conditions are given, executing, for each search condition, the following term extraction target document determination process and term statistic calculation process:
a1) a term extraction target document determination process of specifying, from a plurality of document data, one or more documents to be subjected to term extraction on the basis of the given search condition;
a2) a term statistic calculation process of calculating the statistics of the terms present in the specified term extraction target documents as the term statistics for that search condition; and
B) using the plurality of term statistics, determining, as the evaluation value of each term, a numerical vector having the same number of dimensions as the number of given search conditions.

[0006] By determining the extraction target documents from the document data for each search condition in this way, terms can be extracted from documents that match that search condition. In addition, representing each extracted term as a numerical vector having the same number of dimensions as the number of search conditions makes various numerical operations possible. Terms extracted from a plurality of documents can thereby be evaluated, and the terms can also be analyzed by selecting and classifying them.
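The flow of steps A) and B) can be sketched roughly as follows. This is a minimal illustration in Python, not the patented implementation: the function names (`select_documents`, `term_statistics`, `evaluate_terms`), the representation of a search condition as a set of keywords, the keyword-overlap selection rule, and the use of document counts as the statistic are all assumptions made for the example.

```python
from collections import Counter

def select_documents(docs, condition):
    """a1) Pick documents sharing at least one word with the condition (assumed rule)."""
    return [d for d in docs if condition & set(d)]

def term_statistics(selected):
    """a2) Count, for each term, how many selected documents contain it."""
    stats = Counter()
    for doc in selected:
        stats.update(set(doc))
    return stats

def evaluate_terms(docs, conditions):
    """B) Build one vector per term, with one element per search condition."""
    per_condition = [term_statistics(select_documents(docs, c)) for c in conditions]
    terms = set().union(*per_condition)
    return {t: [stats.get(t, 0) for stats in per_condition] for t in terms}

# Hypothetical, already tokenized documents.
docs = [
    ["sales", "growth", "hiring"],
    ["layoffs", "bankruptcy", "sales"],
    ["growth", "investment"],
]
conditions = [{"growth"}, {"bankruptcy"}]
vectors = evaluate_terms(docs, conditions)
print(vectors["sales"])  # one element per search condition
```

Each term ends up with a vector whose length equals the number of search conditions, which is what later threshold-based selection operates on.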

[0007] 2) In the recording medium storing the program according to the present invention, terms having predetermined characteristics are extracted on the basis of the evaluation value of each term. Characteristic terms can thereby be extracted on the basis of the vector.

[0008] 3) In the recording medium storing the program according to the present invention, when a plurality of viewpoint profiles are given as the search conditions, the term extraction target documents are specified, for each viewpoint profile, using a plurality of related words associated with that viewpoint profile. Each term can therefore be evaluated by the vector element corresponding to each viewpoint profile.

[0009] 4) In the recording medium storing the program according to the present invention, the creation time of each of the plurality of documents is stored, and when one viewpoint profile and extraction target periods are given, the plurality of search conditions are generated by applying the one viewpoint profile to each extraction target period, and the term extraction target document determination process is executed. This makes it possible to analyze time-series changes under one viewpoint profile.
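One way to picture 4): a single viewpoint profile is paired with each extraction period to form one search condition per period. The sketch below assumes that documents are dicts with `created` (a `datetime.date`) and `words` fields and that periods are half-open date ranges. These structures and the name `conditions_by_period` are hypothetical and do not come from the source.

```python
from datetime import date

def conditions_by_period(profile_words, periods):
    """Generate one search condition per extraction period from a single profile."""
    return [{"words": set(profile_words), "start": s, "end": e} for s, e in periods]

def select_documents(docs, condition):
    """Keep documents created within [start, end) that mention a profile word."""
    return [
        d for d in docs
        if condition["start"] <= d["created"] < condition["end"]
        and condition["words"] & set(d["words"])
    ]

# Hypothetical documents with creation dates.
docs = [
    {"created": date(1998, 3, 1), "words": ["growth", "exports"]},
    {"created": date(1999, 2, 1), "words": ["growth", "hiring"]},
    {"created": date(1999, 5, 1), "words": ["layoffs"]},
]
periods = [
    (date(1998, 1, 1), date(1999, 1, 1)),
    (date(1999, 1, 1), date(2000, 1, 1)),
]
conditions = conditions_by_period(["growth"], periods)
selected = [select_documents(docs, c) for c in conditions]
print([len(s) for s in selected])
```

Because the conditions are ordered by period, the resulting term vectors can be read as a time series.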

[0010] 5) In the recording medium storing the program according to the present invention, the extracted terms are terms for which only a specific vector element has a value at or above, or at or below, a predetermined threshold. Terms unique to a specific interest of the operator can therefore be extracted.
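A minimal sketch of this "unique term" selection, assuming term vectors are held in a dict of lists and implementing only the "at or above the threshold" reading (the "at or below" reading would flip the comparisons). The function name `unique_terms` and the sample data are hypothetical.

```python
def unique_terms(vectors, index, threshold):
    """Terms whose element at `index` is >= threshold while every other element is below it."""
    result = []
    for term, vec in vectors.items():
        others = [v for i, v in enumerate(vec) if i != index]
        if vec[index] >= threshold and all(v < threshold for v in others):
            result.append(term)
    return sorted(result)

# Hypothetical vectors: element 0 = first search condition, element 1 = second.
vectors = {"hiring": [3, 0], "layoffs": [0, 4], "sales": [2, 2]}
print(unique_terms(vectors, index=0, threshold=1))
```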

[0011] 6) In the recording medium storing the program according to the present invention, the extracted terms are terms for which a specific plurality of vector elements all have values at or above, or at or below, a predetermined threshold. Terms common to specific interests of the operator can therefore be extracted.

[0012] 7) In the recording medium storing the program according to the present invention, the extracted terms are terms whose value for a specific vector element is markedly larger, or markedly smaller, than that of other terms. From among the extracted terms, those especially related to a specific interest of the operator can therefore be extracted.

[0013] 8) In the recording medium storing the program according to the present invention, the plurality of search conditions are search conditions whose order is meaningful, and the extracted terms are terms for which the value of some vector element differs from the vector elements before and after it by more than a predetermined difference. Terms that change at a particular point in the time series under one viewpoint profile can therefore be extracted.
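The comparison with neighbouring elements might look like the sketch below. It assumes the absolute difference is used, that an element must differ from both of its neighbours, and that the first and last elements (which have only one neighbour) are skipped. The source does not spell out these details, and `changing_terms` is a hypothetical name.

```python
def changing_terms(vectors, min_diff):
    """Map each term to the positions that differ from both neighbours by more than min_diff."""
    found = {}
    for term, vec in vectors.items():
        hits = [
            i for i in range(1, len(vec) - 1)
            if abs(vec[i] - vec[i - 1]) > min_diff
            and abs(vec[i] - vec[i + 1]) > min_diff
        ]
        if hits:
            found[term] = hits
    return found

# Hypothetical vectors over four consecutive periods.
vectors = {"euro": [1, 9, 2, 1], "sales": [5, 5, 6, 5]}
print(changing_terms(vectors, min_diff=3))
```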

[0014] 9) In the recording medium storing the program according to the present invention, predetermined character strings that are contained in terms and that express the intention of the document author are classified by intention and stored, and when the operator selects one of the categories, only terms containing the designated character strings are extracted. Terms containing a character string that expresses the document author's intention can therefore be extracted, which makes it easy to extract terms that match the operator's intention. Examples of such character strings include auxiliary verbs, suffixes, and counter words.
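As an illustration of this category-based filtering, the sketch below stores a few Japanese endings under hypothetical intention categories and keeps only the terms that contain one of them. The category names, the chosen strings, and the sample terms are assumptions for the example and are not taken from the source.

```python
# Hypothetical table: intention category -> character strings expressing it.
INTENT_STRINGS = {
    "desire": ["たい", "てほしい"],    # roughly "want to ..."
    "negation": ["ない", "ません"],    # roughly "not ..."
}

def terms_with_intent(terms, category, table=INTENT_STRINGS):
    """Keep only terms containing one of the strings registered for the category."""
    strings = table[category]
    return [t for t in terms if any(s in t for s in strings)]

terms = ["買いたい", "売れない", "景気"]
print(terms_with_intent(terms, "desire"))
```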

[0015] 10) In the method of evaluating terms constituting document data according to the present invention:
A) when a plurality of search conditions are given, the following a1) term extraction target document determination process and a2) term statistic calculation process are performed for each search condition:
a1) a term extraction target document determination process of determining the term extraction target documents by specifying, from a plurality of document data, one or more documents to be subjected to term extraction on the basis of the given search condition;
a2) a term statistic calculation process of calculating the statistics of the terms present in the specified term extraction target documents as the term statistics for that search condition; and
B) using the plurality of term statistics, the evaluation value of each term is determined as a numerical vector having the same number of dimensions as the number of given search conditions.

[0016] By determining the extraction target documents from the document data for each search condition in this way, terms can be extracted from documents that match that search condition. In addition, representing each extracted term as a numerical vector having the same number of dimensions as the number of search conditions makes various numerical operations possible. Terms extracted from a plurality of documents can thereby be evaluated, and the terms can also be analyzed by selecting and classifying them.

[0017] 11) In the method of evaluating terms constituting document data according to the present invention, when the operator supplies a character string that is used attached to a verb or adjective and that expresses the document author's intention in that verb or adjective, only terms containing the designated character string are extracted. Terms containing a character string that expresses the document author's intention can therefore be extracted, which makes it easy to extract terms that match the operator's intention.

[0018] 12) In the apparatus for evaluating terms constituting document data according to the present invention, the term statistic calculation means specifies, on the basis of a given search condition, one or more documents to be subjected to term extraction from a plurality of document data, calculates the statistics of the terms present in the specified term extraction target documents, and outputs them as the term statistics for that search condition. When given the term statistics for a plurality of different search conditions by the term statistic calculation means, the evaluation means determines the evaluation value of each term as a numerical vector having the same number of dimensions as the number of given search conditions.

[0019] By determining the extraction target documents from the document data for each search condition in this way, terms can be extracted from documents that match that search condition. In addition, representing each extracted term as a numerical vector having the same number of dimensions as the number of search conditions makes various numerical operations possible. Terms extracted from a plurality of documents can thereby be evaluated, and the terms can also be analyzed by selecting and classifying them.

[0020]

[Embodiments of the Invention] 1. Description of the Functional Block Diagram
An embodiment of the present invention is described with reference to the drawings. The term evaluation apparatus 1 shown in FIG. 1 comprises a document storage means 3, a term statistic calculation means 5, an evaluation means 7, an analysis means 8, and a notification means 19.

[0021] The document storage means 3 stores a plurality of documents. On the basis of a given search condition, the term statistic calculation means 5 specifies one or more documents to be subjected to term extraction from the plurality of document data stored in the document storage means 3, and calculates the statistics of the terms present in the specified term extraction target documents as the term statistics for that search condition. When given the term statistics for a plurality of different search conditions by the term statistic calculation means 5, the evaluation value determining means 7 determines the evaluation value of each term as a numerical vector having the same number of dimensions as the number of given search conditions.

[0022] The analysis means 8 analyzes the terms on the basis of the evaluation value of each term. The notification means 9 reports the analysis result.

[0023] In this embodiment, the term statistic calculation means 5 stores a plurality of pieces of viewpoint profile correspondence information, each consisting of a viewpoint profile and a plurality of related words associated with that viewpoint profile. When a viewpoint profile is given as a search condition, the means specifies the term extraction target documents using those related words.

[0024] By narrowing down the term extraction documents with a viewpoint profile in this way, the extracted terms match the viewpoint profile the operator is interested in. In addition, representing the evaluation value of each term as a numerical vector makes it possible to analyze the terms.

[0025] In this embodiment a display means is used as the notification means, but other notification means may be used.

[0026] 2. Hardware Configuration
(2.1) Overview
The hardware configuration of the term evaluation apparatus 1 shown in FIG. 1 is described below. The computer system 40 shown in FIG. 2 comprises an input device 41, a control device 43, a display device 45, and a storage device 47. The input device 41 is used to enter various commands. The storage device 47 stores a program that performs predetermined processing in response to the commands given. The control device 43 performs predetermined data processing in accordance with the program stored in the storage device 47.

[0027] (2.2) Details
FIG. 3 shows an example of a hardware configuration in which the computer system 40 shown in FIG. 2 is realized using a CPU.

[0028] The computer system 40 comprises a CPU 23, a memory 27, a hard disk 26, a CRT 30, an FDD 25, a keyboard 28, a mouse 31, and a bus line 29. The CPU 23 controls each component via the bus line 29 in accordance with the control program stored on the hard disk 26.

[0029] This control program was read, via the FDD 25, from the flexible disk 25a on which the program is stored, and installed on the hard disk 26. Instead of a flexible disk, the program may be installed on the hard disk from another computer-readable recording medium that tangibly carries the program, such as a CD-ROM or an IC card. It may also be downloaded over a communication line.

[0030] In this embodiment, the program is installed from the flexible disk onto the hard disk 26, so the computer executes the program stored on the flexible disk indirectly. The invention is not limited to this; the program stored on the flexible disk may instead be executed directly from the FDD 25. Programs executable by a computer include not only those that can be run directly once installed as-is, but also those that must first be converted into another form (for example, compressed data that must be decompressed) and those that can be run in combination with other modules.

[0031] The hard disk 26 has a program storage section 26a, a correspondence table storage section 26b, a document storage section 26c, and an evaluation storage section 26d. The program storage section 26a stores the program described below. The correspondence table storage section 26b stores a plurality of viewpoint profiles as shown in FIG. 4, and a plurality of related keywords is stored for each viewpoint profile. The document storage section 26c stores the documents to be evaluated. In this embodiment, each document consists of its creation date and time, its title, and its content. The evaluation storage section 26d stores the evaluation result of each term. The memory 27 stores various other calculation results and the like.

[0032] 3. Flowchart
Next, the program stored in the program storage section 26a of the hard disk 26 is described with reference to the flowcharts of FIGS. 5 to 7. The following describes, as an example, the case where each term is evaluated using "boom" and "recession" as viewpoint profiles.

[0033] First, the operator enters a plurality of queries. Because viewpoint profiles are used as queries in this embodiment, the operator displays a selection box 62 such as that shown in FIG. 8 on the CRT 30 and selects the viewpoint profiles "boom" and "recession". FIG. 8 shows the state in which the viewpoint profile "recession" has been selected after the viewpoint profile "boom".

[0034] If the desired viewpoint profile does not exist in the selection box 62 shown in FIG. 8, the operator clicks the button 64 and adds the required viewpoint profile and its related keywords. These are then added to the correspondence table shown in FIG. 4.

[0035] For an existing viewpoint profile as well, related keywords can be added or deleted in the related keyword box 63. By creating viewpoint profiles that reflect the viewpoints the operator is interested in and specifying the term extraction target documents with them, documents that fit the viewpoint the operator wants can be extracted.

[0036] When the CPU 23 determines that a plurality of queries have been entered (step S1 in FIG. 5), it initializes the query number k (step S3). It then determines the extraction target documents for the 0th query, the viewpoint profile "boom" (step S5). The process of determining the extraction target documents is described with reference to FIG. 6. In this embodiment, as described below, the similarity of each document to the viewpoint profile is evaluated, and documents whose similarity exceeds a predetermined threshold are determined as the term extraction documents.

[0037] The CPU 23 initializes the document number m (step S50) and determines whether vectorization of each document has been completed (step S51). At this point vectorization has not been completed, so processing proceeds to step S53, where morphological analysis is performed on each document to extract the words (terms) appearing in it. Important terms are then determined from the extracted terms using the tf-idf method (step S55). The tf-idf method is a keyword-determination technique used in information retrieval. It weights a term using tf (term frequency), which indicates how often the term occurs in a given document, and idf (inverse document frequency), which indicates the rarity of the term, that is, in how few documents of the whole collection it appears.
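The source does not give the exact tf-idf formula used. The sketch below assumes one common variant, weight = tf × log(N / df), where N is the number of documents and df the number of documents containing the term. It also assumes the documents have already been split into terms by morphological analysis, and the sample data is made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)            # occurrences within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Hypothetical tokenized documents.
docs = [["boom", "sales", "sales"], ["recession", "sales"], ["boom", "hiring"]]
w = tfidf(docs)
print(max(w[1], key=w[1].get))  # highest-weighted term of the second document
```

Under this variant a term that appears in every document receives weight 0, so only terms that discriminate between documents are candidates for important terms.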

[0038] Using the important terms extracted in this way, each document is converted into a numerical vector whose number of dimensions equals the number of important terms. For example, 100 important terms yield a 100-dimensional numerical vector. Some documents do not contain a given important term; in that case the value of that dimension for that document is 0 (the vector is sparse). The CPU 23 stores the numerical vectors obtained in this way in the memory 27.
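A small sketch of this vectorization step, assuming each document's term weights are available as a dict and the important terms are kept in a fixed order. The name `vectorize` and the sample weights are hypothetical.

```python
def vectorize(doc_weights, important_terms):
    """Map a {term: weight} dict onto a fixed term order; absent terms become 0."""
    return [doc_weights.get(t, 0.0) for t in important_terms]

important_terms = ["boom", "recession", "hiring"]
doc_weights = {"boom": 0.4, "sales": 0.8}   # "sales" is not an important term here
print(vectorize(doc_weights, important_terms))
```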

[0039] Next, the CPU 23 vectorizes the viewpoint profile (step S59). In this embodiment, all related keywords of the k-th viewpoint profile are read from the hard disk 26 and converted into a numerical vector whose number of dimensions equals the number of related keywords in the viewpoint profile. Specifically, a tf-idf value is obtained for each related keyword from its tf within the viewpoint profile and its idf over all documents, and these values form the numerical vector.

[0040] Next, the similarity between the viewpoint profile and the m-th document is calculated (step S61). The similarity can be obtained as the inner product of the numerical vector obtained in step S57 and the numerical vector obtained in step S59.

[0041] Next, the CPU 23 determines whether the similarity calculation has been completed for all documents (step S63). If it has not, the document number m is incremented (step S65) and the processing from step S61 onward is performed.

[0042] When the similarity calculation has been completed for all documents in step S63, documents whose similarity exceeds a predetermined threshold are determined as the extraction target documents (step S67).
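Steps S61 to S67 can be sketched as below. The text describes the document vector and the profile vector as having different numbers of dimensions, so this sketch assumes both are stored as sparse `{term: weight}` dicts and that the inner product is taken over the terms they share. That alignment, the function names, and the sample weights are assumptions.

```python
def inner_product(a, b):
    """Inner product of two sparse vectors stored as {term: weight} dicts."""
    return sum(w * b[t] for t, w in a.items() if t in b)

def select_by_similarity(doc_vectors, profile_vector, threshold):
    """Return indices of documents whose similarity to the profile exceeds the threshold."""
    return [
        m for m, vec in enumerate(doc_vectors)
        if inner_product(vec, profile_vector) > threshold
    ]

profile = {"growth": 1.0, "hiring": 0.5}
doc_vectors = [
    {"growth": 2.0, "sales": 1.0},   # similarity 2.0
    {"layoffs": 3.0},                # similarity 0.0
    {"hiring": 1.0},                 # similarity 0.5
]
print(select_by_similarity(doc_vectors, profile, threshold=0.4))
```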

[0043] Next, the CPU 23 calculates the statistics of the terms present in the extraction target documents. This calculation is described with reference to FIG. 7.

[0044] The CPU 23 initializes the target document number i and the target term number j (step S25). The i-th document is taken as the target document (step S27), and the j-th term of the target document is taken as the target term (step S29).

[0045] The CPU 23 determines whether this is the first appearance of the target term in the target document (step S31). If so, it increments the term appearance document count bi (step S33).

[0046] The CPU 23 determines whether the target term exists in the extracted term table (not shown) in the memory 27 (step S35). If it already exists, the appearance frequency ti of the term is incremented (step S37). If it does not, the term is added to the extracted term table (step S39).

[0047] The CPU 23 determines whether the target term is the last term (step S41). If it is not, the target term number j is incremented (step S43) and the processing from step S29 onward is repeated. If it is the last term, the CPU 23 determines whether term extraction has been completed for all documents (step S45). If documents remain for which term extraction has not been completed, the target document number i is incremented (step S47), the target term number j is initialized (step S49), and the processing from step S27 onward is repeated. In this way the appearance document count bi and the appearance frequency ti are obtained for each term.
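The loop of steps S25 to S49 amounts to counting, for each term, the number of documents it appears in (bi) and its total number of occurrences (ti). A compact sketch, assuming tokenized documents and treating "add to the extracted term table" as starting the count at 1:

```python
def term_statistics(docs):
    """Return (b, t): documents containing each term, and total occurrences of each term."""
    b, t = {}, {}
    for doc in docs:                      # target document i
        seen = set()
        for term in doc:                  # target term j
            if term not in seen:          # first appearance in this document
                seen.add(term)
                b[term] = b.get(term, 0) + 1
            t[term] = t.get(term, 0) + 1  # add to the table, or increment
    return b, t

# Hypothetical extraction target documents.
docs = [["sales", "sales", "growth"], ["sales", "hiring"]]
b, t = term_statistics(docs)
print(b["sales"], t["sales"])
```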

[0048] Next, the CPU 23 determines whether processing has been completed for all queries (step S9 in FIG. 5). If it has not, the CPU 23 increments the query number k (step S11) and repeats the processing from step S5 onward.

[0049] When processing has been completed for all queries, the CPU 23 converts each term into a numerical vector over the queries (step S13). In this embodiment, any of the following can be selected as the vector element: 1) term appearance frequency ti, 2) term appearance document count bi, 3) term appearance frequency ti / term appearance document count bi, or 4) relevance.
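The selectable vector element could be modelled as below. Only options 1) to 3) are covered, since they follow directly from ti and bi; relevance is left out of this sketch. The option names and the convention of returning 0 when a term does not appear under a query are assumptions.

```python
def element(kind, t, b):
    """Compute one vector element from appearance frequency t and document count b."""
    if kind == "frequency":                # 1) ti
        return t
    if kind == "documents":                # 2) bi
        return b
    if kind == "frequency_per_document":   # 3) ti / bi
        return t / b if b else 0.0
    raise ValueError(f"unknown element kind: {kind}")

def term_vector(kind, stats_per_query):
    """stats_per_query: list of (t, b) pairs for one term, one pair per query."""
    return [element(kind, t, b) for t, b in stats_per_query]

# Hypothetical term: 6 occurrences in 3 documents for query 0, absent for query 1.
print(term_vector("frequency_per_document", [(6, 3), (0, 0)]))
```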

[0050] Relevance is a numerical value that expresses how strongly a term in the documents is related to the group of related words making up a viewpoint profile, and it varies with the algorithm used to evaluate it. In this embodiment, it is obtained from the term appearance document count bi and the rarity of the term over all documents. Accordingly, relevance is higher the more documents a term appears in and lower the fewer it appears in. In addition, a term that is present in the term extraction documents but seldom present in other documents has high relevance. In other words, relevance is high when the term appearance document count is high and the term occurs frequently in the term extraction documents while rarely occurring in documents outside them.
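The text states only that relevance depends on the term appearance document count bi and on rarity over all documents, and it gives no formula. The sketch below therefore uses an assumed, idf-like score, bi × log(N / df), purely to illustrate the described behaviour. The function name, parameters, and formula are not from the source.

```python
import math

def relevance(b_selected, df_all, n_all):
    """Assumed score: document count in the extraction set times log-rarity over all documents."""
    if b_selected == 0 or df_all == 0:
        return 0.0
    return b_selected * math.log(n_all / df_all)

# Term A: in 8 extraction documents and 10 documents overall (concentrated in the extraction set).
# Term B: in 8 extraction documents and 90 documents overall (common everywhere).
print(relevance(8, 10, 100) > relevance(8, 90, 100))
```

With this assumed score, a term scores higher when it appears in more extraction documents and when it is rare in the collection as a whole, matching the qualitative description.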

[0051] 1) Term appearance frequency ti and 3) term appearance frequency ti / term appearance document count bi are considered useful for extracting terms that co-occur with the keywords used in the query set, that is, for extracting events related to the operator's interests. 2) Term appearance document count bi is considered useful for extracting the main topics related to the operator's interests and the number of times they occur. 4) Relevance is considered particularly useful for extracting events that have a rare connection with the operator's interests.

[0052] FIG. 9 shows an example in which each term is represented by a numerical vector. Here, the relevance is obtained as the vector element for the viewpoint profiles "boom" and "recession", and the terms for which any element is 0 or more (that is, all terms) are displayed in the display area 72. To set the vector element and the threshold, the operator clicks the threshold setting button 71 with the mouse; the threshold setting dialog 74 is then displayed, and the operator can set the desired conditions.

[0053] FIG. 10 shows a display example in which the setting conditions have been changed so as to extract common terms. In FIG. 10, the vector element is the number of documents in which the term appears, bi, and the terms for which every query has a value of 1 or more are displayed in the display area 72. A term for which all of the specified vector elements have values at or above, or at or below, the threshold specified by the operator is called a common term.

[0054] A common term is a term that is extracted in common for all queries. When the queries are viewpoint profiles representing the user's interests, as in the present embodiment, a common term is often a term (topic) shared by all of the user's interests.
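Common-term extraction can be sketched as a filter over the term vectors. The function name `common_terms`, the `at_least` switch and the sample values are assumptions made for illustration.

```python
def common_terms(vectors, threshold, at_least=True):
    """Return terms whose every vector element passes the threshold test."""
    if at_least:
        passes = lambda value: value >= threshold
    else:
        passes = lambda value: value <= threshold
    return [term for term, vec in vectors.items()
            if all(passes(value) for value in vec)]


# Hypothetical document counts bi for the queries "boom" and "recession".
sample = {"export": [1, 1], "growth": [2, 0], "layoff": [0, 2]}
shared = common_terms(sample, threshold=1)
```

With a threshold of 1 on every query, only "export" remains, because it is the only term that appears in the extraction documents of both queries.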

[0055] FIG. 11 shows a display example in which the setting conditions have been changed so as to extract unique terms. In FIG. 11, the vector element is the number of documents in which the term appears, bi, and the terms for which only one query has a value of 1 or more are displayed in the display area 72. A term for which only a specific vector element has a value at or above, or at or below, the threshold specified by the operator is called a unique term.

[0056] A unique term is a term that is extracted only by the corresponding query. When the queries are viewpoint profiles representing the user's interests, as in the present embodiment, a unique term is often a term specific to one particular interest of the user.
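Unique-term extraction can be sketched in the same way, keeping only the terms for which exactly one vector element reaches the threshold. The name `unique_terms` and the sample values are hypothetical, and only the "at or above" case is shown.

```python
def unique_terms(vectors, threshold):
    """Map each unique term to the index of the single query that passes."""
    result = {}
    for term, vec in vectors.items():
        hits = [k for k, value in enumerate(vec) if value >= threshold]
        if len(hits) == 1:  # exactly one query reaches the threshold
            result[term] = hits[0]
    return result


# Hypothetical document counts bi for the queries "boom" and "recession".
sample = {"export": [1, 1], "growth": [2, 0], "layoff": [0, 2]}
only_one = unique_terms(sample, threshold=1)
```

Returning the query index alongside the term shows which interest the term is specific to.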

[0057] Next, the CPU 23 extracts characteristic terms (step S15 in FIG. 5). The extraction criterion for characteristic terms may be supplied by the operator. In the present embodiment, terms containing a specific character string are extracted. In this case, characteristic terms are extracted by setting a filtering condition, as follows.

[0058] FIG. 12 shows a display example in which only terms containing the character string "会社" ("company") have been extracted. To set the filtering condition, the operator clicks the filter button 75; the filtering dialog 76 is then displayed, and "会社" can be set as the character string filter.

[0059] As filtering conditions, a dictionary filter and a pattern filter can be set in addition to the character string filter, either alone or in combination.

[0060] A dictionary filter is a glossary storing one or more terms desired by the operator. In the present embodiment, the dictionary filter stores in advance the auxiliary verbs corresponding to each user intention, as shown in FIG. 13, so that the operator can select among them. Specifically, when the operator selects "desire" (希望) as the intention, the multiple entries associated with it, such as "たい" ("want to"), are applied as character string filters combined under an OR condition. This makes it possible to extract terms containing "たい", for example "知りたい" ("want to know") and "調べたい" ("want to look up").
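The dictionary filter can be sketched as a mapping from an intention to character strings that are matched under an OR condition. The dictionary contents below are hypothetical (in particular the entry "ほしい" is an assumed example, not taken from FIG. 13), as is the name `dictionary_filter`.

```python
# Hypothetical dictionary: intention -> character strings expressing it.
INTENT_DICTIONARY = {
    "希望": ["たい", "ほしい"],
}


def dictionary_filter(terms, intent, dictionary=INTENT_DICTIONARY):
    """Keep terms containing any string registered for the intention (OR)."""
    patterns = dictionary[intent]
    return [term for term in terms
            if any(pattern in term for pattern in patterns)]


wanted = dictionary_filter(["知りたい", "調べたい", "会社"], "希望")
```

Because each dictionary entry is an ordinary substring test, an operator could obtain the same result by typing the strings directly into the character string filter.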

[0061] Registering in a dictionary, in advance, the auxiliary verbs that serve as guides for mining in this way makes mining easier.

[0062] The dictionary filter may also store in advance one or more suffixes for each user intention so that they can be selected. A suffix is a dependent word that is attached to the end of a word to add meaning, or to give the word the status of a particular part of speech. Examples are "的" (roughly "-ic") and "性" (roughly "-ness" or "-ity"). This makes it possible to extract terms such as "具体的" ("concrete") and "革新性" ("innovativeness"), and then to find out which words such terms modify in the documents.

[0063] The dictionary filter may also store in advance one or more counter words (unit words) for each user intention so that they can be selected. Examples are "円" ("yen") and "MHz". Because terms composed of a numeral and a counter word can then be extracted, it is possible to find out which words such terms modify in the documents. For prices in particular, a price threshold can be set so that only information on products priced at or below, or at or above, that threshold is extracted.
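Extraction of numeral-plus-counter terms with a price threshold can be sketched with a regular expression. This assumes terms are plain strings consisting of a number immediately followed by the counter word; the name `quantity_terms` and the sample terms are hypothetical.

```python
import re


def quantity_terms(terms, counter, max_value=None):
    """Return (term, value) for terms made of a numeral and the counter word.

    When max_value is given, only values at or below it are kept.
    """
    pattern = re.compile(r"^(\d+(?:\.\d+)?)" + re.escape(counter) + r"$")
    hits = []
    for term in terms:
        match = pattern.match(term)
        if match:
            value = float(match.group(1))
            if max_value is None or value <= max_value:
                hits.append((term, value))
    return hits


terms = ["500円", "12000円", "300MHz", "円高"]
all_prices = quantity_terms(terms, "円")
cheap = quantity_terms(terms, "円", max_value=1000)
```

The term "円高" is rejected because it contains the counter word without a leading numeral.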

[0064] The dictionary filter may further be given a hierarchical structure for each user intention.

[0065] In the present embodiment, the auxiliary verbs corresponding to the user's intentions are stored in the dictionary filter in advance. However, filtering may also be performed without such a dictionary filter, with the user supplying these strings directly to the character string filter.

[0066] Besides the above, characteristic terms can also be extracted based on criteria such as the following.

[0067] 1) Singular term: a term for which a certain vector element value is distinctly larger or distinctly smaller than for other terms.

A singular term is a term for which the difference among its vector element values is larger than a predetermined threshold. Although not to the same degree as a unique term, it is often a term related to a particular interest of the user. For example, a term is extracted as a singular term both when its three vector element values are "1, 1, 6" with a threshold of "5" and when they are "1, 3, 6" with a threshold of "5".
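The singular-term criterion can be sketched as a test on the spread of a term's vector elements. The comparison used here (largest element minus smallest element, at or above the threshold) is an assumption chosen because it reproduces both worked examples in the text; the exact comparison is not spelled out. The name `singular_terms` is hypothetical.

```python
def singular_terms(vectors, threshold):
    """Return terms whose largest and smallest elements differ by >= threshold."""
    return [term for term, vec in vectors.items()
            if max(vec) - min(vec) >= threshold]


# The two example vectors from the text plus one flat counterexample.
sample = {"x": [1, 1, 6], "y": [1, 3, 6], "z": [2, 3, 4]}
singular = singular_terms(sample, threshold=5)
```

Both "x" and "y" have a spread of 5 and are extracted, while "z" has a spread of 2 and is not.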

[0068] 2) Difference term: for queries whose order is meaningful, a term with a large difference between a given query and the queries before and after it.

For example, suppose a series of mutually related queries is used to trace shifts in the best-selling genres of automobiles: a first viewpoint profile "sedan", a second viewpoint profile "sports", a third viewpoint profile "RV", and so on. Difference terms then make it possible to extract new topics and the like that co-occur with each shift. Difference terms are also effective when the extraction period is restricted, as described later.
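Difference-term extraction over ordered queries can be sketched by comparing each vector element with its predecessor. The name `difference_terms`, the use of an absolute difference, and the sample values are assumptions.

```python
def difference_terms(vectors, min_jump):
    """Map each term to the query positions where it changes by >= min_jump."""
    result = {}
    for term, vec in vectors.items():
        jumps = [k for k in range(1, len(vec))
                 if abs(vec[k] - vec[k - 1]) >= min_jump]
        if jumps:
            result[term] = jumps
    return result


# Hypothetical values over the ordered queries "sedan", "sports", "RV".
sample = {"minivan": [0, 1, 9], "engine": [5, 5, 6]}
shifts = difference_terms(sample, min_jump=5)
```

Here "minivan" jumps by 8 between the second and third queries, so it is reported at position 2, while "engine" stays nearly constant and is dropped.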

[0069] As described above, in the present embodiment the documents to be extracted from are determined for each viewpoint profile, the evaluation of each term under that viewpoint profile is computed, and this is repeated for a plurality of viewpoint profiles, so that each term is evaluated by a numerical vector over the viewpoint profiles. Because the term extraction target documents are specified from the viewpoint the operator wants, terms are extracted from documents related to the operator's interests, and terms closely related to those interests can therefore be extracted. In addition, because the total number of terms is narrowed down, term analysis becomes easier.

[0070] Furthermore, by representing each extracted term as a vector over a plurality of viewpoint profiles, terms can be evaluated from multiple perspectives according to the viewpoints the operator wants. This makes it easier to discover previously unknown knowledge.

[0071] The present embodiment employs a pattern filter that specifies a pattern for the vector elements of each term. This makes it possible to extract, for example, the common terms, unique terms, singular terms, and difference terms described above.

[0072] 4. Trend Analysis

In the above embodiment, different viewpoint profiles were used as the plurality of search conditions. However, a plurality of search conditions may instead be formed from a single viewpoint profile combined with the creation period of each document. For example, as shown in FIG. 14, the viewpoint profile "recession" is selected, and weights are set for its related keywords in the weighting dialog 83. Here, every related keyword is given a weight of "1". In addition, the period of interest is set with a start date of 1997/1/1, a period length of one month, and a count of 5.
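The way a start date, a period length and a count expand into a series of extraction periods can be sketched as follows. This assumes periods are whole calendar months beginning on the first day of a month; the helper names `add_months` and `periods` are hypothetical.

```python
from datetime import date, timedelta


def add_months(d, months):
    """Return the first day of the month that is `months` after d's month."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def periods(start, months_per_period, count):
    """Return `count` consecutive (first_day, last_day) pairs."""
    result = []
    for n in range(count):
        begin = add_months(start, n * months_per_period)
        end = add_months(start, (n + 1) * months_per_period) - timedelta(days=1)
        result.append((begin, end))
    return result


monthly = periods(date(1997, 1, 1), months_per_period=1, count=5)
```

Each resulting period, combined with the single viewpoint profile, then acts as one search condition, so the term vectors have one element per period.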

[0073] FIG. 15 shows the result of execution under these conditions. In the display area 77, the term extraction documents are displayed in date order. In the display area 72, the terms extracted from the term extraction documents are displayed sorted by relevance. In the present embodiment, the sorting is performed according to the flowchart shown in FIG. 16.

[0074] The CPU 23 first initializes the period number n (step S71 in FIG. 16). It then sorts the terms by their n-th value (step S73). In this case, the terms are sorted in descending order of their value for the 0th period (1/1 to 1/31).

[0075] Next, the CPU 23 extracts the terms whose value for the n-th period is "0" and sorts them by their value for the (n+1)-th period (step S75). That is, in this case, the terms whose value for the 0th period is "0" and whose value for the first period is not "0" are arranged in descending order of their value for the first period.

[0076] The CPU 23 determines whether all periods up to the final period have been examined (step S77). If the final period has not been reached, it increments the period number n (step S79) and repeats the processing from step S75. When all periods up to the final period have been examined, the processing ends.
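The sorting procedure of steps S71 to S79 can be sketched with a single sort key. This is an equivalent-in-effect reading of the flowchart rather than a step-by-step transcription: it assumes non-negative values and orders terms first by the earliest period in which they are non-zero, then by descending value in that period. The name `first_appearance_order` and the sample data are hypothetical.

```python
def first_appearance_order(vectors):
    """Order terms by first non-zero period, then by descending value there."""
    def key(item):
        _, vec = item
        for n, value in enumerate(vec):
            if value != 0:
                return (n, -value)
        return (len(vec), 0)  # terms that are zero in every period go last

    return [term for term, _ in sorted(vectors.items(), key=key)]


# Hypothetical per-period values for five terms over three periods.
sample = {
    "a": [0, 0, 4],
    "b": [3, 1, 0],
    "c": [0, 2, 9],
    "d": [5, 0, 0],
    "e": [0, 7, 0],
}
ordered = first_appearance_order(sample)
```

Terms that first appear in the same period end up adjacent in the list, which is what makes newly appearing terms easy to spot.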

[0077] Sorting in this way makes it easier to find terms that appear for the first time in the period of interest. For example, in FIG. 15, if the period of interest is the second period (3/1 to 3/31), it can be seen that the term "土産物・水産物卸" ("souvenir and marine products wholesale") does not appear in the preceding period, 2/1 to 2/28, and appears for the first time in the period of interest, 3/1 to 3/31.

[0078] A character string filter or a pattern filter may be used together with this term analysis. FIGS. 17 and 18 show the analysis results when such filtering conditions are set. In FIG. 17, the viewpoint profile "株" ("stock") is selected, and weights are set for its related keywords in the weighting dialog 63. Here, all weights are "1". In addition, the period of interest is set in the period setting dialog 67. In this case, the start date is 1997/1/1, the period length is three months, and the count is 4.

[0079] FIG. 18 shows the result of execution under these conditions. In the display area 77, the term extraction documents are displayed in date order. In the display area 72, the terms extracted from the term extraction documents are displayed. In this case, the character string "株" is additionally set as the character string filter and "monotonically increasing" is set as the pattern filter, so that, among the extracted terms, only those that contain the character string "株" and whose value increases monotonically from one three-month period to the next are displayed.
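The combination of a character string filter with a "monotonically increasing" pattern filter can be sketched as follows. Treating "monotonically increasing" as strictly increasing is an assumption, and the names `monotonically_increasing` and `trend_filter` and the sample values are hypothetical.

```python
def monotonically_increasing(vec):
    """True when each value is strictly greater than the one before it."""
    return all(a < b for a, b in zip(vec, vec[1:]))


def trend_filter(vectors, substring):
    """Keep terms that contain the substring and rise in every period."""
    return [term for term, vec in vectors.items()
            if substring in term and monotonically_increasing(vec)]


# Hypothetical values over four three-month periods.
sample = {"株価": [1, 2, 4, 7], "株主": [3, 3, 2, 1], "金利": [1, 2, 3, 4]}
rising = trend_filter(sample, "株")
```

The term "金利" rises in every period but is excluded by the string filter, and "株主" passes the string filter but not the pattern filter.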

[0080] Such trend analysis makes it possible to extract terms that appear in common in the documents of every period, terms that appear only in the documents of a specific period, and terms that newly emerge or disappear as time passes. For example, it is useful for extracting events that gradually increase or decrease.

[0081] To make the results of such trend analysis easier to grasp visually, they may be displayed as a graph (a line graph or the like), either for each term or for groups of terms.

[0082] 5. Other Embodiments

In the above embodiment, a pair of viewpoint profiles with opposite orientations, "boom" and "recession", was used as the plurality of viewpoint profiles. However, the invention is not limited to this, and the user can freely set any plurality of viewpoint profiles.

[0083] Although viewpoint profiles were given as the queries, several terms related to the user's interests may be given directly as keywords. Furthermore, when a natural-language sentence such as "I want to know about the relationship between stock prices and economic trends" is given, the sentence may be semantically analyzed to set keywords and generate a query. Constraints that restrict the query, such as a search time range or terms that must be included, may also be added.

[0084] In addition, because the document data subject to term extraction, and the number of such documents, differ from query to query, a correction based on statistical data may be applied. For example, the term appearance frequency may be divided by the number of documents containing the term or by the number of extracted documents.
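The correction by the number of extracted documents can be sketched as an element-wise division. The name `normalize` and the sample figures are hypothetical; dividing by the number of documents containing the term would use the same shape of computation with a different divisor.

```python
def normalize(frequencies, extracted_doc_counts):
    """Divide each query's term frequency by that query's document count."""
    return [freq / count if count else 0.0
            for freq, count in zip(frequencies, extracted_doc_counts)]


# Hypothetical: 30 occurrences in 100 documents versus 5 occurrences in 10.
corrected = normalize([30, 5], [100, 10])
```

After the correction, the second query shows the higher rate even though its raw frequency is lower, which is the distortion the correction is meant to remove.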

[0085] The numerical vector data may also be classified using conventional classification methods. For example, the following three methods can be adopted.

[0086] 1) Cluster analysis

When there is no classification criterion in advance, cluster analysis may be performed. After classification, the meaning of each group can also be inferred by extracting a typical example of the group. A typical example can be extracted by, for instance, taking the term closest to the center of the group.
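Picking the typical example of a group can be sketched as a nearest-to-centroid search. This assumes the group's center is the arithmetic mean of its vectors and that closeness is Euclidean distance; neither choice is stated in the text, and the names `centroid` and `typical_term` are hypothetical.

```python
import math


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def typical_term(cluster):
    """Return the term whose vector lies closest to the cluster centroid."""
    center = centroid(list(cluster.values()))
    return min(cluster, key=lambda term: math.dist(cluster[term], center))


# Hypothetical cluster of three term vectors.
cluster = {"a": [0, 0], "b": [1, 1], "c": [2, 2]}
representative = typical_term(cluster)
```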

[0087] Any conventionally used clustering method can be adopted; for example, non-hierarchical cluster analysis may be used as well as hierarchical cluster analysis.

[0088] 2) Classification

Several clusters are prepared in advance, and terms are classified by assigning each numerical vector to the nearest cluster.

[0089] 3) Simple classification

Terms whose largest value falls on the same vector element are grouped into one cluster. This generates as many classification clusters as there are queries, and each classification cluster collects the terms related to the corresponding query.
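The simple classification can be sketched as grouping by the position of each vector's largest element. Ties are resolved in favor of the earliest query, which is an assumption; the name `group_by_largest_element` and the sample data are hypothetical.

```python
def group_by_largest_element(vectors, query_names):
    """Group terms under the query where their vector value is largest."""
    clusters = {name: [] for name in query_names}
    for term, vec in vectors.items():
        # Index of the largest element; the first index wins on ties.
        k = max(range(len(vec)), key=vec.__getitem__)
        clusters[query_names[k]].append(term)
    return clusters


sample = {"growth": [4, 1], "layoff": [1, 6], "export": [2, 2]}
grouped = group_by_largest_element(sample, ["boom", "recession"])
```

One cluster is created per query even if it ends up empty, matching the statement that the number of clusters equals the number of queries.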

[0090] Such classification processing also makes it possible to discover new knowledge by extracting the characteristic terms of a given classification group and comparing them.

[0091] In the present embodiment, the documents to be extracted from are specified by the viewpoint profiles in which the user is interested, so relatively accurate term analysis is possible. In addition, compared with extracting from all documents, the total number of terms is smaller, classification takes less time, and analysis by the user becomes easier.

[0092] In the above embodiment, a pair of viewpoint profiles with opposite orientations was used as the plurality of viewpoint profiles. However, the invention is not limited to this, and viewpoint profiles that appear unrelated at first glance, such as "politics" and "birth rate", may also be used.

[0093] In the present embodiment, "boom" and "recession" were used as viewpoint profiles representing a state, but profiles representing a change, such as "increased production" or "increased profit", may also be used. Using profiles that represent such states or changes makes data mining easier. The viewpoint profiles are not limited to these and may be anything the operator wishes.

[0094] The processing of extracting terms from each document and vectorizing them, which is performed when determining the extracted documents, can be carried out in advance provided that the contents of all documents are fixed. It may therefore be executed when a new document is added, with the result stored.

[0095] In the present embodiment, important terms are extracted from each document in the extracted document determination processing. However, the terms are not limited to these; for example, all terms may be extracted rather than only important terms, or terms may be extracted according to some other extraction criterion.

[0096] In the present embodiment, the similarity between two numerical vectors is determined by computing their inner product, but the cosine of the two numerical vectors may be used as the similarity instead.
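The two similarity measures can be compared with a short sketch. The function names `inner_product` and `cosine` are hypothetical, and returning 0.0 for a zero vector is an assumed convention.

```python
import math


def inner_product(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Inner product divided by the product of the two vector lengths."""
    norm_u = math.sqrt(inner_product(u, u))
    norm_v = math.sqrt(inner_product(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for zero vectors
    return inner_product(u, v) / (norm_u * norm_v)


raw = inner_product([1, 2], [2, 4])
angle = cosine([1, 2], [2, 4])
```

The inner product grows with vector length, whereas the cosine depends only on direction, so it is less affected by differences in document length or term counts.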

[0097] Although the present embodiment has been described for Japanese documents, it can be applied in the same way to other languages, for example English, Chinese, and Korean.

[0098] In the present embodiment, the functions shown in FIG. 1 are realized in software using the CPU 23. However, some or all of them may be realized by hardware such as logic circuits.

[0099] As described above, the terms present in documents are converted into numerical vectors using scores for a plurality of viewpoint profiles in which the operator is interested. Compared with representing a term by a value that depends on its appearance frequency, this lets the operator easily grasp the relationship between the operator's own interests and each term. Terms converted into numerical vectors can also be analyzed using various statistical analysis techniques, and the analysis results become easier to interpret.

[Brief description of the drawings]

FIG. 1 is a functional block diagram of the term evaluation device 1 according to the present invention.

FIG. 2 is a diagram showing an example of the hardware configuration of the term evaluation device shown in FIG. 1.

FIG. 3 is a diagram showing an example of a hardware configuration in which the computer system 40 shown in FIG. 2 is realized using the CPU 23.

FIG. 4 is a diagram showing the correspondence between viewpoint profiles and related keywords.

FIG. 5 is an overall flowchart of the term evaluation processing.

FIG. 6 is a detailed flowchart of the extracted document determination processing.

FIG. 7 is a detailed flowchart of the statistic determination processing.

FIG. 8 is an example of a dialog for determining queries.

FIG. 9 is an example of the display of extracted terms.

FIG. 10 is an example of the display of extracted terms.

FIG. 11 is an example of the display of extracted terms.

FIG. 12 is an example of the display of extracted terms.

FIG. 13 shows the data structure of the dictionary filter.

FIG. 14 shows a setting dialog for trend analysis.

FIG. 15 shows the result of a trend analysis.

FIG. 16 is a detailed flowchart of the sorting processing.

FIG. 17 shows a setting dialog for trend analysis.

FIG. 18 shows the result of a trend analysis.

[Explanation of symbols]

23: CPU
27: Memory

Continued from the front page

(72) Inventor: Katsuyuki Doi (土居功志), c/o JustSystems Corp., 108-4 Hiraishi-Wakamatsu, Kawauchi-cho, Tokushima City, Tokushima Prefecture
(72) Inventor: Hiroyuki Mitsuya (三ツ矢浩之), c/o JustSystems Corp., 108-4 Hiraishi-Wakamatsu, Kawauchi-cho, Tokushima City, Tokushima Prefecture
F-terms (reference): 5B075 ND03 PP13 PQ02 PQ32 PQ46 PR06 QM08

Claims


1. A recording medium storing a program for causing a computer including an input device, a control device, and a storage device to function as a term evaluation device, wherein the program causes the computer to execute the following processing: A) when a plurality of search conditions are given, executing, for each search condition, a term extraction target document determination process and a term statistic calculation process, namely a1) a term extraction target document determination process of specifying, from a plurality of document data, one or more documents to be subjected to term extraction based on the given search condition, and a2) a term statistic calculation process of calculating statistics of the terms present in the specified term extraction target documents as the term statistics for that search condition; and B) using the plurality of term statistics, determining, as the evaluation value of each term, a numerical vector having the same number of dimensions as the number of given search conditions.

2. The recording medium storing the program according to claim 1, wherein the program extracts terms having predetermined characteristics based on the evaluation value of each term.

3. The recording medium storing the program according to claim 1, wherein, when a plurality of viewpoint profiles are given as the search conditions, the term extraction target documents are specified for each viewpoint profile using a plurality of related words associated with that viewpoint profile.

4. The recording medium storing the program according to claim 2, wherein the plurality of documents store their creation times, and, when one viewpoint profile and a plurality of extraction target periods are given, the plurality of search conditions are generated by using the one viewpoint profile for each extraction target period, and the term extraction target document determination process is executed.

5. The recording medium storing the program according to claim 2 or 4, wherein the extracted terms are terms for which only a specific vector element has a value at or above, or at or below, a predetermined threshold.

6. The recording medium storing the program according to claim 2 or 4, wherein the extracted terms are terms for which all of a plurality of specific vector elements have values at or above, or at or below, a predetermined threshold.

7. The recording medium storing the program according to claim 2 or 4, wherein the extracted terms are terms for which the value of a specific vector element is distinctly large or distinctly small compared with other terms.

8. The recording medium storing the program according to claim 2 or 4, wherein the plurality of search conditions are search conditions whose order is meaningful, and the extracted terms are terms for which the value of a certain vector element differs from the vector elements before and after it by more than a predetermined difference.

9. The recording medium storing the program according to claim 5, wherein predetermined character strings that are contained in terms and express the intention of the document creator are classified and stored by intention, and, when the operator selects one of the classifications, only terms containing the corresponding character strings are extracted.

10. A method for evaluating terms constituting document data, comprising: A) when a plurality of search conditions are given, executing, for each search condition, the following a1) term extraction target document determination process and a2) term statistic calculation process, namely a1) a term extraction target document determination process of specifying, from a plurality of document data, one or more documents to be subjected to term extraction based on the given search condition, and a2) a term statistic calculation process of calculating statistics of the terms present in the specified term extraction target documents as the term statistics for that search condition; and B) using the plurality of term statistics, determining the evaluation value of each term as a numerical vector having the same number of dimensions as the number of given search conditions.

11. The term evaluation method according to claim 10, wherein, when the operator designates a character string that attaches to a verb or an adjective and expresses the intention of the document creator, only terms containing the designated character string are extracted.

12. A term evaluation device comprising: term statistic calculating means for specifying, based on a given search condition, one or more documents to be subjected to term extraction from a plurality of document data, calculating statistics of the terms present in the specified term extraction target documents, and outputting them as the term statistics for that search condition; and evaluation value determining means for determining, when term statistics for a plurality of different search conditions are given from the term statistic calculating means, the evaluation value of each term as a numerical vector having the same number of dimensions as the number of given search conditions.