JP4592629B2

JP4592629B2 - Document search support method and apparatus, program, and computer-readable recording medium

Info

Publication number: JP4592629B2
Application number: JP2006089417A
Authority: JP
Inventors: 吉秀佐藤; 晴美川島
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2006-03-28
Filing date: 2006-03-28
Publication date: 2010-12-01
Anticipated expiration: 2026-03-28
Also published as: JP2007265034A

Description

The present invention relates to a document search support method and apparatus, a program, and a computer-readable recording medium. In particular, it relates to a document search support method and apparatus, a program, and a computer-readable recording medium that, to make document search efficient, use the search conditions to organize and display the documents matching the search conditions requested by a user.

In ordinary document search, documents that directly match the user's search request are presented. When the number of target documents is enormous, however, simple matching against the conditions does not narrow the result set enough, and the user cannot obtain documents efficiently.

For this reason, there are search methods in which multiple conditions, such as the document creation time and document genre information, are specified together with keywords (see, for example, Patent Document 1).

There is also a technique that supports search by attaching additional information to documents in advance, such as document genre information (for example, “economy” or “society”) and type information for phrases appearing in a document (for example, “International Atomic Energy Agency = organization” or “United Nations = organization”). After the documents are narrowed down by the keywords the user supplies, the documents are automatically organized using this additional information, and representative information called labels is presented.
Patent Document 1: JP 2005-339139 A

However, the existing techniques above presuppose that genre information, phrase type information, and the like have been attached to each document. This requires either manual annotation work or the preparation of a special dictionary that records phrase types.

The present invention has been made in view of the above points. Its object is to provide a document search support method and apparatus, a program, and a computer-readable recording medium that organize and present search results based on the search conditions specified by the user, without requiring any special preparation in advance.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (claim 1) is a document search support method for searching for documents that match search conditions requested by a user, in a document search support apparatus having document storage means storing the documents to be searched, document search means, centroid selection means, vector generation means, vector storage means, cluster generation means, and cluster storage means. The method performs:
a document search step (step 1) in which the document search means acquires, from the document storage means, documents that match the search conditions input by the user;
a centroid selection step (step 2) in which the centroid selection means, for each document acquired in the document search step, assigns a score to each occurrence of a word A contained in an input search condition, based on the distance between the positions at which word A and a word B, another word contained in the same search condition, appear in the document, identifies the position of the occurrence with the smallest score as the centroid (center-of-gravity) position, and carries out this processing for every word contained in the search condition;
a vector generation step (step 3) in which the vector generation means tallies, for each document acquired from the document storage means, the words that appear within a fixed range centered on the centroid position and their occurrence counts, obtains a vector for each document, and stores it in the vector storage means; and
a cluster generation step (step 4) in which the cluster generation means calculates the distances between the document vectors stored in the vector storage means, merges documents that are close to each other to form clusters, and stores the clusters in the cluster storage means.
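The cluster generation step (step 4) can be illustrated with a minimal sketch. The text here specifies neither the distance measure nor the merging rule, so the Euclidean distance, the single-link merging with a fixed `threshold`, and the names `distance` and `cluster_documents` are all assumptions made for illustration only.

```python
from math import sqrt


def distance(u: dict[str, int], v: dict[str, int]) -> float:
    """Euclidean distance between two sparse word-count vectors (assumed metric)."""
    keys = set(u) | set(v)
    return sqrt(sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys))


def cluster_documents(vectors: dict[str, dict[str, int]], threshold: float) -> list[list[str]]:
    """Merge documents whose vectors lie within `threshold` of each other.

    Single-link merging via union-find; the merging rule is an assumption,
    not something stated in the source text.
    """
    ids = sorted(vectors)
    parent = {doc_id: doc_id for doc_id in ids}

    def find(doc_id: str) -> str:
        while parent[doc_id] != doc_id:
            parent[doc_id] = parent[parent[doc_id]]  # path compression
            doc_id = parent[doc_id]
        return doc_id

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if distance(vectors[a], vectors[b]) <= threshold:
                parent[find(b)] = find(a)

    groups: dict[str, list[str]] = {}
    for doc_id in ids:
        groups.setdefault(find(doc_id), []).append(doc_id)
    return sorted(groups.values())


# Hypothetical vectors: documents 0001 and 0002 share most of their words.
vectors = {
    "0001": {"税": 2, "政府": 1},
    "0002": {"税": 2, "政府": 1, "国会": 1},
    "0003": {"選手": 3},
}
print(cluster_documents(vectors, threshold=1.5))
```

With these hypothetical vectors, documents 0001 and 0002 differ by a single word count and end up in one cluster, while 0003 shares no words with them and stays alone.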

The present invention (claim 2) further performs, in a document search support apparatus that additionally has main keyword acquisition means, a main keyword acquisition step (step 5) in which the main keyword acquisition means refers to the vectors in the vector storage means on the basis of the constituent documents of a cluster stored in the cluster storage means, obtains the sum of the weights for each word, and selects the top N words in descending order of that sum as main keywords.
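The main keyword acquisition step (step 5) amounts to summing per-word weights over a cluster's documents and keeping the N largest. The sketch below assumes each document vector is a word-to-weight mapping; the function name `main_keywords` and the sample data are hypothetical.

```python
from collections import Counter


def main_keywords(cluster_doc_ids: list[str], vectors: dict[str, dict[str, int]], n: int) -> list[str]:
    """Return the n words with the largest summed weight over the cluster's documents."""
    totals: Counter[str] = Counter()
    for doc_id in cluster_doc_ids:
        totals.update(vectors[doc_id])  # adds each word's weight to the running sum
    return [word for word, _ in totals.most_common(n)]


# Hypothetical vectors for a two-document cluster.
vectors = {
    "0001": {"税": 2, "政府": 1},
    "0002": {"税": 2, "政府": 1, "国会": 1},
}
print(main_keywords(["0001", "0002"], vectors, n=2))
```

How ties between equal sums are broken is not stated in the text; this sketch simply follows the ordering of `Counter.most_common`.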

The present invention (claim 3) further performs, in a document search support apparatus that additionally has representative information generation means, a representative information generation step (step 6) that identifies, for each constituent document of a cluster stored in the cluster storage means, the positions at which the main keywords appear, and acquires a fixed range centered on each such position as an important part of the document.

In the present invention (claim 4), in the centroid selection step, when the search condition input by the user can be divided into multiple words, the centroid selection means divides it into words W_1 to W_n. When word W_1 appears multiple times in the document, the means records in storage means, as the score of the first occurrence of W_1, the sum of the weight determined by the distance from the first occurrence of W_1 to the nearest W_2, the weight determined by the distance from the first occurrence of W_1 to the nearest W_3, ..., and the weight determined by the distance from the first occurrence of W_1 to the nearest W_n. Scores are calculated in the same way for the second, third, and subsequent occurrences of W_1, and the occurrence of W_1 with the smallest score is determined to be the centroid of word W_1 in the document. The centroid positions of the other words W_2 to W_n are determined in the same way, so that the centroid positions of the words W_1 to W_n making up the search condition are determined for each document.
When the search condition cannot be divided into multiple words, the occurrence of the search condition closest to the beginning of the document text is determined to be the centroid.

FIG. 2 is a diagram showing the principle configuration of the present invention.

The present invention (claim 5) is a document search support apparatus for searching for documents that match search conditions requested by a user, comprising:
document search means 306 that acquires documents matching the search conditions input by the user from document storage means 301 storing the documents to be searched;
centroid selection means 307 that, for each document acquired by the document search means 306, assigns a score to each occurrence of a word A contained in an input search condition, based on the distance between the positions at which word A and a word B, another word contained in the same search condition, appear in the document, identifies the position of the occurrence with the smallest score as the centroid position, and carries out this processing for every word contained in the search condition;
vector generation means 308 that tallies, for each document acquired from the document storage means 301, the words that appear within a fixed range centered on the centroid position and their occurrence counts, obtains a vector for each document, and stores it in vector storage means 309; and
cluster generation means 310 that calculates the distances between the document vectors stored in the vector storage means, merges documents that are close to each other to form clusters, and stores the clusters in cluster storage means 311.

The present invention (claim 6) further comprises main keyword acquisition means 312 that refers to the vectors in the vector storage means 309 on the basis of the constituent documents of a cluster stored in the cluster storage means 311, obtains the sum of the weights for each word, and selects the top N words in descending order of that sum as main keywords.

The present invention (claim 7) further comprises representative information generation means that identifies, for each constituent document of a cluster stored in the cluster storage means, the positions at which the main keywords appear, and acquires a fixed range centered on each such position as an important part of the document.

In the present invention (claim 8), the centroid selection means 307 includes means that, when the search condition input by the user can be divided into multiple words, divides it into words W_1 to W_n. When word W_1 appears multiple times in the document, this means records in storage means, as the score of the first occurrence of W_1, the sum of the weight determined by the distance from the first occurrence of W_1 to the nearest W_2, the weight determined by the distance from the first occurrence of W_1 to the nearest W_3, ..., and the weight determined by the distance from the first occurrence of W_1 to the nearest W_n. It calculates scores in the same way for the second, third, and subsequent occurrences of W_1, and determines the occurrence of W_1 with the smallest score to be the centroid of word W_1 in the document. It determines the centroid positions of the other words W_2 to W_n in the same way, so that the centroid positions of the words W_1 to W_n making up the search condition are determined for each document.
When the search condition cannot be divided into multiple words, the means determines the occurrence of the search condition closest to the beginning of the document text to be the centroid.

The present invention (claim 9) is a document search support program for causing a computer to function as each of the means constituting the document search support apparatus according to any one of claims 5 to 8.

The present invention (claim 10) is a computer-readable recording medium storing the document search support program according to claim 9.

According to the present invention, the distances between the words making up a search condition are used, and a fixed range centered on the appearance position is acquired as an important part of the document, which improves the accuracy of the document organization processing performed afterwards.

In addition, documents whose vectors are close to each other, that is, documents that share many of the same morphemes, are merged into the same cluster, so that multiple clusters of similar documents are obtained.

Furthermore, by determining main keywords, the words that represent each cluster, from the clusters, the words that strongly influenced the formation of each cluster can be selected.

As described above, when search results are displayed, similar documents are clustered using the words that appear around the search conditions in each document, and the information representing each cluster is likewise obtained from the text surrounding the search conditions. The search results can therefore be organized and presented in line with the user's search intent without supplying knowledge such as a special dictionary in advance.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 3 is a configuration diagram of a document search support apparatus according to an embodiment of the present invention.

The document search support apparatus shown in FIG. 3 comprises a document recording unit 301, a document analysis unit 302, a word recording unit 303, a search request statement generation unit 304, a search request statement recording unit 305, a document search unit 306, a centroid selection unit 307, a vector generation unit 308, a vector recording unit 309, a cluster generation unit 310, a cluster recording unit 311, a main keyword acquisition unit 312, a main keyword recording unit 313, a representative information generation unit 314, and a search result display unit 315.

The document recording unit 301 stores the documents to be searched in the present invention. FIG. 4 shows an example of the document data stored in the document recording unit 301. The document recording unit 301 records a document ID assigned uniquely to each document and the body text of each document.

The document analysis unit 302 performs, in advance, analysis processing that prepares for searching the documents recorded in the document recording unit 301 for documents that match the conditions. It acquires a pair consisting of a document ID and the document's body text from the document recording unit 301, analyzes the body text to obtain the words in it, and records them in the word recording unit 303 together with the appearance position of each word in the body text and the document ID. The analysis uses, for example, morphological analysis, a technique widely used in natural language processing, to divide the text into morphemes, the smallest constituent units of a sentence. Morphological analysis attaches part-of-speech information to each morpheme; the document analysis unit 302 of this embodiment acquires, from all the morphemes, only those whose part of speech is “noun”. In this embodiment, only the nouns acquired from the body text are hereinafter called words and handled in the subsequent processing, although words are not necessarily limited to nouns.

FIG. 5 is a flowchart of the processing performed by the document analysis unit in an embodiment of the present invention.

Step 101) The document analysis unit 302 acquires the document ID and body text of one document from the document recording unit 301.

Step 102) It performs morphological analysis on the acquired body text.

Step 103) From the result of the morphological analysis, it acquires the words (here, nouns).

Step 104) It records the group of words acquired from the document in the word recording unit 303, together with their appearance positions in the document and the document ID.

Step 105) The processing is repeated until all documents in the document recording unit 301 have been analyzed.

As a result of this processing, word data such as that shown in FIG. 6 is stored in the word recording unit 303. “START” and “END” in FIG. 6 give the start and end positions of a word in the body text, in bytes. The document analysis unit 302 assigns this position information after morphological analysis, by tracing the morphemes in order from the beginning of the text. For example, the document with document ID “0001” begins with “政府は１日、” (“On the 1st, the government ...”), so the word “政府” (government) appears at byte 0; given that one Japanese character occupies 2 bytes, “政府” occupies bytes 0 through 3. The second word to appear in document “0001”, “消費税” (consumption tax), occupies bytes 12 through 17, so “12” is stored in the word recording unit 303 as its “START” value and “17” as its “END” value. The same applies to the subsequent words and to the documents with ID “0002” onward.
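The START and END values in this example can be reproduced with a short sketch. It assumes the body text is held in a two-byte Japanese encoding; EUC-JP is used here purely as an assumption, since the text states only that one Japanese character occupies 2 bytes. The helper name `byte_span` is hypothetical.

```python
def byte_span(text: str, word: str, encoding: str = "euc_jp") -> tuple[int, int]:
    """Return the (START, END) byte offsets of the first occurrence of `word`.

    END is inclusive, matching the 0-3 and 12-17 values in the example.
    """
    index = text.find(word)
    if index < 0:
        raise ValueError(f"{word!r} does not occur in the text")
    start = len(text[:index].encode(encoding))
    end = start + len(word.encode(encoding)) - 1
    return start, end


# Opening of document "0001" as described in the text, followed by its second word.
head = "政府は１日、消費税"
print(byte_span(head, "政府"))    # (0, 3)
print(byte_span(head, "消費税"))  # (12, 17)
```

With a variable-width encoding such as UTF-8 the offsets would differ, so the positions in FIG. 6 are only meaningful relative to the encoding in which the documents are stored.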

Note that the document analysis unit 302 is not an essential component of the search support apparatus of the present invention. If data such as that shown in FIG. 6 has been created in advance and is available in an external storage device or the like, the document analysis unit 302 need not be included in the configuration of the search support apparatus.

The search request statement generation unit 304 generates, from the search conditions input by the user, a search request statement to be sent to the document search unit 306. For example, when the user inputs the two search conditions “内閣総理大臣” (Prime Minister) and “外交問題” (diplomatic problem), it generates the search request statement “q = (内閣 or 総理 or 大臣) and (外交 or 問題)”, in which 内閣 (cabinet), 総理 (prime), 大臣 (minister), 外交 (diplomacy), and 問題 (problem) are the words obtained from the two conditions, and records it in the search request statement recording unit 305.

The processing of the search request statement generation unit 304 is described in detail below.

FIG. 7 is a flowchart of the search request statement generation processing in an embodiment of the present invention.

Step 201) First, the search request statement generation unit 304 acquires the first condition, “内閣総理大臣” (Prime Minister).

Step 202) It sends this condition to the document analysis unit 302, which performs morphological analysis and acquires the words (here, nouns).

Step 203) The three acquired words, “内閣” (cabinet), “総理” (prime), and “大臣” (minister), are held temporarily in a buffer (not shown).

Step 204) It is determined whether all search conditions have been analyzed. If not, the process returns to step 201 and handles the second condition, “外交問題” (diplomatic problem); the two words “外交” (diplomacy) and “問題” (problem) are held in the buffer (not shown) as in step 203. When all search conditions have been analyzed, the process proceeds to step 205.

Step 205) The search request statement “q = (内閣 or 総理 or 大臣) and (外交 or 問題)” is generated. This statement requests a search for “documents that contain at least one of the words 内閣, 総理, and 大臣, and also contain at least one of the words 外交 and 問題”. When the user inputs only one search condition (for example, only “内閣総理大臣”), the search request statement is “q = (内閣 or 総理 or 大臣)”. When the search conditions are not divided into multiple words by morphological analysis (for example, “内閣” and “外交”), no division is performed and the search request statement is “q = (内閣) and (外交)”.
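Steps 201 to 205 can be sketched as follows. The morphological analysis itself is outside the scope of the sketch, so each search condition is assumed to arrive already segmented into its nouns; the function name `build_request` and the exact spacing of the output string are illustrative assumptions.

```python
def build_request(conditions: list[list[str]]) -> str:
    """Join the words of each condition with "or" and the conditions with "and".

    `conditions` holds one list of words per search condition, i.e. the buffer
    contents after step 204 (segmentation is assumed to be done elsewhere).
    """
    groups = ["(" + " or ".join(words) + ")" for words in conditions]
    return "q=" + " and ".join(groups)


print(build_request([["内閣", "総理", "大臣"], ["外交", "問題"]]))
# q=(内閣 or 総理 or 大臣) and (外交 or 問題)
print(build_request([["内閣"], ["外交"]]))
# q=(内閣) and (外交)
```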

The search request statement generated by the search request statement generation unit 304 through the above procedure is stored in the search request statement recording unit 305.

The document search unit 306 acquires the search request statement from the search request statement recording unit 305, queries the word recording unit 303 to find the documents that match the conditions, and sends the list of document IDs to the centroid selection unit 307.
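A minimal sketch of this lookup is given below. It assumes the word recording unit can be viewed as a mapping from document ID to the set of words recorded for that document, and that the search request statement is available as a list of word groups (one group per search condition); the name `search` and the sample index are hypothetical.

```python
def search(index: dict[str, set[str]], conditions: list[list[str]]) -> list[str]:
    """Return the IDs of documents that contain at least one word from every group."""
    return sorted(
        doc_id
        for doc_id, words in index.items()
        if all(any(word in words for word in group) for group in conditions)
    )


# Hypothetical contents of the word recording unit.
index = {
    "0001": {"内閣", "外交"},
    "0002": {"総理", "経済"},
    "0003": {"大臣", "問題"},
}
print(search(index, [["内閣", "総理", "大臣"], ["外交", "問題"]]))
# ['0001', '0003']
```

Document 0002 is excluded because it contains a word from the first group but none from the second.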

For each retrieved document, the centroid selection unit 307 selects the places (centroid positions) where the words of the search request statement appear clustered close to one another in the body text. A centroid position here is position information obtained for each search condition input by the user; in the example above, centroid positions are obtained for “内閣総理大臣” (Prime Minister) and for “外交問題” (diplomatic problem) respectively. Even when the words making up the search request statement, such as “外交” (diplomacy) and “問題” (problem), are not contiguous in the actual document, as in “外交の問題” (problem of diplomacy) or “外交に関わる問題” (problem relating to diplomacy), the centroid selection unit 307 treats them as a centroid position if they appear close together with only a short string such as “の” or “に関わる” between them. For this reason, a set of words that appear at separate positions may together be referred to as a single centroid position.

FIG. 8 is a diagram for explaining the method of determining centroid positions in an embodiment of the present invention.

The figure shows the body text of a single document. In the body text, the word “問題” (problem) appears three times, at 701, 704, and 705, and the word “外交” (diplomacy) appears twice, at 702 and 703. Assume that the search request statement “q = (内閣 or 総理 or 大臣) and (外交 or 問題)” has been given. As described above, a centroid position is determined for each search condition input by the user; here, the centroid position for “内閣総理大臣” (Prime Minister) is taken as already determined, and the procedure for determining the centroid position for “外交問題” (diplomatic problem) is explained with reference to FIG. 9.

FIG. 9 is a flowchart of the centroid position determination processing in an embodiment of the present invention.

Step 301) The centroid selection unit 307 acquires “外交” (diplomacy), one of the words making up the search condition.

Step 302) From the words making up the same search condition, “外交問題” (diplomatic problem), it acquires one word other than the word acquired in step 301, namely “問題” (problem).

Step 303) For each occurrence of “外交” in the body text, it examines the distance to “問題” and adds the distance to the nearest “問題” to a score held in a buffer (not shown). For example, if the “問題” nearest to the “外交” at 702 in FIG. 8 is the one at 701, and the distance between these two words is 30 bytes, then 30 is added to the score of the “外交” at 702 and stored in the buffer. Next, for the “外交” at 703, the nearest “問題” is the adjacent one at 704, so a distance of 0 is added to the score of the “外交” at 703 and stored in the buffer.

Step 304) Once the shortest distances to all words other than “外交” have been added, the process moves to step 305.

Step 305) Of the two occurrences of “外交”, the one with the smallest score stored in the buffer is chosen as the centroid position of “外交”. In other words, the place where the words of the search condition input by the user appear clustered close together in the body text is chosen as the centroid position. In the example of FIG. 8, comparing the scores of the “外交” at 702 and at 703 at the point where step 304 yields Yes, namely “30” and “0”, the “外交” at 703, which has the smallest score, is selected and determined to be the centroid position. If several occurrences share the same smallest score, all of them are determined to be centroid positions.

Step 306) Steps 301 to 305 are repeated in the same way for “問題”, so that centroid positions are determined for both “外交” and “問題”. Because the “外交” at 703 and the “問題” at 704 are adjacent, the score of the “問題” at 704 is smaller than the scores of the “問題” at 701 and 705, and the centroid position of “問題” is determined to be 704.

The explanation so far has used FIG. 8, but in practice the centroid selection unit 307 matches the search request statement against the information in the word recording unit 303 and searches for the shortest distances using the “START” and “END” position information recorded as shown in FIG. 6.

The centroid determination method of the centroid selection unit 307 described above improves the accuracy of the document organization performed afterwards: when a document uses the word “問題” (problem) in senses that differ from the user's search intent, as in “失業者問題” (unemployment problem) or “問題発言” (problematic remark), in addition to “外交問題” (diplomatic problem), the method correctly locates the string “外交問題”. Moreover, because the centroid position is determined from the distances between the words making up the search condition rather than by simply matching the search condition against the body text, the centroid is detected correctly even when other characters intervene, as in “外交の問題” (problem of diplomacy).

Note that when the search condition is not split into multiple words even by morphological analysis, as with "Cabinet" (内閣), the distance to other words cannot be measured as described above. In that case, the occurrence of "Cabinet" closest to the beginning of the text is given priority and taken as the centroid position.
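The centroid selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_centroids` is hypothetical, the weight of a distance is assumed to be the distance itself, and only start offsets are used (the embodiment uses both "START" and "END" positions). Every search word is assumed to occur at least once in the document.

```python
def select_centroids(positions):
    """Pick one centroid offset per search word in a single document.

    positions: dict mapping each search word to the list of start offsets
    at which it occurs (assumed non-empty for every word).
    """
    words = list(positions)
    if len(words) == 1:
        # Single-word condition: no distances can be measured, so take the
        # occurrence closest to the beginning of the text.
        only = words[0]
        return {only: min(positions[only])}

    centroids = {}
    for word in words:
        best_pos, best_score = None, None
        for pos in positions[word]:
            # Score of this occurrence: sum, over every other search word,
            # of the distance to that word's nearest occurrence.
            score = sum(
                min(abs(pos - other_pos) for other_pos in positions[other])
                for other in words
                if other != word
            )
            if best_score is None or score < best_score:
                best_pos, best_score = pos, score
        centroids[word] = best_pos  # occurrence with the smallest score
    return centroids


example = {"外交": [40, 300], "問題": [10, 44, 200]}
print(select_centroids(example))
```

In this example the occurrence of 問題 at offset 44 sits next to 外交 at offset 40, so both are chosen as centroids, while the isolated occurrences are rejected.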

When the centroid selection unit 307 finishes determining the centroid positions in one document, it sends the document ID and the centroid positions to the vector generation unit 308. For example, suppose that in the document with document ID "9876" the centroid positions start at "0" for "Cabinet" (内閣), "4" for 総理, and "8" for 大臣 (that is, 内閣総理大臣, "Prime Minister", appears contiguously from the beginning of the text), and likewise at "100" for "diplomacy" (外交) and "104" for "problem" (問題). In that case, ID "9876" and the centroid positions 0, 4, 8, 100, 104 are passed to the vector generation unit 308.

The vector generation unit 308 collates the acquired document ID and centroid positions with the contents of the word recording unit 303 and acquires the words that appear in the document within a fixed range before and after each centroid position (for example, 50 bytes before and after). In the ID "9876" example above, the ranges of 50 bytes before and after the centroid positions 0, 4, 8, 100, and 104 are -50 to 50, -46 to 54, -42 to 58, 50 to 150, and 54 to 154, respectively. All words in the document with ID "9876" that fall within any of these ranges are acquired. After negative values and overlaps are removed, the ranges to be acquired are "0 to 58 or 50 to 154", which together cover bytes 0 to 154. Then, for the document with ID "9876", the number of occurrences of each word within these ranges is counted and recorded, as the weight of that word, in the vector storage unit 309 together with the document ID.
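The window arithmetic and counting above can be illustrated with a short sketch. The names `merge_windows` and `build_vector` are hypothetical, and a word is assumed to be "within range" when its start offset lies inside a window; the source does not state how words straddling a boundary are treated.

```python
from collections import Counter


def merge_windows(centroids, radius=50):
    """Return the merged, non-negative ranges around each centroid offset."""
    spans = sorted((max(0, c - radius), c + radius) for c in centroids)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_vector(word_starts, centroids, radius=50):
    """Count words whose start offset falls inside any merged window.

    word_starts: list of (word, start_offset) pairs for one document.
    """
    windows = merge_windows(centroids, radius)
    return Counter(
        word
        for word, start in word_starts
        if any(lo <= start <= hi for lo, hi in windows)
    )


print(merge_windows([0, 4, 8, 100, 104]))  # ranges for the ID "9876" example
```

For the centroid positions 0, 4, 8, 100, 104 the five windows collapse into a single range from 0 to 154, so every word starting in that range contributes to the vector.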

The vector generation method is not limited to the above; other existing techniques may be used.

When the above is carried out for every document that the document search unit 306 sent to the centroid selection unit 307, a vector is obtained for each document as shown in FIG. 10. Each vector is a list of words and word weights (numbers of occurrences in this embodiment). With the processing described so far, when the user enters a search condition and runs a search, the centroid positions are determined and a vector is generated and stored for each document that matches the condition.

In this embodiment the vector generation unit 308 uses the number of occurrences of each word as its weight, but the weight of each word may instead be calculated by, for example, TF-IDF (vector space method), which is widely used in the field of natural language processing.
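As an illustration of the TF-IDF alternative, the sketch below reweights count vectors. The source does not say which TF-IDF variant is intended; the formula assumed here is the raw count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the word. The function name `tfidf` is hypothetical.

```python
import math


def tfidf(vectors):
    """Convert count vectors into TF-IDF vectors.

    vectors: dict mapping document ID to a dict of word -> occurrence count.
    Assumed variant: weight = count * log(N / document_frequency).
    """
    n_docs = len(vectors)
    doc_freq = {}
    for vec in vectors.values():
        for word in vec:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {
        doc_id: {
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in vec.items()
        }
        for doc_id, vec in vectors.items()
    }


counts = {"9875": {"国連": 3, "イラク": 4}, "9876": {"イラク": 1}}
print(tfidf(counts))
```

Under this variant a word that occurs in every document receives weight zero, so it no longer influences the distance between vectors.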

The cluster generation unit 310 acquires all document vectors from the vector storage unit 309 and performs clustering processing that divides them into groups of mutually similar vectors.

Several existing techniques are available for clustering. An example using the clustering technique called the longest distance method is described here, but other clustering techniques may be used.

Clustering by the longest distance method is performed by the following procedure.

First, each document to be clustered is regarded as a cluster, and the distances between clusters are calculated. The distance between two clusters is the distance between the two farthest-apart documents, one from each cluster, and the Euclidean distance between the vectors acquired from the vector storage unit 309 is regarded as the distance between documents. At this point, however, every cluster consists of a single document, so obtaining the distance between clusters is equivalent to obtaining the distance between documents.

Next, the two closest clusters are merged to form a new cluster, and the distances between the new cluster and the other existing clusters are recalculated by the method described above and updated.

Thereafter, this merging process is repeated to grow the clusters. If, however, the distance between the two clusters to be merged (the two closest clusters) exceeds a separately defined constant, they are not merged and the clustering process ends.

As a result of the above processing, documents whose vectors are close to each other, that is, documents that share many morphemes, are merged into the same cluster, and a plurality of clusters of similar documents is obtained.
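The procedure above can be sketched as follows. The longest distance method is also known as complete-linkage clustering. This is a deliberately naive illustration that recomputes every cluster distance on each pass; the names `euclidean` and `complete_linkage` and the `threshold` parameter (standing in for the separately defined constant) are assumptions, not names from the source.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two sparse word-weight vectors (dicts)."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys))


def complete_linkage(vectors, threshold):
    """Cluster documents by the longest distance method.

    vectors: dict mapping document ID to a dict of word -> weight.
    Returns a list of clusters, each a list of document IDs.
    """
    clusters = [[doc_id] for doc_id in vectors]  # one cluster per document

    def cluster_distance(a, b):
        # Distance between clusters = distance between their farthest pair.
        return max(euclidean(vectors[x], vectors[y]) for x in a for y in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > threshold:
            break  # closest pair is too far apart: stop without merging
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


docs = {"a": {"x": 1}, "b": {"x": 1, "y": 1}, "c": {"z": 10}}
print(complete_linkage(docs, threshold=2))
```

With a threshold of 2, documents "a" and "b" (distance 1) are merged, while "c" stays alone because its distance to the merged cluster is about 10.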

After performing the clustering process, the cluster generation unit 310 records the result in the cluster recording unit 311 as shown in FIG. 11. Each generated cluster is given a unique cluster ID (C001, C002, ...), and the document IDs of the documents belonging to each cluster are managed under it. Clusters that were never merged with any other cluster during cluster generation are excluded from the record.
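A small sketch of the recording step follows. The function name `record_clusters` is hypothetical, and the choice to number clusters after the single-document clusters have been dropped is an assumption; the source only states that IDs are unique and that unmerged clusters are not recorded.

```python
def record_clusters(clusters):
    """Drop single-document clusters and assign IDs C001, C002, ...

    clusters: list of clusters, each a list of document IDs.
    Returns a dict mapping cluster ID to its list of document IDs.
    """
    kept = [docs for docs in clusters if len(docs) > 1]
    return {f"C{n:03d}": docs for n, docs in enumerate(kept, start=1)}


print(record_clusters([["9875", "9877"], ["9000"], ["9898", "9900", "9901"]]))
```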

The main keyword acquisition unit 312 acquires from the cluster recording unit 311 the list of document IDs in each cluster, one cluster at a time, extracts from the documents in that cluster the main keywords that represent it, and records them in the main keyword recording unit 313.

A main keyword is a word that represents a cluster, so a word that expresses the content of the documents in that cluster must be selected. In this embodiment, the following technique is used to select the words that strongly influenced the formation of each cluster in the clustering process performed by the cluster generation unit 310.

The main keyword acquisition unit 312 acquires main keywords cluster by cluster; the acquisition process for one cluster is described with reference to FIG. 12.

FIG. 12 is a flowchart of the main keyword acquisition process in one embodiment of the present invention.

Step 401) The main keyword acquisition unit 312 acquires the document IDs of one cluster from the cluster recording unit 311. In this step, for example, the four IDs "9875", "9877", "9898", and "9900", which are the document IDs of the documents belonging to cluster C123, are acquired.

Step 402) For the first document, "9875", the corresponding vector is acquired from the vector storage unit 309.

Step 403) The weights of the words are accumulated. In the example shown in FIG. 10, the vector of document ID "9875" is "United Nations (国連) = 3, Iraq (イラク) = 4, today (本日) = 1, resolution (決議) = 2, statement (表明) = 1, ...", so the weight of each of these words is added, word by word, into a buffer (not shown) in the main keyword acquisition unit 312 and held there. When the first document is processed, however, the buffer is empty, so the weight of each word is held as it is.

Step 404) Steps 402 and 403 are repeated until all the document IDs of the one cluster acquired in step 401 have been processed, yielding the total weight of each word that appears in the vectors of the documents in that cluster.

Step 405) The top N words are selected in descending order of total weight and recorded in the main keyword recording unit 313 together with the cluster ID, C123.

FIG. 13 shows an example of the data recorded in the main keyword recording unit 313 after steps 401 to 405 of FIG. 12 have been repeated for every cluster. For cluster C123, main keywords such as "Iraq" (イラク), "Self-Defense Forces" (自衛隊), and "dispatch" (派遣) have been selected. In this example N = 3, so three main keywords are acquired from each cluster. In clustering, words with large weights strongly influence cluster formation; by summing the weight of each word over all documents in the cluster, words that both have a large weight and appear in many documents are selected.
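Steps 401 to 405 amount to summing the word weights over the documents of a cluster and keeping the N largest totals, which can be sketched as below. The function name `main_keywords` and the sample vectors are illustrative assumptions; only the first vector's values come from the FIG. 10 example quoted in step 403.

```python
from collections import Counter


def main_keywords(doc_ids, vectors, n=3):
    """Return the top-n words by total weight over one cluster.

    doc_ids: document IDs belonging to the cluster (step 401).
    vectors: dict mapping document ID to a dict of word -> weight.
    """
    totals = Counter()  # plays the role of the buffer in step 403
    for doc_id in doc_ids:  # steps 402 to 404
        totals.update(vectors[doc_id])  # adds weights word by word
    return [word for word, _ in totals.most_common(n)]  # step 405


vectors = {
    "9875": {"国連": 3, "イラク": 4, "本日": 1, "決議": 2},
    "9877": {"イラク": 2, "自衛隊": 5, "派遣": 3},  # hypothetical values
}
print(main_keywords(["9875", "9877"], vectors, n=2))
```

How ties between equal totals are broken is not specified in the source; `Counter.most_common` keeps the order in which tied words were first seen.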

The representative information generation unit 314 acquires the main keywords of each cluster from the main keyword recording unit 313, generates representative information for each cluster, and sends it to the search result display unit 315.

Representative information is presented to the user so that the user can grasp the outline of the documents in a cluster, so the main keywords themselves may serve as the representative information. This embodiment, however, describes a method of extracting and presenting a part of each document in the cluster (hereinafter called an important part).

Important parts are extracted in the same way as in the processing in which the centroid selection unit 307 determines the centroid positions of each document and the vector generation unit 308 acquires the words within a fixed range around each centroid. Specifically, the three main keywords of one cluster are acquired from the main keyword recording unit 313, the document IDs in that cluster are acquired from the cluster recording unit 311, the positions at which each main keyword appears in each document are looked up in the word recording unit 303, and the centroid position of each main keyword is determined by the centroid position determination procedure described above. Not every main keyword appears in every document, however, so when a main keyword does not occur in a document, no centroid position is determined for it. Then the character strings within a fixed range before and after the centroid position of each main keyword are acquired from the document recording unit 301 and taken as the important parts of the document.
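The final extraction step can be sketched as a simple slice around each keyword's centroid. This sketch assumes the centroid positions have already been determined, uses character offsets rather than the byte offsets of the embodiment, and represents a keyword that does not occur in the document with `None`; the function name `important_passages` and the `radius` parameter are hypothetical.

```python
def important_passages(text, keyword_centroids, radius=20):
    """Extract the text surrounding each main keyword's centroid.

    text: body text of one document.
    keyword_centroids: dict mapping main keyword to its centroid offset,
        or None when the keyword does not occur in this document.
    """
    passages = []
    for keyword, pos in keyword_centroids.items():
        if pos is None:
            continue  # keyword absent: no centroid, so no passage
        start = max(0, pos - radius)  # clip at the beginning of the text
        passages.append(text[start:pos + radius])
    return passages


sample = "abcdefghijklmnopqrstuvwxyz"
print(important_passages(sample, {"k": 10, "z": None, "a": 0}, radius=3))
```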

Performing the above processing for each document in the cluster, and then for the other clusters, yields the representative information of every cluster. The acquired representative information is sent to the search result display unit 315 and displayed on its screen.

As in the example screen of the search result display unit 315 shown in FIG. 14, excerpts of the important parts of the documents in each cluster are displayed, cluster by cluster, as representative information. Underlining indicates a main keyword. When the user selects one of the important parts, the body text of the corresponding document is displayed.

In this embodiment, the document analysis unit 302 analyzes the document data stored in the document recording unit 301 so that the user can search it, but this configuration is not essential to the invention; a set of documents obtained by another existing search method may be used as the input to the present invention.

Also, although representative information is generated in subsequent processing from the clusters generated by the cluster generation unit 310, generating representative information is not essential to the present invention, and the generated clusters may be used for other purposes.

Also, although an example was shown in which the representative information generated by the representative information generation unit 314 is displayed on the search result display unit 315, the representative information may instead be stored in a storage device, for example, and used for other purposes.

The functions of the document search support apparatus shown in FIG. 3 can be built as a program, which can be installed and executed on a computer used as the document search support apparatus or distributed over a network.

The program so built can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and installed on a computer or distributed when the present invention is implemented.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to apparatuses that support document search, such as an apparatus that clusters a document set into groups of similar documents, an apparatus that determines the main words of such clusters, and an apparatus that determines the important parts of a document.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram of the principle configuration of the present invention.
FIG. 3 is a configuration diagram of the document search support apparatus in one embodiment of the present invention.
FIG. 4 is an example of document data in the document recording unit in one embodiment of the present invention.
FIG. 5 is a flowchart of the processing performed by the document analysis unit in one embodiment of the present invention.
FIG. 6 is an example of word data in the word recording unit in one embodiment of the present invention.
FIG. 7 is a flowchart of the search request sentence generation process in one embodiment of the present invention.
FIG. 8 is a diagram for explaining the method of determining the centroid position in one embodiment of the present invention.
FIG. 9 is a flowchart of the centroid position determination process in one embodiment of the present invention.
FIG. 10 is an example of vectors recorded in the vector storage unit in one embodiment of the present invention.
FIG. 11 is an example of data recorded in the cluster recording unit in one embodiment of the present invention.
FIG. 12 is a flowchart of the main keyword acquisition process in one embodiment of the present invention.
FIG. 13 is an example of data recorded in the main keyword recording unit in one embodiment of the present invention.
FIG. 14 is a display example of the screen of the search result display unit in one embodiment of the present invention.

Explanation of symbols

301 Document storage means, document recording unit
302 Document analysis unit
303 Word recording unit
304 Search request sentence generation unit
305 Search request sentence recording unit
306 Document search means, document search unit
307 Centroid selection means, centroid selection unit
308 Vector generation means, vector generation unit
309 Vector storage means, vector storage unit
310 Cluster generation means, cluster generation unit
311 Cluster storage means, cluster recording unit
312 Main keyword acquisition means, main keyword acquisition unit
313 Main keyword recording unit
314 Representative information generation means, representative information generation unit
315 Search result display unit

Claims

A document search support method for searching for documents that match a search condition requested by a user, in a document search support apparatus having document storage means that stores documents to be searched, document search means, centroid selection means, vector generation means, vector storage means, cluster generation means, and cluster storage means, the method comprising:
A document search step in which the document search means acquires from the document storage means the documents that match the search condition input by the user;
A centroid selection step in which the centroid selection means, for each document acquired in the document search step, assigns a score to each occurrence in the document of a word A included in the input search condition, based on the distance between the positions at which the word A and a word B, being a word other than the word A included in the same search condition, appear in the document, and specifies the position in the document of the occurrence with the smallest score as a centroid position, this processing being performed for all words included in the search condition;
A vector generation step in which the vector generation means obtains a vector for each document by tallying, from each document acquired from the document storage means, the words that appear within a fixed range centered on the centroid position and the number of occurrences of each such word, and stores the vector in the vector storage means; and
A cluster generation step in which the cluster generation means calculates the distances between the vectors of the documents stored in the vector storage means, forms clusters by merging documents that are close to one another, and stores the clusters in the cluster storage means.

The document search support method according to claim 1, wherein the document search support apparatus further has main keyword acquisition means, and the method further comprises:
A main keyword acquisition step in which the main keyword acquisition means, for the documents constituting each cluster stored in the cluster storage means, refers to the vectors in the vector storage means to obtain the total weight of each word, and selects the top N words in descending order of total weight as main keywords.

The document search support method according to claim 2, wherein the document search support apparatus further has representative information generation means, and the method further comprises:
A representative information generation step in which the representative information generation means identifies, for each of the documents constituting a cluster stored in the cluster storage means, the position at which a main keyword appears, and acquires a fixed range centered on that position as an important part of the document.

The document search support method according to claim 1, wherein, in the centroid selection step:
When the search condition input by the user can be divided into a plurality of words, the centroid selection means divides the search condition into words W_1 to W_n and, when the word W_1 appears multiple times in the document, records in the storage means, as the score of the first occurrence of W_1, the sum of the weight determined by the distance between the first occurrence position of W_1 and the W_2 located closest to it, the weight determined by the distance between the first occurrence position of W_1 and the W_3 located closest to it, ..., and the weight determined by the distance between the first occurrence position of W_1 and the W_n located closest to it; likewise calculates scores for the second, third, ... occurrence positions of W_1; determines the occurrence position of W_1 with the smallest score as the centroid of W_1 in the document; and likewise determines the centroid positions of the other words W_2 to W_n, thereby determining, for each document, the centroid positions of the words W_1 to W_n constituting the search condition; and
When the search condition cannot be divided into a plurality of words, the centroid selection means determines the occurrence position of the search condition closest to the beginning of the text in the document as the centroid.

A document search support apparatus for searching for documents that match a search condition requested by a user, comprising:
Document search means for acquiring from document storage means the documents that match the search condition input by the user;
Centroid selection means for assigning, for each document acquired by the document search means, a score to each occurrence in the document of a word A included in the input search condition, based on the distance between the positions at which the word A and a word B, being a word other than the word A included in the same search condition, appear in the document, and for specifying the position in the document of the occurrence with the smallest score as a centroid position, this processing being performed for all words included in the search condition;
Vector generation means for obtaining a vector for each document by tallying, from each document acquired from the document storage means, the words that appear within a fixed range centered on the centroid position and the number of occurrences of each such word, and for storing the vector in vector storage means; and
Cluster generation means for calculating the distances between the vectors of the documents stored in the vector storage means, forming clusters by merging documents that are close to one another, and storing the clusters in cluster storage means.

The document search support apparatus according to claim 5, further comprising:
Main keyword acquisition means for referring, for the documents constituting each cluster stored in the cluster storage means, to the vectors in the vector storage means to obtain the total weight of each word, and for selecting the top N words in descending order of total weight as main keywords.

The document search support apparatus according to claim 6, further comprising:
Representative information generation means for identifying, for each of the documents constituting a cluster stored in the cluster storage means, the position at which a main keyword appears, and for acquiring a fixed range centered on that position as an important part of the document.

The document search support apparatus according to claim 5, wherein the centroid selection means includes:
Means for dividing the search condition input by the user, when it can be divided into a plurality of words, into words W_1 to W_n and, when the word W_1 appears multiple times in the document, recording in the storage means, as the score of the first occurrence of W_1, the sum of the weight determined by the distance between the first occurrence position of W_1 and the W_2 located closest to it, the weight determined by the distance between the first occurrence position of W_1 and the W_3 located closest to it, ..., and the weight determined by the distance between the first occurrence position of W_1 and the W_n located closest to it; likewise calculating scores for the second, third, ... occurrence positions of W_1; determining the occurrence position of W_1 with the smallest score as the centroid of W_1 in the document; and likewise determining the centroid positions of the other words W_2 to W_n, thereby determining, for each document, the centroid positions of the words W_1 to W_n constituting the search condition; and
Means for determining, when the search condition cannot be divided into a plurality of words, the occurrence position of the search condition closest to the beginning of the text in the document as the centroid.

A document search support program for causing a computer to function as each means constituting the document search support apparatus according to claim 5.

A computer-readable recording medium storing the document search support program according to claim 9.