JP2003050807A

JP2003050807A - Method for extracting important term/phrase/sentence

Info

Publication number: JP2003050807A
Application number: JP2002158163A
Authority: JP
Inventors: Takahiko Kawatani; 隆彦川谷
Original assignee: Hewlett Packard Co
Current assignee: HP Inc
Priority date: 2001-05-30
Filing date: 2002-05-30
Publication date: 2003-02-21
Anticipated expiration: 2022-05-30
Also published as: JP4349480B2

Abstract

PROBLEM TO BE SOLVED: In conventional methods for automatically extracting important words and phrases from a document, it is not always clear how well the extracted words and phrases reflect the central concept of the document. As a result, words and phrases only weakly related to the concept of the document may be treated as important, and words and phrases that are merely frequent may be extracted as important. SOLUTION: In this method for extracting important words and phrases, an input document is divided into document segments of an appropriate unit, and a sum-of-squares matrix is generated from the document segments. The importance of a term or phrase under consideration is determined on the basis of the eigenvectors and eigenvalues of this matrix. Important terms, phrases, and sentences related to the central concept of the document are thereby selected.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a method for automatically extracting important words, phrases, and sentences from a document, and in particular aims to improve extraction performance by using a more sophisticated document representation and by introducing a new measure of the importance of words and phrases.

[0002]

[Prior Art] Methods for automatically extracting important words and phrases from a document have long been studied in the fields of document retrieval and information retrieval. They fall broadly into heuristic methods and statistical methods. Known heuristic methods use the heading information of a document, position information within the document, or cue expressions. The method using heading information rests on the idea that "the title and headings of a document concisely express its content and contain important terms": the terms in the title and headings, excluding obviously unimportant ones such as articles and prepositions, are taken as important words. This method presupposes that titles and headings exist and cannot be applied to documents without them. The method using position information exploits the fact that in newspaper articles and similar texts important sentences are deliberately placed near the beginning, and extracts important words from the sentences near the beginning of the article. It is applicable only when the location of the important part of a document is known in advance, as with newspaper articles. The method using cue expressions assumes that sentences beginning with particular phrases such as "As a result" are important; it finds such phrases by natural language analysis and restricts the extraction of important words to the sentences containing them. It cannot be applied when the presupposed cue expressions are absent.

[0003] The statistical method known longest is to treat terms that appear frequently in the target document as important words. This method uses the frequency of occurrence within the document (tf) as the measure of importance. However, terms that appear frequently in a single document are not always important. The tf-idf model addresses this problem. It is based on the idea that "a term appearing in many documents is of low importance, and the fewer the documents in which a term appears, the higher its importance": for each term, the number of documents containing it (df) is counted in the corpus that includes the target document, its reciprocal (idf) is taken as the importance of the term within the corpus, and the product tf-idf with tf, the in-document importance, is used as the term importance. Although this model is well known, it is defined as the product of the in-corpus importance and the in-document importance, so the problem of how to determine the in-document importance accurately remains.
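
As a rough illustration of the two statistical measures described above, the following Python sketch computes tf and tf-idf for a toy corpus. It takes idf as the plain reciprocal of df, as in the description; many practical systems use a logarithmic variant instead. The function names and the toy corpus are illustrative assumptions and do not come from the source.

```python
from collections import Counter

def term_frequencies(tokens):
    """tf: number of times each term occurs in one document."""
    return Counter(tokens)

def document_frequencies(corpus):
    """df: number of documents in the corpus that contain each term."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    return df

def tf_idf(tokens, corpus):
    """tf-idf with idf taken as 1/df (the reciprocal described in the text).

    The target document is assumed to be part of the corpus, so df >= 1.
    """
    tf = term_frequencies(tokens)
    df = document_frequencies(corpus)
    return {term: count / df[term] for term, count in tf.items()}

# Hypothetical toy corpus: each document is a list of terms.
corpus = [
    ["the", "matrix", "has", "the", "eigenvalue"],
    ["the", "document", "has", "a", "title"],
    ["the", "eigenvalue", "problem"],
]
scores = tf_idf(corpus[0], corpus)
```

In this toy corpus "the" occurs twice in the first document but appears in all three documents, so its score falls below that of "matrix", which occurs once and only in the first document.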

[0004]

[Problem to Be Solved by the Invention] As described above, when a single document is given, the key question is how to determine the in-document importance of each term. The premise is that the in-document importance is computed using only the information contained in the given document. The in-corpus term importance above is a quantity related to the probability that each term appears in a document, and thus to the amount of information. The in-document importance, by contrast, is determined within a single document, so it should be a measure of how well a term represents the content, that is, the concept, of the document. Accordingly, when extracting important words and phrases from a document, words and phrases expressing concepts close to the concept the document centrally expresses should be extracted preferentially. This requires extracting the central concept of the document and grasping the relationship between each word or phrase and that central concept. In conventional methods, however, it was not always clear how well the extracted words and phrases reflected the central concept of the document. As a result, words and phrases only weakly related to the concept of the document were regarded as important, or words and phrases that were merely frequent were extracted as important.

[0005]

[Means for Solving the Problem] To solve the above problems, the method of extracting important words and phrases according to the present invention detects the terms appearing in an input document; divides the input document into document segments of an appropriate unit; generates, for each document segment, a document segment vector whose components are the frequencies of the terms appearing in that segment; calculates the eigenvectors and eigenvalues of the sum-of-squares matrix of the document segment vectors; selects from all the eigenvectors a certain number of eigenvectors to be used for determining important words and phrases; projects onto each selected eigenvector either a term vector, in which the component corresponding to the term of interest is 1 and all others are 0, or a phrase vector, in which the components corresponding to the terms appearing in the phrase of interest are 1 and all others are 0; computes the product of the square of the projection and the corresponding eigenvalue; and determines the importance of the term or phrase of interest on the basis of this value.

[0006] A document segment vector is a vector whose components are values determined from the number of times, that is, the frequency, with which each term appears in the document segment, and it represents the concept of that segment. The most natural unit for a document segment is the sentence. By obtaining the eigenvectors and eigenvalues of the sum-of-squares matrix computed for the set of document segment vectors, the set is expanded in terms of mutually orthogonal eigenvectors and their eigenvalues. Because an eigenvector is a vector expressed as a combination of terms, it carries a concept of its own. Since the eigenvectors are determined uniquely by the document, the concept an eigenvector represents may be called an eigenconcept. An eigenvalue can be regarded as the strength, or energy, of the concept represented by its eigenvector. Accordingly, the eigenvectors corresponding to large eigenvalues, that is, the low-order eigenvectors, can be regarded as representing the central concepts of the document.

[0007] The value obtained by projecting a term vector or phrase vector onto an eigenvector is the component that the term vector or phrase vector has in the direction of the eigenconcept corresponding to that eigenvector, and the square of the projection represents the energy of that component. Here, the product of the squared projection and the corresponding eigenvalue is taken as the importance of the term vector or phrase vector of interest with respect to the eigenconcept of interest, and important terms and phrases are selected on this basis. Terms and phrases related to the concepts the document contains are therefore selected.
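
In symbols, and borrowing the notation that the embodiment introduces later (Φ_k and λ_k for the k-th eigenvector and eigenvalue, v for a term or phrase vector), the quantities named in this paragraph can be read as follows. This is a restatement under those notational assumptions, with Φ_k taken to have unit norm; it is not a formula quoted from the source.

```latex
\text{projection: } \Phi_k^{T} v, \qquad
\text{energy: } \left(\Phi_k^{T} v\right)^{2}, \qquad
\text{importance for the $k$-th eigenconcept: } \lambda_k \left(\Phi_k^{T} v\right)^{2}
```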

[0008]

[Embodiments] FIG. 1 shows a first embodiment of the present invention, which extracts important terms. The method of the invention can be carried out by running a program incorporating the invention on a general-purpose computer. FIG. 1 is a flowchart of the computer while running such a program. Block 11 is a term detection unit, block 12 is a morphological analysis unit, and block 13 is a document segmentation unit. Block 14 is a document segment vector creation unit, block 15 is a sum-of-squares matrix calculation unit, block 16 is an eigenvalue/eigenvector calculation unit, block 17 is a principal eigenvector selection unit, and block 18 is an importance calculation unit. Block 19 is an important term output unit. The embodiment is described below using an English document as an example.

[0009] For the input document, the term detection unit 11 first detects words and symbol sequences such as numerals. Words and symbol sequences are referred to collectively here as terms. In English, the convention of writing words separately is established, so detecting words is easy. Next, the morphological analysis unit 12 performs morphological analysis such as part-of-speech tagging of the terms. The document is then divided into document segments. The most basic unit for a document segment is the sentence. In English, a sentence ends with a period followed by a space, so sentences can be cut out easily. Other ways of dividing a document into segments include splitting a complex sentence into its main clause and subordinate clauses, grouping several sentences into one document segment so that the segments contain roughly the same number of terms, and dividing the document from its beginning into segments containing the same number of terms regardless of sentence boundaries.
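
The term detection and segmentation steps above can be sketched in Python as follows. This is a minimal illustration, assuming plain English text: the regular expressions, the lower-casing, and the function names are assumptions of this sketch, and the naive period rule will mis-split abbreviations such as "Dr. Smith".

```python
import re

def detect_terms(text):
    """Return words and numeral sequences, lower-cased."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?", text.lower())

def split_sentences(text):
    """Split after a period that is followed by whitespace."""
    parts = re.split(r"(?<=\.)\s+", text.strip())
    return [p for p in parts if p]

def segment_by_term_count(terms, size):
    """Alternative segmentation: a fixed number of terms per segment,
    ignoring sentence boundaries."""
    return [terms[i:i + size] for i in range(0, len(terms), size)]

text = ("The method builds a matrix. The matrix has 3 eigenvalues. "
        "Each eigenvalue measures energy.")
sentences = split_sentences(text)
segments = [detect_terms(s) for s in sentences]   # one term list per sentence
```

Here each sentence becomes one document segment; `segment_by_term_count` shows the last alternative mentioned in the paragraph, in which segments are cut every fixed number of terms.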

[0010] The document segment vector creation unit 14 first determines, from the terms appearing in the whole document, the number of dimensions of the vectors to be created and the correspondence between dimensions and terms. It is not necessary to assign a vector component to every kind of term that appears; using the result of part-of-speech tagging, the vectors may be created from, for example, only the terms judged to be nouns or verbs. Next, the kinds of terms appearing in each document segment and their frequencies are obtained, each frequency is multiplied by a weight to determine the value of the corresponding component, and the document segment vector is created. Conventional techniques can be used to assign the weights.
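
A minimal Python sketch of this step is given below. The `keep` argument stands in for the set of terms retained after part-of-speech tagging and `weights` for whatever term weighting is chosen; both names, and the default weight of 1, are assumptions of this sketch.

```python
from collections import Counter

def build_vocabulary(segments, keep=None):
    """Map each retained term to a vector dimension, in order of first appearance."""
    vocab = {}
    for seg in segments:
        for term in seg:
            if (keep is None or term in keep) and term not in vocab:
                vocab[term] = len(vocab)
    return vocab

def segment_vector(segment, vocab, weights=None):
    """Component n = (frequency of term n in the segment) * (weight of term n)."""
    counts = Counter(t for t in segment if t in vocab)
    vec = [0.0] * len(vocab)
    for term, count in counts.items():
        w = 1.0 if weights is None else weights.get(term, 1.0)
        vec[vocab[term]] = count * w
    return vec

# Hypothetical segments, already reduced to lists of terms.
segments = [
    ["matrix", "eigenvalue", "matrix"],
    ["document", "eigenvalue"],
]
vocab = build_vocabulary(segments)
vectors = [segment_vector(s, vocab) for s in segments]
```

With unit weights the components are plain term counts, so the first segment maps to (2, 1, 0) and the second to (0, 1, 1).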

[0011] The sum-of-squares matrix calculation unit 15 calculates the sum-of-squares matrix of the document segment vectors. Suppose an input document in which N terms appear is divided into M document segments, and let the m-th document segment vector d_m (m = 1, ..., M) be written (d_m1, ..., d_mN)^T. The sum-of-squares matrix S = (S_ij) can then be calculated by the following equation, where T denotes the transpose of a vector.
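
Equation 1 is not reproduced in this text. Assuming it has the usual form of a sum-of-squares matrix, S_ij = Σ_m d_mi d_mj (that is, S is the sum of the outer products d_m d_m^T), the computation can be sketched in Python as follows. The function name and example vectors are illustrative.

```python
def sum_of_squares_matrix(vectors):
    """S[i][j] = sum over segments m of d_m[i] * d_m[j] (assumed form of Equation 1)."""
    n = len(vectors[0])
    S = [[0.0] * n for _ in range(n)]
    for d in vectors:
        for i in range(n):
            if d[i] == 0.0:
                continue  # skip zero components; segment vectors are usually sparse
            for j in range(n):
                S[i][j] += d[i] * d[j]
    return S

# Two hypothetical document segment vectors over N = 3 terms.
vectors = [
    [2.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]
S = sum_of_squares_matrix(vectors)
```

The result is symmetric, and its diagonal element S_nn is the sum of squared components of term n over all segments.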

[0012]

[Equation 1]

The eigenvalue/eigenvector calculation unit 16 calculates the eigenvalues and eigenvectors of the matrix S. Let the k-th order eigenvector and eigenvalue be Φ_k and λ_k. Φ_1 is the axis that maximizes the sum of squares of the projections of all the document segment vectors onto it, so it represents the concept most common to the document segments. λ_1 is that sum of squared projections itself and can be regarded as the strength, or energy, of the concept Φ_1 represents. Φ_2 is the axis that maximizes the sum of squared projections under the condition that it is orthogonal to Φ_1. The same holds for Φ_3 and beyond. The eigenvectors obtained in this way form the basis of a subspace that approximates the set of document segment vectors. If eigenvectors up to order L are used, the subspace has L dimensions, and the concept of the input document is expanded in terms of L eigenvectors whose concepts are mutually orthogonal.
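
The source does not say how the eigenvalues and eigenvectors are computed. As a self-contained stand-in, the Python sketch below uses the classical cyclic Jacobi rotation method for symmetric matrices; in practice a library eigen-solver would normally be used. The function name, tolerance, and sweep limit are assumptions of this sketch.

```python
import math

def symmetric_eigen(S, max_sweeps=100, tol=1e-18):
    """Eigenvalues (descending) and unit eigenvectors of a symmetric matrix,
    computed with cyclic Jacobi rotations."""
    n = len(S)
    A = [list(map(float, row)) for row in S]
    V = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off <= tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes A[p][q], kept within [-pi/4, pi/4].
                diff = A[q][q] - A[p][p]
                sign = 1.0 if diff >= 0.0 else -1.0
                theta = 0.5 * math.atan2(2.0 * A[p][q] * sign, abs(diff))
                c, s = math.cos(theta), math.sin(theta)
                for M in (A, V):        # column rotation: M <- M J
                    for k in range(n):
                        mp, mq = M[k][p], M[k][q]
                        M[k][p], M[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):      # row rotation: A <- J^T A
                    ap, aq = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * ap - s * aq, s * ap + c * aq
    order = sorted(range(n), key=lambda i: A[i][i], reverse=True)
    eigenvalues = [A[i][i] for i in order]
    eigenvectors = [[V[k][i] for k in range(n)] for i in order]
    return eigenvalues, eigenvectors

S = [[2.0, 1.0], [1.0, 2.0]]
eigenvalues, eigenvectors = symmetric_eigen(S)
```

`eigenvectors[k]` is the eigenvector for `eigenvalues[k]`, so index 0 corresponds to Φ_1 in the text. The sign of each eigenvector is arbitrary, which does not matter here because only squared projections are used.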

[0013] The principal eigenvector selection unit 17 determines the actual value of L. If the rank of the matrix S is R, then R eigenvectors are obtained from S, so the document inherently has R eigenconcepts. The subspace discards (R - L) of these eigenconcepts and represents the central concept of the document by a combination of the remaining L. The basis vectors of the subspace are the eigenvectors up to order L. The proportion of the original concept that the central concept accounts for is given by the following expression, which can serve as a guide when actually choosing the value of L.
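
Equation 2 is not reproduced in this text. Since each eigenvalue is described as the energy of its eigenconcept, a natural reading is the ratio of the sum of the first L eigenvalues to the sum of all R eigenvalues. The Python sketch below assumes that form; the threshold rule for picking L is a hypothetical example of using the ratio as a guide, not something stated in the source.

```python
def central_concept_ratio(eigenvalues, L):
    """Share of the total eigenvalue sum carried by the first L eigenvalues
    (assumed form of Equation 2). Eigenvalues must be in descending order."""
    return sum(eigenvalues[:L]) / sum(eigenvalues)

def choose_L(eigenvalues, threshold):
    """Smallest L whose ratio reaches the threshold (hypothetical selection rule)."""
    for L in range(1, len(eigenvalues) + 1):
        if central_concept_ratio(eigenvalues, L) >= threshold:
            return L
    return len(eigenvalues)

eigenvalues = [5.0, 3.0, 1.5, 0.5]   # illustrative values, descending
L = choose_L(eigenvalues, 0.75)
```

With these illustrative eigenvalues the first eigenconcept carries half of the total and the first two carry four fifths, so a threshold of 0.75 selects L = 2.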

[0014]

[Equation 2]

The importance calculation unit 18 determines, for each term, its importance with respect to the eigenconcepts corresponding to the selected eigenvectors. Consider the n-th term w_n, and let v_n be the document segment vector in which only that term appears, that is, the vector whose n-th component is 1 and whose other components are 0 (the term vector). First, consider the projection energy E of all the document segment vectors onto v_n. E is given by the following equation.

[0015]

[Equation 3]

If no term appears more than once in any document segment, S_nn is the frequency of occurrence of the n-th term w_n in the document, and the conventional method that regards the in-document frequency (tf) as the importance of w_n is equivalent to regarding E as the importance of w_n. In the present invention, Equation 3 is transformed as follows.
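
The equation images are not reproduced in this text. One reading consistent with the surrounding definitions (S as the sum-of-squares matrix of the d_m, v_n as the n-th term vector, and Φ_k, λ_k as the unit-norm eigenvectors and eigenvalues of S) is sketched below. The spectral decomposition of S in the second line is a standard identity for symmetric matrices and is assumed here, not quoted from the source.

```latex
\begin{align}
E &= \sum_{m=1}^{M} \left(d_m^{T} v_n\right)^{2} = v_n^{T} S\, v_n = S_{nn} \\
S &= \sum_{k=1}^{R} \lambda_k \Phi_k \Phi_k^{T}
\quad\Longrightarrow\quad
E = \sum_{k=1}^{R} \lambda_k \left(\Phi_k^{T} v_n\right)^{2}
  = \sum_{k=1}^{R} \lambda_k \Phi_{kn}^{2}
\end{align}
```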

[0016]

[Equation 4]

The derivation of Equation 4 uses the relation [...]. Equation 4 gives E as the sum of the contributions from the eigenconcepts of each order. Accordingly, letting imp(w_n, k) be the importance of w_n with respect to the k-th eigenconcept, imp(w_n, k) can be defined by the following equation.

[0017]

[Equation 5]

Here Φ_kn is the n-th component of Φ_k. Under the definition of Equation 5, imp(w_n, k) is directly affected by the frequency of occurrence of w_n. The value of Equation 5 normalized by S_nn can therefore also be used as the importance. In this case the importance is as follows.

[0018]

[Equation 6]

The important term output unit 19 determines and outputs the important terms of the input document on the basis of the values of imp(w_n, k). Two methods are possible.

(1) In the first method, for each eigenconcept up to a fixed order L, a fixed number of important words are extracted in order of importance according to Equation 5 or Equation 6 and output. The question is how many important words to extract for each k; one option is to set the number for each k according to the value of λ_k.

(2) In the second method, according to Equation 7 or Equation 8, the importance with respect to the whole document is first defined by summing imp(w_n, k) over k = 1 to R, and a fixed number of terms are then extracted and output in descending order of that importance. As noted above, lower-order eigenconcepts are considered closer to the central concept of the document, so lower-order imp(w_n, k) should be emphasized. Letting imp(w_n) be the importance of the term w_n with respect to the document, and ω_k the weight for the k-th order eigenconcept, imp(w_n) can be defined by

[0019]

[Equation 7]

or

[0020]

[Equation 8]

Since ω_k should be given larger values for lower orders, one possible choice is, for example,

[0021]

[Equation 9]

Because ω_k takes smaller values as k increases, the sum in Equation 8 may be taken from k = 1 to L instead of from k = 1 to R.
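
The term-importance calculation and the second output method can be sketched in Python as follows. Equations 5 to 9 are not reproduced in this text, so the sketch assumes imp(w_n, k) = λ_k Φ_kn² for Equation 5, the same value divided by S_nn for Equation 6, and a weighted sum over k for the whole-document importance. The weights in the example are hypothetical values that merely decrease with k; they are not Equation 9. Eigenpairs are passed in as plain lists, and k is 0-based in the code.

```python
def term_importance(eigenvalues, eigenvectors, n, k, S_nn=None):
    """imp(w_n, k) = lambda_k * Phi_kn**2, optionally divided by S_nn
    (assumed forms of Equations 5 and 6)."""
    value = eigenvalues[k] * eigenvectors[k][n] ** 2
    return value if S_nn is None else value / S_nn

def document_importance(eigenvalues, eigenvectors, n, weights, L=None):
    """imp(w_n) as a weighted sum of imp(w_n, k) over the first L eigenconcepts."""
    L = len(eigenvalues) if L is None else L
    return sum(weights[k] * term_importance(eigenvalues, eigenvectors, n, k)
               for k in range(L))

def top_terms(terms, scores, count):
    """Terms with the highest scores, in descending order of score."""
    ranked = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:count]]

# Hand-checked eigenpairs of S = [[5, 2], [2, 2]]: eigenvalues 6 and 1.
r5 = 5 ** 0.5
eigenvalues = [6.0, 1.0]
eigenvectors = [[2 / r5, 1 / r5], [1 / r5, -2 / r5]]
weights = [1.0, 0.5]            # hypothetical: larger weight for the lower order
terms = ["matrix", "title"]     # hypothetical term labels for n = 0, 1
scores = [document_importance(eigenvalues, eigenvectors, n, weights) for n in range(2)]
best = top_terms(terms, scores, 1)
```

For the first output method, the same `top_terms` helper could be applied separately for each k to the values returned by `term_importance`.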

[0022] FIG. 2 shows a second embodiment of the present invention, which extracts important phrases. The method of the invention can be carried out by running a program incorporating the invention on a general-purpose computer. FIG. 2 is a flowchart of the computer while running such a program. Block 11 is a term detection unit, block 22 is a morphological and syntactic analysis unit, and block 13 is a document segmentation unit. Block 14 is a document segment vector creation unit, and block 28 is an importance calculation unit. Block 29 is an important phrase output unit. Of these, blocks 11, 13, and 14 are the same as those shown in FIG. 1. In addition to the morphological analysis performed in block 12 of FIG. 1, block 22 performs syntactic analysis and detects the phrases whose importance is to be evaluated. A phrase is a combination of several terms that functions as a single part of speech. The important phrases are selected from the phrases detected here.

[0023] The importance calculation unit 28 determines the importance of each phrase. Let the phrase vector of a phrase consisting of several terms be p = (p_1, ..., p_N)^T. In p, each component corresponding to a term that makes up the phrase is given the number of occurrences of that term in the phrase, and all other components are 0. Here the importance imp(p) of the phrase is defined as the sum of squares of the inner products of p with all the document segment vectors. imp(p) can be written as follows.

[0024]

[Equation 12]

This regards the product of the document energy in the direction of the phrase vector and the phrase energy as representing the importance. Under the definition of Equation 12, however, the length of the phrase may affect the importance. The value normalized by the squared norm of the phrase vector,

[0025]

[Equation 13]

may therefore be used as the importance instead. The sum of squares of the inner products of p with all the document segment vectors, which appears in both Equation 12 and Equation 13, can be calculated as follows using the sum-of-squares matrix S = (S_ij) obtained by the sum-of-squares matrix calculation unit 15 of FIG. 1.

[0026]

[Equation 14]

Accordingly, the arrangement of FIG. 2 may perform the same processing as the sum-of-squares matrix calculation unit 15 of FIG. 1 and obtain the importance defined by Equation 12 or Equation 13 using Equation 14. The important phrase output unit 29 selects and outputs a fixed number of phrases in descending order of the importance obtained for each phrase.
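
The phrase-importance calculation of the second embodiment can be sketched in Python as follows. Equations 12 to 14 are not reproduced in this text; the sketch assumes imp(p) = Σ_m (d_m · p)² for Equation 12, the same value divided by |p|² for Equation 13, and the identity Σ_m (d_m · p)² = p^T S p for Equation 14. The function names and the example vocabulary are illustrative.

```python
def phrase_vector(phrase_terms, vocab):
    """Component n = number of times term n occurs in the phrase."""
    p = [0.0] * len(vocab)
    for term in phrase_terms:
        if term in vocab:
            p[vocab[term]] += 1.0
    return p

def phrase_importance(p, S, normalize=False):
    """p^T S p, i.e. the sum over segments of (d_m . p)**2
    (assumed Equations 12 and 14); with normalize=True the value is divided
    by |p|**2 (assumed Equation 13). p must contain at least one known term."""
    n = len(p)
    value = sum(p[i] * S[i][j] * p[j] for i in range(n) for j in range(n))
    if normalize:
        value /= sum(x * x for x in p)
    return value

vocab = {"matrix": 0, "eigenvalue": 1, "document": 2}
# Sum-of-squares matrix of d_1 = (2, 1, 0) and d_2 = (0, 1, 1).
S = [[4.0, 2.0, 0.0],
     [2.0, 2.0, 1.0],
     [0.0, 1.0, 1.0]]
p = phrase_vector(["matrix", "eigenvalue"], vocab)
raw = phrase_importance(p, S)
normalized = phrase_importance(p, S, normalize=True)
```

The raw value can be checked directly against the segment vectors: (d_1 · p)² + (d_2 · p)² = 9 + 1 = 10. Ranking phrases by either value and taking a fixed number from the top corresponds to the output step of block 29.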

[0027] In the second embodiment, important sentences can be extracted as follows. In block 13, all sentences are extracted in addition to the document segments used in block 14, and in block 28, instead of a phrase vector, a sentence vector is used whose components are the numbers of occurrences of the corresponding terms in the sentence.

[0028] FIG. 3 shows a third embodiment of the present invention, which extracts important phrases. The method of the invention can be carried out by running a program incorporating the invention on a general-purpose computer. FIG. 3 is a flowchart of the computer while running such a program. Block 11 is a term detection unit, block 22 is a morphological and syntactic analysis unit, and block 13 is a document segmentation unit. Block 14 is a document segment vector creation unit, block 15 is a sum-of-squares matrix calculation unit, block 16 is an eigenvalue/eigenvector calculation unit, block 17 is a principal eigenvector selection unit, and block 38 is an importance calculation unit. Block 39 is an important phrase output unit. Of these, block 11 and blocks 13 to 17 are the same as those shown in FIG. 1, and block 22 is the same as that shown in FIG. 2.

[0029] The importance calculation unit 38 determines, for each phrase, its importance with respect to the eigenconcepts corresponding to the selected eigenvectors. As in FIG. 2, let the phrase vector of a phrase consisting of several terms be p = (p_1, ..., p_N)^T. The importance imp(p) defined by Equation 12 can be written as

[0030]

[Equation 15]

so imp(p) can be regarded as the sum of the importance of the phrase with respect to the eigenconcepts of each order. Here, therefore, imp(p, k) is defined and calculated as follows.
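
Equations 15 and 16 are not reproduced in this text. A reading consistent with the definition of imp(p) as the sum of squared inner products, and with the earlier statement that importance is the product of the squared projection and the eigenvalue, is sketched below. It assumes unit-norm eigenvectors Φ_k and the spectral decomposition of the symmetric matrix S; it is a reconstruction, not a quotation.

```latex
\begin{align}
\mathrm{imp}(p) &= \sum_{m=1}^{M} \left(d_m^{T} p\right)^{2} = p^{T} S\, p
 = \sum_{k=1}^{R} \lambda_k \left(\Phi_k^{T} p\right)^{2} \\
\mathrm{imp}(p, k) &= \lambda_k \left(\Phi_k^{T} p\right)^{2}
\end{align}
```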

[0031]

[Equation 16]

Under the definition of Equation 16, the importance tends to be larger the larger the value of imp(p) defined by Equation 15 and the larger the squared norm of p. imp(p, k) may therefore be normalized by the value of imp(p) given by Equation 14 and defined as a relative value that does not depend on the value of imp(p). In this case imp(p, k) can be defined as follows.

[0032]

[Equation 17]

Alternatively, to make each value of imp(p, k) independent of the squared norm of p, imp(p, k) may be calculated by

[Equation 18]

The important phrase output unit 39 determines and outputs the important phrases of the input document on the basis of the values of imp(p, k). As with block 19 in FIG. 1, two methods are possible.

(1) In the first method, for each eigenconcept up to a fixed order L, a fixed number of important phrases are extracted in order of importance according to Equation 16 or Equation 17 and output. The question is how many important phrases to extract for each k; one option is to set the number for each k according to the value of λ_k.

(2) In the second method, the importance imp(p) with respect to the whole document is first redefined according to Equation 19 by summing imp(p, k) over k = 1 to R, and a fixed number of phrases are then extracted and output in descending order of that importance. As noted above, lower-order eigenconcepts are considered closer to the central concept of the document, so lower-order imp(p, k) should be emphasized. With ω_k as the weight for the k-th order eigenconcept, imp(p) can be defined by

[0033]

[Equation 19]

The weights ω_k can be assigned by a method such as that shown in Equation 9. Because ω_k takes smaller values as k increases, the sum in Equation 19 may be taken from k = 1 to L instead of from k = 1 to R.
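
The per-eigenconcept phrase importance of the third embodiment can be sketched in Python as follows. Equations 16 to 19 are not reproduced in this text, so the forms below are assumptions read from the prose: λ_k (Φ_k · p)² for Equation 16, that value divided by imp(p) for Equation 17, divided by |p|² for Equation 18, and a weighted sum over k for Equation 19. The mode names and the example weights are hypothetical, and k is 0-based in the code.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def phrase_importance_k(p, eigenvalues, eigenvectors, k, mode="raw"):
    """imp(p, k) = lambda_k * (Phi_k . p)**2 (assumed Equation 16).
    mode="relative": divided by sum_k lambda_k (Phi_k . p)**2 (assumed Equation 17).
    mode="unit":     divided by |p|**2 (assumed Equation 18)."""
    value = eigenvalues[k] * dot(eigenvectors[k], p) ** 2
    if mode == "relative":
        total = sum(lam * dot(phi, p) ** 2
                    for lam, phi in zip(eigenvalues, eigenvectors))
        return value / total
    if mode == "unit":
        return value / dot(p, p)
    return value

def weighted_phrase_importance(p, eigenvalues, eigenvectors, weights, L=None):
    """imp(p) = sum_k w_k * imp(p, k) over the first L eigenconcepts
    (assumed Equation 19, using the raw imp(p, k))."""
    L = len(eigenvalues) if L is None else L
    return sum(weights[k] * phrase_importance_k(p, eigenvalues, eigenvectors, k)
               for k in range(L))

# Hand-checked eigenpairs of S = [[5, 2], [2, 2]]: eigenvalues 6 and 1.
r5 = 5 ** 0.5
eigenvalues = [6.0, 1.0]
eigenvectors = [[2 / r5, 1 / r5], [1 / r5, -2 / r5]]
p = [1.0, 1.0]                  # phrase containing both terms once
weights = [1.0, 0.5]            # hypothetical weights, decreasing with k
total = weighted_phrase_importance(p, eigenvalues, eigenvectors, weights)
```

In this example the unweighted contributions 10.8 and 0.2 add up to p^T S p = 11, which illustrates the decomposition of imp(p) into per-eigenconcept parts; truncating the sum with `L=1` keeps only the dominant eigenconcept.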

【００３４】[0034]

In the third embodiment, important sentences can be extracted as follows. In block 13, all sentences are extracted in addition to the document segments used in block 14, and in block 38 a sentence vector, whose components are the numbers of occurrences of the corresponding terms in the sentence, is used in place of the phrase vector. FIG. 4 shows the basic configuration of the important term, important phrase, and important sentence extraction device 100 of the present invention. Through the input unit 110, the user inputs a document containing the terms, phrases, and sentences to be extracted. Through the user operation unit 130, the user specifies the unit to be extracted (term, phrase, sentence, and so on). The arithmetic unit 120 extracts important terms, important phrases, and important sentences according to the present invention. The output unit 140 outputs the extracted important terms, important phrases, and important sentences.
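The sentence vector described above can be sketched in a few lines of Python. The scoring used here is the sum of squared inner products with all document segment vectors, optionally normalized by the squared norm of the sentence vector, as stated in the claims; it is not the eigenvector-based variant. The tokenizer, the function names, and the example sentences are assumptions made for illustration only.

```python
import re


def sentence_vector(sentence, terms):
    """Component n is the number of occurrences of terms[n] in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [tokens.count(term) for term in terms]


def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sentence_importance(vector, segment_vectors, normalize=True):
    """Sum of squared inner products with every document segment vector.

    With normalize=True the score is divided by the squared norm of the
    sentence vector; an all-zero vector scores 0.
    """
    score = sum(inner(vector, d) ** 2 for d in segment_vectors)
    if not normalize:
        return score
    norm_sq = inner(vector, vector)
    return score / norm_sq if norm_sq else 0.0


# Hypothetical term list and sentences.
terms = ["document", "term", "vector"]
sentences = [
    "A document is split into segments.",
    "Each segment gives a term vector.",
    "The vector of a document counts each term of the document.",
]
# In this sketch each sentence doubles as a document segment.
vectors = [sentence_vector(s, terms) for s in sentences]
scores = [sentence_importance(v, vectors) for v in vectors]
best = sentences[scores.index(max(scores))]
```

Here the tokenizer only matches whole lowercase words, so "segments" does not count as an occurrence of a term such as "segment"; a real system would need its own term matching.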

【００３５】[0035]

[Effects of the Invention] From an English document consisting of 58 sentences, the 44 nouns that appear two or more times were used as terms, a document segment vector was created for each sentence, and important words were extracted. The results agreed well with human intuition: words that a person would consider important were extracted as important words. Thus, according to the present invention, words that follow the central concept of the document are extracted as important words, so the capability of important word extraction is significantly enhanced.
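The setup of this experiment (terms appearing at least twice, one document segment vector per sentence) can be sketched as follows. This is a minimal sketch under assumptions: the nouns are taken as already extracted by some upstream step, the function names and sample data are hypothetical, and the sum of squares matrix is computed as the sum over segments of the outer product of each segment vector with itself.

```python
from collections import Counter


def select_terms(noun_lists, min_count=2):
    """Keep nouns that occur at least min_count times in the whole document."""
    counts = Counter(noun for nouns in noun_lists for noun in nouns)
    return sorted(noun for noun, c in counts.items() if c >= min_count)


def segment_vectors(noun_lists, terms):
    """One vector per sentence; component n counts terms[n] in that sentence."""
    return [[nouns.count(t) for t in terms] for nouns in noun_lists]


def square_sum_matrix(vectors):
    """Entry [a][b] is the sum over segments m of d_m[a] * d_m[b]."""
    n = len(vectors[0])
    return [
        [sum(d[a] * d[b] for d in vectors) for b in range(n)]
        for a in range(n)
    ]


# Hypothetical pre-extracted nouns for three sentences.
noun_lists = [
    ["document", "segment"],
    ["segment", "vector", "matrix"],
    ["document", "vector", "vector"],
]
terms = select_terms(noun_lists)
vectors = segment_vectors(noun_lists, terms)
matrix = square_sum_matrix(vectors)
```

In this toy input "matrix" occurs only once and is therefore dropped from the term list, mirroring the "two or more times" criterion.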

[Brief description of drawings]

FIG. 1 is a diagram showing a first embodiment of the present invention.

FIG. 2 is a diagram showing a second embodiment of the present invention.

FIG. 3 is a diagram showing a third embodiment of the present invention.

FIG. 4 is a basic configuration diagram of the device of the present invention.

[Explanation of symbols]

100: Important term / phrase / sentence extraction device
110: Input unit
120: Arithmetic unit
130: User operation unit
140: Output unit

Continued from front page: F-terms (reference) 5B009 QA01 QA12 5B075 ND03 NK02 NK32 NR05 NR20 PP25 PQ75 PR04 5B091 AA15 CA01 EA24

Claims

[Claims]

1. A method for extracting important terms from an input document including one or a plurality of document segments, comprising the following steps (a) to (f): (a) for each document segment, generating a document segment vector whose components are values related to the frequency of appearance of the terms appearing in the document segment; (b) generating a sum of squares matrix from the document segment vectors and calculating eigenvectors and eigenvalues of the sum of squares matrix; (c) selecting a predetermined number of eigenvectors and eigenvalues from the eigenvectors and eigenvalues; (d) for a term in the input document, generating a term vector in which only the component corresponding to the term is 1 and the others are 0; (e) determining the importance of the term using the predetermined number of eigenvectors, the predetermined number of eigenvalues, and the term vector; and (f) selecting and outputting important terms of the input document using the importance.

2. The method according to claim 1, wherein, when the number of the terms is N, the number of the document segments is M, and the m-th document segment vector is d_m = (d_m1, ..., d_mN)^T (m = 1, ..., M), where T represents the transpose of the vector and d_mn represents a value related to the frequency of occurrence of the n-th term appearing in the document segment, the sum of squares matrix is calculated as:
$$A = \sum_{m=1}^{M} d_m d_m^{T}$$

3. The method according to claim 1, wherein the importance of the term with respect to the eigenvector of each order is determined by the product of the eigenvalue of that order and the square of the inner product of the eigenvector of that order and the term vector.

4. The method according to claim 1, wherein the importance of the term with respect to the entire input document is obtained as a weighted sum, over the predetermined number of orders, of the product of the eigenvalue of each order and the square of the inner product of the eigenvector of that order and the term vector, using weighting factors calculated from the eigenvalues of the respective orders.

5. The method according to claim 3, wherein the importance of the term is normalized by a value of a diagonal term corresponding to each term in the sum of squares matrix or a value related thereto.

6. A method for extracting important phrases from an input document including one or a plurality of document segments, comprising the following steps (a) to (d): (a) generating a document segment vector whose components are values related to the frequency of appearance of the terms appearing in the document segment; (b) for a phrase in the input document, generating a phrase vector in which each component corresponding to a term included in the phrase is given the number of occurrences of that term in the phrase and the other components are 0; (c) determining the importance of the phrase using the sum of squares of the inner products of the phrase vector and all document segment vectors; and (d) selecting and outputting important phrases of the input document using the importance.

7. The method according to claim 6, wherein the importance of the phrase is normalized by the square of the norm of the phrase vector.

8. A method for extracting important phrases from an input document including one or a plurality of document segments, comprising the following steps (a) to (f): (a) for each document segment, generating a document segment vector whose components are values related to the frequency of appearance of the terms appearing in the document segment; (b) generating a sum of squares matrix from the document segment vectors and calculating eigenvectors and eigenvalues of the sum of squares matrix; (c) selecting a predetermined number of eigenvectors and eigenvalues from the eigenvectors and eigenvalues; (d) for a phrase in the input document, generating a phrase vector in which each component corresponding to a term included in the phrase is given the number of occurrences of that term in the phrase and the other components are 0; (e) determining the importance of the phrase using the predetermined number of eigenvectors, the predetermined number of eigenvalues, and the phrase vector; and (f) selecting and outputting important phrases of the input document using the importance.

9. The method according to claim 8, wherein, when the number of the terms is N, the number of the document segments is M, and the m-th document segment vector is d_m = (d_m1, ..., d_mN)^T (m = 1, ..., M), where T represents the transpose of the vector and d_mn represents a value related to the frequency of occurrence of the n-th term appearing in the document segment, the sum of squares matrix is calculated as:
$$A = \sum_{m=1}^{M} d_m d_m^{T}$$

10. The method according to claim 8, wherein the importance of the phrase with respect to the eigenvector of each order is determined by the product of the eigenvalue of that order and the square of the inner product of the eigenvector of that order and the phrase vector.

11. The method according to claim 8, wherein the importance of the phrase with respect to the entire input document is obtained as a weighted sum, over the predetermined number of orders, of the product of the eigenvalue of each order and the square of the inner product of the eigenvector of that order and the phrase vector, using weighting factors calculated from the eigenvalues of the respective orders.

12. The importance of the phrase is normalized by a squared value of the norm of the phrase vector, a sum of squared dot products of all document segment vectors and the phrase vector, or a value related thereto. , 11.

13. A method for extracting important sentences from an input document including one or a plurality of document segments, comprising the following steps (a) to (d): (a) generating a document segment vector whose components are values related to the frequency of appearance of the terms appearing in the document segment; (b) for a sentence in the input document, generating a sentence vector in which each component corresponding to a term included in the sentence is given the number of occurrences of that term in the sentence and the other components are 0; (c) determining the importance of the sentence using the sum of squares of the inner products of the sentence vector and all document segment vectors; and (d) selecting and outputting important sentences of the input document using the importance.

14. The degree of importance of the sentence is normalized by a square value of the norm of the sentence vector.
The method described.

15. A method for extracting important sentences from an input document including one or a plurality of document segments, comprising the following steps (a) to (f): (a) for each document segment, generating a document segment vector whose components are values related to the frequency of appearance of the terms appearing in the document segment; (b) generating a sum of squares matrix from the document segment vectors and calculating eigenvectors and eigenvalues of the sum of squares matrix; (c) selecting a predetermined number of eigenvectors and eigenvalues from the eigenvectors and eigenvalues; (d) for a sentence in the input document, generating a sentence vector in which each component corresponding to a term included in the sentence is given the number of occurrences of that term in the sentence and the other components are 0; (e) determining the importance of the sentence using the predetermined number of eigenvectors, the predetermined number of eigenvalues, and the sentence vector; and (f) selecting and outputting important sentences of the input document using the importance.

16. The method according to claim 15, wherein, when the number of the terms is N, the number of the document segments is M, and the m-th document segment vector is d_m = (d_m1, ..., d_mN)^T (m = 1, ..., M), where T represents the transpose of the vector and d_mn represents a value related to the frequency of occurrence of the n-th term appearing in the document segment, the sum of squares matrix is calculated as:
$$A = \sum_{m=1}^{M} d_m d_m^{T}$$

17. The method according to claim 15, wherein the importance of the sentence with respect to the eigenvector of each order is determined by the product of the eigenvalue of that order and the square of the inner product of the eigenvector of that order and the sentence vector.

18. The method according to claim 15, wherein the importance of the sentence with respect to the entire input document is obtained as a weighted sum, over the predetermined number of orders, of the product of the eigenvalue of each order and the square of the inner product of the eigenvector of that order and the sentence vector, using weighting factors calculated from the eigenvalues of the respective orders.

19. The method according to claim 17, wherein the importance of the sentence is normalized by a value of a sum of squares of inner products of all document segment vectors and a sentence vector or a value related thereto.

20. An apparatus for extracting important phrases from an input document including one or a plurality of document segments, comprising the following means (a) to (d): (a) means for generating a document segment vector whose components are values related to the frequency of appearance of the terms appearing in the document segment; (b) means for generating, for a phrase in the input document, a phrase vector in which each component corresponding to a term included in the phrase is given the number of occurrences of that term in the phrase and the other components are 0; (c) means for determining the importance of the phrase using the sum of squares of the inner products of the phrase vector and all document segment vectors; and (d) means for selecting and outputting important phrases of the input document using the importance.

21. An apparatus for extracting important sentences from an input document including one or a plurality of document segments, comprising the following means (a) to (f): (a) means for generating, for each document segment, a document segment vector whose components are values related to the frequency of appearance of the terms appearing in the document segment; (b) means for generating a sum of squares matrix from the document segment vectors and calculating eigenvectors and eigenvalues of the sum of squares matrix; (c) means for selecting a predetermined number of eigenvectors and eigenvalues from the eigenvectors and eigenvalues; (d) means for generating, for a sentence in the input document, a sentence vector in which each component corresponding to a term included in the sentence is given the number of occurrences of that term in the sentence and the other components are 0; (e) means for determining the importance of the sentence using the predetermined number of eigenvectors, the predetermined number of eigenvalues, and the sentence vector; and (f) means for selecting and outputting important sentences of the input document using the importance.

22. The apparatus according to claim 21, wherein, when the number of the terms is N, the number of the document segments is M, and the m-th document segment vector is d_m = (d_m1, ..., d_mN)^T (m = 1, ..., M), where T represents the transpose of the vector and d_mn represents a value related to the frequency of occurrence of the n-th term appearing in the document segment, the sum of squares matrix is calculated as:
$$A = \sum_{m=1}^{M} d_m d_m^{T}$$

23. The apparatus according to claim 21, wherein the importance of the sentence with respect to the eigenvector of each order is determined by the product of the eigenvalue of that order and the square of the inner product of the eigenvector of that order and the sentence vector.

24. The apparatus according to claim 21, wherein the importance of the sentence with respect to the entire input document is obtained as a weighted sum, over the predetermined number of orders, of the product of the eigenvalue of each order and the square of the inner product of the eigenvector of that order and the sentence vector, using weighting factors calculated from the eigenvalues of the respective orders.