JP4639388B2 - Important word extraction method, important word extraction apparatus, computer program, and program storage medium in document database

Info

Publication number: JP4639388B2
Application number: JP2004268702A
Authority: JP
Inventors: 信義清水; 知義堀澤; 克江大長
Original assignee: Keio University
Current assignee: Keio University
Priority date: 2004-09-15
Filing date: 2004-09-15
Publication date: 2011-02-23
Anticipated expiration: 2024-09-15
Also published as: JP2006085374A

Description

The present invention relates to a method by which a programmed computer extracts important words from a document database that aggregates a large number of documents in a specific field, such as a given academic field.

For a document database that aggregates a large number of documents, one known approach to distinguishing documents from one another is to extract the terms (important words) that appear with high frequency in particular documents rather than evenly across the whole database. Unlike particles and adverbs, which occur many times in every document, important words are words or collocations whose absolute number of occurrences is not large, but which appear more often in particular documents than in others, or which by themselves convey the content of a document to some extent.

In general, whether a term in a database is an important word is determined from its appearance frequency. Well-known methods of calculating appearance frequency include the tf method, the idf method, and the tf·idf method. The appearance frequency calculated by one of these methods is taken as the importance, and terms with high importance are defined as important words. The quantified importance is also used to judge whether a keyword submitted to the database is appropriate for retrieving a particular document, and to identify which important words characterize each document. Non-Patent Document 1 below describes information processing methods related to important word extraction.
Non-Patent Document 1: Makoto Nagao and five others, "Information Processing of Characters and Sounds" (文字と音の情報処理), 1st printing, January 21, 2000, pp. 29-35

The present inventors have found empirically that, in a document database aggregating a large number of documents in a specific field such as a given academic field, a term is not necessarily an important word for a particular document even when it appears with high frequency only in that document. In other words, the importance calculation methods described above cannot identify appropriate keywords that characterize a document.

The inventors have also found that the content of a document in a specialized field, such as an academic paper, cannot be grasped without examining the importance of all the terms in the document and the correlations among them. That is, even if an academically distinctive term is contained in only a very small number of documents, those documents do not necessarily all have the same content. Take, for example, OMIM (Online Mendelian Inheritance in Man), a document database that aggregates medical and molecular-genetic commentaries (entries) on human genetic traits such as genetic diseases. Because all OMIM entries concern the same academic field, extracting the important words of each entry by a conventional frequency-based method yields the same terms as important words for multiple entries. Suppose that, to search for an entry describing a specific genetic disease, terms characterizing entries are presented according to importance calculated by a conventional method. Even if such a term does characterize a specific content (disease), multiple entries containing that term are likely to be presented. Because the presented entries each concern a different disease, a researcher looking for the entry on the disease of interest must ultimately read through every presented entry, and finding the target entry takes a great deal of labor and time.

For a document database aggregating a large number of documents in a specific field, such as a given academic field, the inventors first studied how to calculate importance so that the important words of each document can be identified accurately. They arrived at a special formula that first calculates the appearance frequency of each term and then derives the importance from that appearance frequency. They also studied, and devised, a way of presenting the terms contained in a document, based on the importance obtained by that formula, so that the content of each document can be grasped at a glance.

The present invention is based on these findings. Its object is to provide an important word extraction method that, for a document database aggregating a large number of documents in a specific field, accurately identifies the important words characterizing each document and allows the content of each document to be grasped at a glance.

To achieve the above object, the present invention is a method in which a programmed computer searches a document database aggregating n documents relating to a given academic field, calculates the importance of the terms contained in the database, and extracts terms of high importance for that field, the method comprising:
a term storage step of acquiring the total number m of terms contained in the database and each term T_j (j = 1, 2, 3, ..., m), and identifying and managing each term T_j;
an appearance frequency calculation step of calculating an appearance frequency W_ij of term T_j in document D_i;
a variance calculation step of calculating the variance S_j^2 of the appearance frequency values W_ij for term T_j;
an importance calculation step of calculating, with U_ij denoting the number of occurrences of term T_j in document D_i, the importance V_ij of term T_j in document D_i by
V_ij = U_ij × S_j^2;
a list creation step of creating and outputting a term list in which the terms T_j are listed based on V_ij;
a step of extracting one or more documents D_h from the n documents contained in the document database;
a step of acquiring the total number x of terms contained in document D_h;
a step of outputting the terms T_g (g = 1, 2, 3, ..., x) contained in the term list for document D_h created by the list creation step, and accepting, by user input, designation of one or more terms T_k (k = 1, 2, 3, ..., at most x) from among the terms T_g;
a step of acquiring the number a of designated terms T_k;
a step of extracting y terms T_f (f = 1, 2, 3, ..., y) based on the importance V_hg of the terms T_g in document D_h;
a step of acquiring, while variably setting the number y of terms, the number b of terms among the terms T_f that match the designated terms T_k;
a step of calculating the term extraction accuracy Z_h for document D_h by
Z_h = b / a + {x − (a + y − b)} / (x − a);
a step of acquiring x and y at which the value of Z_h is maximized, and obtaining a function y = f(x) approximating the relationship between x and y; and
a step of calculating, based on the function y = f(x), the number y_i of terms to be listed in the term list for a document D_i containing x_i terms by y_i = f(x_i), and re-creating the term list with the calculated number y_i of terms,
wherein, in the appearance frequency calculation step, with U_j denoting the number of occurrences of term T_j in all documents, U_ij denoting the number of occurrences of term T_j in document D_i, and U denoting the total number of occurrences of all m acquired terms, the appearance frequency W_ij is calculated by
W_ij = (U_ij / U_j) × log(U / U_j),
and wherein, in the variance calculation step, with W denoting the mean appearance frequency of term T_j over all documents, the variance S_j^2 is calculated by
S_j^2 = {(W_1j − W)^2 + (W_2j − W)^2 + ... + (W_nj − W)^2} / n.

The scope of the present invention also covers important word extraction methods for a document database that add any of the following requirements (1) to (4) to the above method.

(1) The method includes a step of accepting designation of a document D_i by user input, and the list creation step creates a list in which the terms T_j contained in the designated document D_i are listed in order of importance.

(2) The list creation step creates, for each of the documents, a term list in which terms are listed in descending order of importance, and the method includes a step of accepting designation of a keyword by user input and a step of outputting, from among the term lists of all documents, the term list of each document D_i in which the term corresponding to the keyword has a predetermined importance V_ij.

(3) The method includes a step of accessing a dictionary database containing specific terms, and the list creation step omits from the term list any term present in the dictionary database.

(4) The method includes a step of accessing a coefficient database that stores specific terms in association with coefficients, and a step of multiplying the importance V_ij of term T_j by the corresponding coefficient to obtain a new importance, and the list creation step creates the list based on the new importance.

The present invention also extends to an important word extraction apparatus that is constituted by a computer and executes the steps of the above method, to a computer program that is installed on a computer and causes that computer to execute the steps of any of the above methods, and to a computer-readable program storage medium on which the computer program is recorded.

According to the important word extraction method of the present invention, for a document database aggregating a large number of documents in a specific field, the important words characterizing each document can be identified accurately from the terms it contains, and the content of each document can be grasped at a glance.

=== Outline of the important word extraction method ===
As one embodiment of the present invention, this description presents a computer (an important word extraction apparatus; hereinafter, the extraction apparatus) programmed to access a document database aggregating a large number of documents in a specific field and to extract the important words contained in the documents by the method of the present invention. When extracting important words, the extraction apparatus of this embodiment measures the importance of the terms contained in each document of the database using a special formula and, as the result, creates and outputs for each document a term list in which terms are listed in descending order of importance. The document database may be attached to the extraction apparatus or may be external to it. Outputting the list may mean storing the list in a given storage resource, or presenting it so that users of the document database can view it.

=== Document database ===
In this embodiment, the extraction apparatus extracts important words from OMIM. As is well known, OMIM is the document-database form of MIM (Mendelian Inheritance in Man), an encyclopedia of medical and molecular-genetic commentaries on human genetic traits such as genetic diseases. As of January 2004, OMIM contains more than 15,000 articles (entries), of which about 4,500 describe distinct genetic diseases. OMIM can be searched and browsed online through a WWW server on the Internet. In this embodiment the extraction apparatus accesses OMIM via the Internet, although OMIM may of course be attached to the extraction apparatus instead.

=== Calculation of importance ===
FIGS. 1(A) to 1(D) outline the importance calculation performed by the extraction apparatus. The extraction apparatus extracts terms from all OMIM entries and acquires all n entries and all m terms. It assigns an identifier D_i (i = 1, 2, 3, ..., n) to each entry and an identifier T_j (j = 1, 2, 3, ..., m) to each term, and identifies and manages the n entries and m terms by these identifiers. For each entry it also counts and stores the number of occurrences of each term T_j. It then creates a matrix of n rows and m columns and stores, in each cell of the matrix, the number of occurrences of term T_j in entry D_i. Specifying the row and column (i, j) of a cell therefore gives the number of occurrences of a specific term in a specific entry. Here, the number of occurrences of term T_j in entry D_i (the term occurrence count) is denoted U_ij, and the sum of the cells in one column, that is, the number of occurrences of term T_j across all entries (the total term occurrence count), is denoted U_j (A). The number of occurrences of all terms in all entries (the total occurrence count of all terms) is denoted U.
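The counting step can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the three-entry corpus is made-up stand-in data, and whitespace tokenization is an assumption, since the text does not say how terms are delimited.

```python
from collections import Counter

# Made-up stand-in for OMIM entries D_1..D_n (not real OMIM text).
entries = [
    "glaucoma gene mutation glaucoma",
    "gene mutation deafness",
    "gene deafness deafness hearing",
]

def tokenize(text):
    # Assumption: terms are whitespace-separated, case-insensitive tokens.
    return text.lower().split()

counts = [Counter(tokenize(entry)) for entry in entries]
terms = sorted(set().union(*counts))   # T_1..T_m in a fixed order
n, m = len(entries), len(terms)

# U_ij: n x m matrix; cell (i, j) holds occurrences of term T_j in entry D_i.
U_ij = [[count[term] for term in terms] for count in counts]
# U_j: column totals (occurrences of T_j across all entries).
U_j = [sum(U_ij[i][j] for i in range(n)) for j in range(m)]
# U: occurrences of all terms in all entries.
U = sum(U_j)
```

Here `terms` fixes the column order, so `U_ij[i][j]` can be read off by entry index and term index exactly as the cell (i, j) in the text.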

Next, the appearance frequency W_ij of term T_j in entry D_i is calculated by the following equation (1),
W_ij = (U_ij / U_j) × log(U / U_j) ... Equation (1)
and the value of W_ij is stored in each cell (B).
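A minimal sketch of the appearance frequency calculation, assuming the form recited in the claim, W_ij = (U_ij / U_j) × log(U / U_j). The natural logarithm is an assumption (the text does not state a base), the function name `appearance_frequency` is hypothetical, and the count matrix is toy data in which every term occurs at least once, so U_j is never zero.

```python
import math

# Toy count matrix U_ij (rows: entries D_i, columns: terms T_j); made-up data.
U_ij = [
    [0, 1, 2, 0, 1],
    [1, 1, 0, 0, 1],
    [2, 1, 0, 1, 0],
]

def appearance_frequency(counts):
    """Return W_ij = (U_ij / U_j) * log(U / U_j) for every cell."""
    n, m = len(counts), len(counts[0])
    col_totals = [sum(counts[i][j] for i in range(n)) for j in range(m)]  # U_j
    grand_total = sum(col_totals)                                         # U
    return [
        [(counts[i][j] / col_totals[j]) * math.log(grand_total / col_totals[j])
         for j in range(m)]
        for i in range(n)
    ]

W = appearance_frequency(U_ij)
```

In this toy matrix, the term in column 2 occurs only in the first entry, so its W is log(U / U_j) there and 0 elsewhere, while the term in column 1 occurs once in every entry and gets the same W in each row.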

In this embodiment, the variance S_j^2 of the W_ij values is then calculated for each term T_j (C).
That is, in the matrix (B) whose cells hold the appearance frequencies W_ij, with W denoting the mean of the W_ij values in a column, the variance S_j^2 of each column is calculated by the well-known equation (2) below (C).
S_j^2 = {(W_1j − W)^2 + (W_2j − W)^2 + ... + (W_nj − W)^2} / n ... Equation (2)
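Equation (2) is the population variance of each column (it divides by n, not n − 1). The sketch below computes it directly and, as a cross-check, with `statistics.pvariance` from the Python standard library. The matrix values and the function name `column_variances` are made up for illustration.

```python
import statistics

def column_variances(W):
    """Population variance of each column of W (divide by n, as in equation (2))."""
    n, m = len(W), len(W[0])
    variances = []
    for j in range(m):
        column = [W[i][j] for i in range(n)]
        mean = sum(column) / n                      # W in equation (2)
        variances.append(sum((w - mean) ** 2 for w in column) / n)
    return variances

# Made-up appearance frequencies W_ij (rows: entries, columns: terms).
W = [
    [0.0, 0.4, 1.7],
    [0.3, 0.4, 0.0],
    [0.6, 0.4, 0.0],
]
S2 = column_variances(W)
# The standard library computes the same quantity column by column.
reference = [statistics.pvariance(column) for column in zip(*W)]
```

A term whose W is the same in every entry (the middle column) has variance close to zero, while a term concentrated in one entry (the last column) has the largest variance.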

Next, based on S_j^2, the importance V_ij of term T_j in entry D_i is calculated by the following equation (3),
V_ij = U_ij × S_j^2 ... Equation (3)
and the result is stored in the corresponding cell (D).
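Equation (3) scales each occurrence count by the variance of its term. A minimal sketch follows; the function name `importance` and all numbers are made up for illustration.

```python
def importance(U_ij, S2_j):
    """Return V_ij = U_ij * S_j^2 for every cell."""
    return [[u * s2 for u, s2 in zip(row, S2_j)] for row in U_ij]

# Made-up numbers: term 0 has a low variance but occurs often in entry 0.
U_ij = [
    [9, 1],
    [1, 0],
]
S2_j = [0.05, 0.30]
V = importance(U_ij, S2_j)
```

In the first entry, the low-variance term 0 still ends up with the higher importance (about 0.45 against 0.30) because it occurs nine times there.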

In this embodiment, the appearance frequency is calculated by the newly devised equation (1) above so that the importance is obtained more accurately, but the conventional tf, idf, or tf·idf method may be used to calculate the appearance frequency instead. The idea of the present invention is to abandon the conventional notion of using the appearance frequency itself as the importance and instead, for a specific term in a specific document, to take as the importance the product of the variance of the term's appearance frequency and the term's occurrence count in that document.

=== Presentation of important words ===
Having calculated the importance of each term in each entry by equations (1) to (3), the extraction apparatus creates, for each entry, a term list ranked by importance. In this embodiment, it creates and stores for each entry a term list in which terms are listed in descending order of importance.
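Building the per-entry list amounts to sorting one row of the importance matrix. The sketch below is one way to do it; dropping terms with zero importance (terms absent from the entry) and breaking ties alphabetically are assumptions, and the names and values are made up.

```python
def term_list(terms, V_row, limit=None):
    """Rank the terms of one entry by descending importance V_ij."""
    ranked = sorted(
        ((term, v) for term, v in zip(terms, V_row) if v > 0),
        key=lambda pair: (-pair[1], pair[0]),   # high importance first, then A-Z
    )
    return ranked if limit is None else ranked[:limit]

# Made-up row of the importance matrix for a single entry.
terms = ["deafness", "gene", "glaucoma", "mutation"]
V_row = [0.0, 0.02, 1.3, 0.4]
ranked = term_list(terms, V_row)
```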

The created list may be output so that it can be viewed, for example, on a display attached to the extraction apparatus or on a computer that can access the extraction apparatus. Multiple terms for a given entry are thereby shown according to their importance, and a specialist such as a researcher can, at a glance at the term list, see the importance of the terms in the entry and how they relate to one another, and so reliably grasp the content of the entry.

FIG. 2 shows a term list created by the method of this embodiment for one particular entry (entry number #137750). Entry #137750, titled GLAUCOMA, PRIMARY OPEN ANGLE, JUVENILE-ONSET, 1; JOAG, describes the genetic disease juvenile-onset primary open-angle glaucoma. FIG. 2 shows a list 10 in which the terms (11a, 11b) of entry #137750 are listed from the top in descending order of importance 12. For reference, the variance 13 of each term is also shown in the list 10. Even a term with a low variance receives a high importance in an entry where it occurs many times, and is therefore listed near the top of the term list for that entry. Terms that can serve as important words for a specific entry are thus reliably listed near the top.

To verify the reliability of the importance calculated in this embodiment, a specialist such as a researcher was asked to read entry #137750 and to designate the terms 11a that can actually be adopted as important words for grasping the content of the entry. The important words 11a recognized by the specialist are listed near the top of the list 10 created by the extraction apparatus. The importance calculation adopted in this embodiment was thus found to be an indicator for extracting important words accurately. In addition, because the terms of each entry are listed in order of importance, the relationship among the terms contained in the entry can be seen. That is, the importance of multiple terms can be compared, and a glance at the top terms of the term list created by the extraction apparatus is enough to recognize the content of the entry accurately.

=== User interface ===
When a user actually uses the extraction apparatus of this embodiment, the user interface may, for example, accept designation of an entry and present the term list of that entry, or accept designation of a keyword and present the term lists in which the term corresponding to the keyword has high importance.
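The keyword-driven lookup can be sketched as a scan over the per-entry term lists. The entry identifiers, the threshold parameter `min_importance`, and the function name are hypothetical; the text only says that lists in which the keyword has high (or a predetermined) importance are presented.

```python
def lists_for_keyword(term_lists, keyword, min_importance):
    """Return (entry_id, importance, term_list) for entries where the keyword ranks high."""
    hits = []
    for entry_id, ranked in term_lists.items():
        score = dict(ranked).get(keyword)       # None if the keyword is not listed
        if score is not None and score >= min_importance:
            hits.append((entry_id, score, ranked))
    hits.sort(key=lambda hit: -hit[1])          # most important match first
    return hits

# Made-up term lists keyed by hypothetical entry identifiers.
term_lists = {
    "entry-A": [("glaucoma", 1.3), ("mutation", 0.4)],
    "entry-B": [("deafness", 0.9), ("mutation", 0.1)],
    "entry-C": [("mutation", 0.7), ("gene", 0.2)],
}
hits = lists_for_keyword(term_lists, "mutation", 0.3)
```

Designating an entry instead of a keyword is the simpler case: it is a direct lookup such as `term_lists["entry-A"]`.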

Input of entries and keywords from the user, and presentation of the term extraction results derived from that input, may be handled through a user interface such as a keyboard and display provided on the extraction apparatus itself. Alternatively, the extraction apparatus may implement the functions of a WWW server and be connected to the Internet. In that case, the WWW server function provides a Web page containing a form for designating an entry or keyword. The user retrieves the Web page with a computer equipped with a browser, such as a personal computer (a browser terminal), and sends the entry or keyword entered on the page to the extraction apparatus. The extraction apparatus then generates a Web page containing the term list of the designated entry, or the term lists of entries in which the term corresponding to the keyword has high importance, and returns it to the browser terminal.

=== Improving important word extraction accuracy ===
As FIG. 2 shows, when importance is calculated by the formula of this embodiment, a small number of terms 11b that are not particularly important to researchers are listed near the top of the term list 10. It is preferable to exclude such terms 11b as far as possible. To that end, a dictionary of terms to be removed from term lists may be prepared, and any term in a created term list that appears in the dictionary may be deleted from the list. For the OMIM database targeted by this embodiment, candidates for deletion include personal names, specific mutation names, and markers.
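The dictionary-based exclusion is a simple filter over a ranked list. In this sketch the terms are invented placeholders (they are not taken from FIG. 2), and case-insensitive matching is an assumption.

```python
def drop_excluded(ranked, exclusion_dictionary):
    """Remove terms found in the exclusion dictionary, keeping the ranking order."""
    excluded = {term.lower() for term in exclusion_dictionary}
    return [(term, v) for term, v in ranked if term.lower() not in excluded]

# Made-up ranked list: a personal name and a marker sit among useful terms.
ranked = [("glaucoma", 1.3), ("smith", 0.9), ("marker-x", 0.6), ("trabecular", 0.5)]
exclusion_dictionary = {"Smith", "Marker-X"}
cleaned = drop_excluded(ranked, exclusion_dictionary)
```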

=== Weighting of important words ===
Even a term low in the term list may in fact be important to researchers. To handle this, coefficients by which importance is to be multiplied are associated in advance with terms contained in the database and stored in a given database. When the extraction apparatus presents the term list for an entry, it multiplies the importance of each term obtained by equations (1) to (3) by the corresponding coefficient to update the importance, and creates the term list based on the updated importance. A term that would be listed low on the basis of equations (1) to (3) alone is thereby listed higher, in line with its practical importance. Researchers can thus grasp the content of an entry in more detail without examining the lower part of the term list.

The database associating terms with coefficients may be an internal database attached to the extraction apparatus or an external database that the extraction apparatus can access. The coefficients may be set as appropriate. For example, entirely unnecessary terms such as particles and adverbs may be given a coefficient of 0 so that they are not listed. Terms of specific types found in a medical dictionary (gene symbols, organs, tissues, symptoms, diseases, and the like) reliably indicate the content of an entry despite their low variance, so they may uniformly be given a high coefficient so that they are listed near the top. Terms that should automatically be listed near the top, such as OMIM titles, may uniformly be given a predetermined coefficient large enough to ensure that they are listed.
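A minimal sketch of the coefficient weighting, with the coefficient database stood in for by a plain dictionary. The default coefficient of 1.0 for terms without an entry, the function name `reweight`, and all values are assumptions for illustration.

```python
def reweight(ranked, coefficients, default=1.0):
    """Multiply each importance by its term's coefficient and re-rank.

    Terms whose updated importance is 0 (coefficient 0) are dropped from the list.
    """
    rescored = [(term, v * coefficients.get(term, default)) for term, v in ranked]
    kept = [(term, v) for term, v in rescored if v > 0]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))

# Made-up list: "the" is an unwanted function word, "retina" a dictionary term.
ranked = [("glaucoma", 1.3), ("the", 0.8), ("retina", 0.2)]
coefficients = {"the": 0.0, "retina": 10.0}
updated = reweight(ranked, coefficients)
```

With these numbers the function word disappears and the dictionary term moves from last place to first.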

=== Optimizing the number of important words extracted ===
Naturally, a term list need not include terms that are entirely unnecessary. As long as a suitable number of terms are listed, the content of the entry can be grasped from those terms alone. A list that includes useless terms is also large, and handling that volume of data places an excessive load on the extraction apparatus. However, if the number of listed terms is limited uniformly for all entries, important terms may be dropped from the list for long entries that contain many terms. The number of terms listed therefore needs to be optimized for each entry.

One technique for this optimization is as follows. In outline, a specialist able to judge the importance of terms, such as a researcher, reads a given entry and selects the important words from it. The selected important words are compared with the terms in the term list created by the extraction apparatus, and a rule applicable to all entries is derived from the degree of agreement between the terms in the list and the important words actually selected by the specialist. The optimum number of terms to include in the term list is then determined for each entry according to that rule.

Specifically, the extraction apparatus accepts from a user such as a specialist, via the user interface, designation of a suitable set of entries D_h from among all entries and designation of the important words T_k (k = 1, 2, 3, ...) that the user has selected from the terms contained in each D_h, and acquires the number a of designated terms T_k for each designated entry D_h.

For the term list of entry D_h created by equations (1) to (3), the extraction apparatus also acquires the total number x of terms in the list and each term T_g (g = 1, 2, 3, ..., x) in the list. Next, with y denoting the number of terms to be included in the term list, it lists y terms T_f (f = 1, 2, 3, ..., y) from the terms T_g based on the importance V_hg. It then acquires the number b of terms among the T_f that match the terms T_k designated by the user.

When extracting the y terms, the value of y itself may be set and the terms T_g extracted in descending order of importance V_hg up to the y-th term. Alternatively, an importance threshold may be variably set, the terms T_g whose importance is at or above that threshold extracted as T_f, and the number of such T_f taken as y. In either case, the value of y may be varied as appropriate on the basis of importance.
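
The two ways of choosing the y terms described above can be sketched as follows. This is only an illustrative sketch in Python: the function name select_terms, the dictionary of importance scores, and the sample values are assumptions introduced here and do not appear in the specification.

```python
from typing import Dict, List, Optional


def select_terms(importance: Dict[str, float],
                 y: Optional[int] = None,
                 threshold: Optional[float] = None) -> List[str]:
    """Return the terms T_f chosen from the candidate terms T_g.

    Exactly one of `y` (fixed count) or `threshold` (minimum importance)
    must be given. Both parameter names are hypothetical.
    """
    if (y is None) == (threshold is None):
        raise ValueError("specify exactly one of y or threshold")
    # Rank by descending importance; ties are broken alphabetically
    # so that the result is deterministic (an assumption of this sketch).
    ranked = sorted(importance, key=lambda t: (-importance[t], t))
    if y is not None:
        return ranked[:y]
    return [t for t in ranked if importance[t] >= threshold]


# Made-up importance values V_hg for one entry D_h.
scores = {"gene": 0.9, "mutation": 0.7, "the": 0.1, "protein": 0.4}
top2 = select_terms(scores, y=2)             # fixed y
above = select_terms(scores, threshold=0.4)  # y follows from the threshold
```

With the threshold mode, y is simply the length of the returned list.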

Next, while the value of y is varied, the accuracy Z_h of the terms T_f listed for document D_h is calculated by the following equation (4):
Z_h = b/a + {x − (a + y − b)}/(x − a) ... (4)
The values of x and y at which Z_h takes its maximum are obtained, and a function y = f(x) approximating the relationship between x and y is determined.
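
As an illustration of equation (4), the sketch below sweeps y from 1 to x for a single document and records the y at which Z_h is largest. Read this way, the first term b/a is the share of expert-selected terms that were listed, and the second term is the share of the remaining x − a terms that were left out. The function names, the sample terms, and the choice to keep the first maximum on ties are assumptions made for this example; it also assumes that every expert-selected term appears among the x ranked terms and that 0 < a < x.

```python
from typing import List, Set, Tuple


def accuracy(x: int, a: int, y: int, b: int) -> float:
    """Z_h = b/a + {x - (a + y - b)}/(x - a), i.e. equation (4)."""
    if a <= 0 or x <= a:
        raise ValueError("requires 0 < a < x")
    return b / a + (x - (a + y - b)) / (x - a)


def best_y(ranked_terms: List[str], expert_terms: Set[str]) -> Tuple[int, float]:
    """Sweep y = 1..x and return (y, Z_h) at the maximum of Z_h.

    `ranked_terms` holds the terms T_g in descending order of importance;
    `expert_terms` holds the terms T_k selected by the expert.
    """
    x = len(ranked_terms)
    a = len(expert_terms)
    best = (0, float("-inf"))
    b = 0  # number of listed terms that match an expert term
    for y in range(1, x + 1):
        if ranked_terms[y - 1] in expert_terms:
            b += 1
        z = accuracy(x, a, y, b)
        if z > best[1]:  # strict comparison keeps the first maximum
            best = (y, z)
    return best


# Made-up example: six ranked terms, three of them chosen by the expert.
ranked = ["gene", "mutation", "locus", "the", "of", "protein"]
chosen = {"gene", "mutation", "protein"}
y_opt, z_opt = best_y(ranked, chosen)
```

Repeating this for each reviewed document yields one (x, y) pair per document, which is the input needed to approximate y = f(x).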

This function y = f(x) is also applied to the other entries D_i to determine the number of terms to be listed in the term list of every entry D_i. That is, when entry D_i contains x_i terms, the number y_i of terms to include in the term list for that entry is calculated by the above function as
y_i = f(x_i)
The term list for entry D_i is then re-created so as to contain the determined number of terms y_i.
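
The specification does not state what form f takes. Purely as an assumption, the sketch below fits a straight line y = slope × x + intercept by ordinary least squares to the (x, y) pairs obtained from the expert-reviewed entries, then applies it to another entry. The names fit_linear and terms_to_list, the rounding and clamping, and the sample data are hypothetical.

```python
from typing import List, Tuple


def fit_linear(points: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def terms_to_list(x_i: int, slope: float, intercept: float) -> int:
    """y_i = f(x_i), rounded to an integer and clamped to 0..x_i."""
    return max(0, min(x_i, round(slope * x_i + intercept)))


# Made-up (x, y) pairs: total terms x and the y that maximized Z_h.
samples = [(100, 12), (200, 22), (300, 32)]
slope, intercept = fit_linear(samples)
y_new = terms_to_list(250, slope, intercept)  # entry D_i with x_i = 250
```

A two-parameter fit of this kind needs at least two reviewed entries; if only a single entry D_h is used, a simpler form of f would have to be assumed.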

Therefore, once the number of terms has been determined, term lists are created for all entries. If these term lists are kept available for reference, the calculations of equations (1) to (3) need not be repeated when a database search query is received. The extraction device need only refer to the stored term lists to present the search results for the query.
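
One possible way to keep the stored term lists available for queries is sketched below. The inverted index, the entry identifiers, and the function names are assumptions for illustration; the specification only requires that the created term lists can be referred to.

```python
from typing import Dict, List, Tuple

# One stored term list per entry: (term, importance) pairs.
TermLists = Dict[str, List[Tuple[str, float]]]
Index = Dict[str, List[Tuple[str, float]]]


def build_index(term_lists: TermLists) -> Index:
    """Map each term to (entry_id, importance) pairs, highest importance first."""
    index: Index = {}
    for entry_id, terms in term_lists.items():
        for term, importance in terms:
            index.setdefault(term, []).append((entry_id, importance))
    for hits in index.values():
        hits.sort(key=lambda h: (-h[1], h[0]))
    return index


def search(index: Index, keyword: str) -> List[str]:
    """Return the entries whose stored term list contains the keyword."""
    return [entry_id for entry_id, _ in index.get(keyword, [])]


# Made-up stored term lists for two entries.
stored = {
    "entry-1": [("mutation", 2.5), ("gene", 1.0)],
    "entry-2": [("gene", 3.0), ("locus", 0.8)],
}
index = build_index(stored)
hits = search(index, "gene")
```

Because the index is built once from the finished term lists, answering a query involves only a dictionary lookup and no recomputation of importance values.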

The entries D_h selected in the process of obtaining equation (4) need not be chosen by user input. They may be set in advance on the extraction device side, or chosen as appropriate, for example at random. D_h may of course also be a single entry, although the reliability of the function expressed by equation (4) is then somewhat lower.

=== Application example ===
The important word extraction method of the present invention is, of course, not limited to OMIM and can be applied to any database that aggregates documents on a specific field. Nor is the database limited to a collection of texts such as a collection of papers. For example, each document may be an explanatory text, such as a description or definition of a term in a specific field, and the database may be a dictionary or encyclopedia that aggregates such explanatory texts.

A diagram explaining the concept of the important word extraction method in an embodiment of the present invention.
A schematic diagram of a term list created by the above method.

Explanation of symbols

10 Term list
11a, 11b Term
12 Importance

Claims

1. An important word extraction method in a document database, in which a programmed computer searches a document database aggregating n documents relating to a predetermined academic field, calculates the importance of the terms included in the database, and extracts terms of high importance in that field, the method comprising:
a term storage step of acquiring the total number m of terms included in the database and each term T_j (j = 1, 2, 3, ..., m), and identifying and managing each term T_j;
an appearance frequency calculation step of calculating an appearance frequency W_ij of the term T_j in the document D_i;
a variance calculation step of calculating a variance S²_j of the appearance frequencies W_ij of the term T_j;
an importance calculation step of calculating, where U_ij is the number of occurrences of the term T_j in the document D_i, the importance V_ij of the term T_j in the document D_i by
V_ij = U_ij × S²_j;
a list creation step of creating and outputting a term list in which the terms T_j are listed on the basis of V_ij;
a step of extracting one or more documents D_h from the n documents included in the document database;
a step of obtaining the total number x of terms contained in the document D_h;
a step of outputting the terms T_g (g = 1, 2, 3, ..., x) included in the term list for the document D_h created by the list creation step, and accepting, by user input, designation of one or more terms T_k (k = 1, 2, 3, ..., x) from among the terms T_g;
a step of obtaining the number a of designated terms T_k;
a step of extracting y terms T_f (f = 1, 2, 3, ..., y) on the basis of the importance V_hg of the terms T_g in the document D_h;
a step of obtaining, while variably setting the number y of terms, the number b of terms T_f that match the designated terms T_k;
a step of calculating the term extraction accuracy Z_h for the document D_h by
Z_h = b/a + {x − (a + y − b)}/(x − a);
a step of obtaining x and y at which the value of Z_h is maximum, and determining a function y = f(x) approximating the relationship between x and y; and
a step of calculating, on the basis of the function y = f(x), the number y_i of terms to be listed in the term list for a document D_i containing x_i terms by y_i = f(x_i), and re-creating the term list with the calculated number of terms y_i,
wherein, in the appearance frequency calculation step, where U_j is the number of occurrences of the term T_j in all documents, U_ij is the number of occurrences of the term T_j in the document D_i, and U is the total number of occurrences of all m acquired terms, the appearance frequency W_ij is calculated by
W_ij = (U_ij / U_j) × log(U / U_j),
and wherein, in the variance calculation step, where W is the average value of the appearance frequencies of the term T_j over all documents, the variance S²_j is calculated by
S²_j = {(W_1j − W)² + (W_2j − W)² + ... + (W_nj − W)²} / n.

2. The important word extraction method in a document database according to claim 1, further comprising a step of accepting designation of a document D_i by user input, wherein the list creation step creates a list in which the terms T_j included in the designated document D_i are listed in order of the importance calculated by the importance calculation step.

3. The important word extraction method in a document database according to claim 1, wherein the list creation step creates, for each of all the documents, a term list in which terms are listed in descending order of importance, the method further comprising a step of accepting designation of a keyword by user input and a step of outputting the term list of a document D_i in which the term corresponding to the keyword has a predetermined importance V_ij.

4. The important word extraction method in a document database according to claim 1, further comprising a step of accessing a dictionary database storing specific terms, wherein the list creation step does not list in the term list any term that exists in the dictionary database.

5. The important word extraction method in a document database, further comprising a step of accessing a coefficient database in which specific terms and coefficients are stored in association with each other, and a step of setting, as a new importance, the value obtained by multiplying the importance V_ij of the term T_j by the corresponding coefficient, wherein the list creation step creates the term list on the basis of the new importance.

6. An important word extraction apparatus configured by a computer and executing each of the steps included in the method according to any one of claims 1 to 5.

7. A computer program that is installed in a computer and causes the computer to execute the steps included in the method according to any one of claims 1 to 5.

8. A computer-readable program storage medium storing the computer program according to claim 7.
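
For reference, the formulas recited in claim 1 for W_ij, S²_j, and V_ij can be sketched as below. This is an illustrative sketch only: the function name, the input layout (one term-count dictionary per document), the use of the natural logarithm, and the sample counts are assumptions and are not taken from the specification.

```python
import math
from typing import Dict, List


def importance_table(counts: List[Dict[str, int]]) -> List[Dict[str, float]]:
    """Compute V_ij = U_ij * S^2_j for every term of every document.

    counts[i][t] is U_ij, the number of occurrences of term t in document i.
    """
    n = len(counts)
    # U_j: occurrences of each term over all documents.
    totals: Dict[str, int] = {}
    for doc in counts:
        for term, u in doc.items():
            totals[term] = totals.get(term, 0) + u
    grand_total = sum(totals.values())  # U: occurrences of all terms

    variance: Dict[str, float] = {}
    for term, u_j in totals.items():
        scale = math.log(grand_total / u_j)  # natural log is an assumption
        # W_ij = (U_ij / U_j) * log(U / U_j) for each document i.
        w = [(doc.get(term, 0) / u_j) * scale for doc in counts]
        mean = sum(w) / n
        variance[term] = sum((v - mean) ** 2 for v in w) / n  # S^2_j

    return [{t: u * variance[t] for t, u in doc.items()} for doc in counts]


# Made-up counts for two documents.
docs = [{"gene": 4, "the": 5}, {"the": 5, "locus": 1}]
table = importance_table(docs)
```

In this toy input the term "the" is spread evenly over both documents, so its variance and therefore its importance is zero, while "gene", which occurs in only one document, receives a positive importance.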