JP2005141428A

JP2005141428A - Word string extracting method and device, and recording medium with word string extracting program recorded

Info

Publication number: JP2005141428A
Application number: JP2003376196A
Authority: JP
Inventors: Tsutomu Hirao; 努平尾; Hideki Isozaki; 秀樹磯崎; Jun Suzuki; 潤鈴木
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-11-05
Filing date: 2003-11-05
Publication date: 2005-06-02

Abstract

PROBLEM TO BE SOLVED: To determine a sentence score that takes into account not only individual words but also combinations of words.

SOLUTION: Given a document set belonging to a certain domain of a document DB 10, a word string extraction apparatus extracts word strings, performs a chi-square test for each extracted word string against the document group included in the predetermined domain and the remaining documents, and compares the result with a threshold to recognize word strings characteristic of the domain. It then assigns a predetermined weight to each recognized word string and calculates scores, so that high-scoring sentences are extracted from the documents belonging to the domain.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a word string extraction method, a word string extraction apparatus, and a recording medium on which a word string extraction program is recorded, which are suitable for extracting sentences that represent the characteristics of a document group included in a predetermined classification item (domain) in a document database and for creating a summary of the documents.

Advances in communication and semiconductor technology have made it easy to create and store electronic documents. As a result, electronic documents are overflowing everywhere, and against this background expectations for document summarization technology are rising. For example, if a summary could be created from a whole set of documents on a given topic, the effort needed to understand those documents would be greatly reduced.

A sentence score determination method is one technique for extracting relevant sentences from target documents when creating such a summary. The conventional method assigns a weight to each individual word appearing in a sentence, takes the sum of the weights as the sentence score, and extracts high-scoring sentences as candidates (see, for example, Non-Patent Document 1).

According to Non-Patent Document 1, a sentence is divided into words by morphological analysis, the words to be weighted are selected by part of speech, and the weight of each selected word is then determined to obtain the sentence score. The exact definition of the sentence score is given by the following equation (1).

Here, tf(t, S_i) is the frequency of word t in sentence S_i, and ω(t) is the weight of word t. TF (term frequency), IDF (inverse document frequency), TF·IDF, or the like is used as ω(t).
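
Equation (1) is not reproduced in this text. Based on the symbol definitions above, it presumably has the following form, where Score(S_i) is a name assumed here for the sentence score:

```latex
\mathrm{Score}(S_i) = \sum_{t \in S_i} tf(t, S_i) \cdot \omega(t)
```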

The conventional sentence score determination method is described below using the example sentence 「暗証番号を入力する。」 ("Enter the PIN."). First, morphological analysis yields the result shown in Table 1 below.

If the words to be weighted are restricted to nouns and verbs, 暗証 ("password"), 番号 ("number"), 入力 ("input"), and する ("do") are selected. The sentence score is the sum of the weights of these words, namely 6 + 3 + 3.5 + 0.9 = 13.4. The IDF given by the following equation (2) is used as the word weight.
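
Equation (2) is not reproduced in this text. The following is a minimal sketch of the conventional scoring, assuming the common IDF form log(|DB| / df(t)); the patent's exact formula may differ in detail. The function names and the English glosses used as dictionary keys are illustrative and do not come from the source.

```python
import math
from collections import Counter

def idf(db_size: int, df: int) -> float:
    """Assumed IDF form: log(|DB| / df(t)). Equation (2) itself is not shown in the text."""
    return math.log(db_size / df)

def conventional_score(words, weight):
    """Sum of tf(t, S_i) * w(t) over the distinct words t of the sentence."""
    tf = Counter(words)
    return sum(count * weight[t] for t, count in tf.items())

# Weights quoted in the worked example (English glosses of the four Japanese words).
weights = {"password": 6.0, "number": 3.0, "input": 3.5, "do": 0.9}
score = conventional_score(["password", "number", "input", "do"], weights)
print(round(score, 1))  # 13.4
```

Each word is scored on its own here, which is the limitation the invention addresses.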

Here, DB is a document database, and |DB| is the number of elements contained in DB. df(t) is the number of documents in DB that contain the word t. The larger |DB| is, the better; no particular size is required, but about ten years of newspaper articles is preferable.

Klaus Zechner, “Fast Generation of Abstracts from General Document”, The 16th International Conference Computational Linguistics, pp. 986-989, 1996

In the conventional sentence score determination method described above, each word appearing in a sentence is evaluated independently. However, the importance of a sentence depends not only on individual words but also on combinations of words, which must therefore be taken into account.
The present invention has been made in view of the above circumstances. Its object is to provide a word string extraction method, a word string extraction apparatus, and a recording medium on which a word string extraction program is recorded, which extract combinations of n (n ≥ 1) or more words in a sentence (word strings), select from them the word strings characteristic of a predetermined classification item (domain), and determine sentence scores accordingly, so that search results and the like can be summarized efficiently and with high reliability, reducing the burden on people reading documents.

To solve the above problems, the present invention is configured as follows. The invention according to claim 1 is a word string extraction method for extracting word strings that represent the characteristics of a document group belonging to a predetermined domain in a document database. The method uses a word string extraction processing function, a word string recognition processing function, a weight assigning function, and a score determination processing function. When a document set belonging to a certain domain is given, the method executes: a step in which the word string extraction processing function extracts word strings of length 1 or more from the document database; a step in which the word string recognition processing function recognizes, from the extracted word strings, word strings characteristic of the domain; a step in which the weight assigning function assigns a predetermined weight to each recognized word string; and a step in which the score determination processing function determines a sentence score based on the weights of the word strings and extracts word strings having a high score.

According to this invention, when a document set belonging to a certain domain is given, word strings of length 1 or more are extracted from the document database, word strings characteristic of the domain are recognized among them, a predetermined weight is assigned to each recognized word string, sentence scores are determined, and high-scoring word strings are extracted. Scores are thus evaluated by weighting word strings rather than individual words. This makes it possible to summarize documents obtained by a keyword search or the like reliably and efficiently, reducing the burden on the reader.

In the invention according to claim 2, the step of recognizing word strings includes a sub-step of performing a chi-square test for each extracted word string against the document group included in the predetermined domain in the document database and the remaining documents, and comparing the result with a threshold to recognize word strings characteristic of the domain.

According to this invention, word strings characteristic of the domain can be recognized by performing a chi-square test for each extracted word string against the document group included in the predetermined domain in the document database and the remaining documents, and comparing the result with a threshold.

The invention according to claim 3 is a word string extraction apparatus for extracting word strings that represent the characteristics of a document group belonging to a predetermined domain in a document database. The apparatus comprises: a word string extraction processing unit that, when a document set belonging to a certain domain is given, extracts word strings of length 1 or more from the document database; a word string recognition processing unit that recognizes, from the extracted word strings, word strings characteristic of the domain; a weight assigning unit that assigns a predetermined weight to each recognized word string; and a score determination processing unit that determines a sentence score based on the weights of the word strings and extracts word strings having a high score.

According to this invention, when a document set belonging to a certain domain is given, the word string recognition processing unit recognizes word strings characteristic of the domain among the word strings that the word string extraction processing unit extracts from the document database. The weight assigning unit assigns a predetermined weight to each recognized word string, and the score determination processing unit determines sentence scores based on those weights and extracts high-scoring word strings. Scores are thus evaluated by weighting word strings rather than individual words. This provides a word string extraction apparatus that can summarize documents obtained by a keyword search or the like reliably and efficiently, reducing the burden on the reader.

The invention according to claim 4 is a computer-readable recording medium on which is recorded a word string extraction program for extracting word strings that represent the characteristics of a document group belonging to a predetermined domain in a document database. When a document set belonging to a certain domain is given, the program causes the computer to execute: a word string extraction processing function that extracts word strings of length 1 or more from the document database; a word string recognition processing function that recognizes, from the extracted word strings, word strings characteristic of the domain; a weight assigning function that assigns a predetermined weight to each recognized word string; and a score determination processing function that determines a sentence score based on the weights of the word strings and extracts sentences having a high score.

According to this invention, by having a computer execute the above program, when a document set belonging to a certain domain is given, word strings of length 1 or more are extracted from the document database, word strings characteristic of the domain are recognized among them, a predetermined weight is assigned to each recognized word string, sentence scores are determined, and high-scoring word strings are extracted. Scores are evaluated by weighting word strings rather than individual words, which makes it possible to summarize search results and the like reliably and efficiently and reduces the burden on the reader.

According to the present invention, search results and the like can be summarized reliably and efficiently by extracting combinations of words in sentences (word strings), selecting the word strings characteristic of a predetermined classification item (domain), and determining sentence scores from them. Amid the flood of electronic documents, this makes it possible, for example, to summarize a whole series of documents on a given topic, greatly reducing the burden on the reader.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing the internal configuration of the word string extraction apparatus of this embodiment; (a) shows the functional configuration and (b) shows the hardware configuration.
Functionally, the word string extraction apparatus 1 of this embodiment consists of a document database (hereinafter, document DB) 10, a word string extraction processing unit 11, a word string recognition processing unit 12, a weight assigning unit 13, and a score determination processing unit 14.

A document group is assumed to be stored in the document DB 10 in advance. When a document set belonging to a predetermined classification item (domain) in the document DB 10 is given, the word string extraction processing unit 11 extracts word strings of length 1 or more from the document DB 10 and supplies them to the word string recognition processing unit 12. The word string recognition processing unit 12 recognizes word strings characteristic of the domain among the extracted word strings. Specifically, it performs a chi-square (χ²) test for each extracted word string against the document group included in the predetermined domain in the document DB 10 and the remaining documents, and compares the result with a threshold to recognize word strings characteristic of the domain.
The weight assigning unit 13 assigns a weight to each word string recognized by the word string recognition processing unit 12 by evaluating a predetermined arithmetic expression. The score determination processing unit 14 determines a sentence score based on the weights of the word strings and extracts word strings having a high score.

The hardware configuration shown in FIG. 1(b) comprises a main control device 21 that executes the functions of the word string extraction processing unit 11, the word string recognition processing unit 12, the weight assigning unit 13, and the score determination processing unit 14; a storage device 22 in which the document DB 10 is built; an input/output device 23 that serves as a man-machine interface, for example for entering the domain designation into the main control device 21 and displaying the score determination output from the main control device 21; and an NIC (Network Interface Unit) 24. These are connected in common via a system bus consisting of multiple lines for address, data, and control.
The main control device 21 consists of a RAM in which programs are stored and a CPU that reads and sequentially executes the programs stored in the RAM.

The word string extraction method of the present invention is described in detail below. The document DB 10 is denoted {D₁, D₂, …, D_i, …, D_n}, and the set of sentences it contains is denoted S = {S₁, S₂, …, S_j, …, S_m}. A sequential pattern mining algorithm can be applied to extract the word strings contained in the sentences of S. For this purpose, a well-known data mining tool for extracting frequent patterns, such as PrefixSpan, may be used.

Now consider dividing the document group in the document DB 10 into a certain domain (for example, "economy") and the rest. The document group belonging to the domain is denoted D_dom and the rest D_other, and the sentence sets belonging to each are denoted S_dom and S_other.
Let P be the set of word strings obtained from S_dom, and let P_i be an element of P. For each P_i, the contingency table shown in Table 2 can be obtained.

n₁₁ is the number of sentences in S_dom in which word string P_i appears, n₁₂ is the number of sentences in S_other in which P_i appears, and n₂₁ is the number of sentences in S_dom in which P_i does not appear. Here "¬P_i" is the negation of the presence of P_i, meaning that P_i does not appear. Whether P_i is characteristic of the domain is determined using the χ² test. The χ² value is obtained by the following equation (3).
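
Neither Table 2 nor equation (3) is reproduced in this text. The sketch below assumes that Table 2 is the usual 2×2 contingency table, that n₂₂ (not defined in the text) is the number of sentences in S_other in which P_i does not appear, and that equation (3) is the standard Pearson chi-square statistic without continuity correction. The function name and the sample counts are hypothetical.

```python
def chi_square_2x2(n11: int, n12: int, n21: int, n22: int) -> float:
    """Pearson chi-square statistic for a 2x2 contingency table.

    n11: sentences in S_dom containing P_i
    n12: sentences in S_other containing P_i
    n21: sentences in S_dom without P_i
    n22: sentences in S_other without P_i (assumed; not defined in the text)
    """
    n = n11 + n12 + n21 + n22
    denom = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    if denom == 0:
        # A zero marginal total leaves the statistic undefined; this sketch treats it as 0.
        return 0.0
    return n * (n11 * n22 - n12 * n21) ** 2 / denom

# Hypothetical counts: P_i occurs in 20 of 30 domain sentences and 5 of 20 other sentences.
print(chi_square_2x2(20, 5, 10, 15))
```

A word string that is equally frequent inside and outside the domain yields a value of 0, while a strongly skewed distribution yields a large value.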

If the χ² value is at least the threshold 3.8415 (the critical value of the χ² distribution with one degree of freedom at the 5% significance level), P_i can be regarded as a word string characteristic of the domain. The set of word strings whose χ² value is 3.8415 or more is denoted P_sig. Each word string p belonging to P_sig is weighted using the following equation (4).

Here, f(p, S_dom) is the number of sentences containing word string p in the sentence set S_dom belonging to the domain to be summarized, f(p, S) is the number of sentences containing p in all the data, |S| is the number of sentences (the number of elements of S), and len(p) is the length of the word string (its number of words).
Equation (4) is intended to avoid large differences in weight between word strings once they reach a certain occurrence frequency. Various alternatives are conceivable, for example an expression that omits the denominator of equation (4), or one that removes the log from the first term of the numerator.

Finally, the sentence score is determined by the following equation (5), using the word string weights ω(p) obtained in this way.
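
Equation (5) is not reproduced in this text. Since the description characterizes it as the sum of the word string weights, one plausible form is the following, where Score(S_j) and the relation p ⊑ S_j (p occurs in S_j as a word string) are notation assumed here:

```latex
\mathrm{Score}(S_j) = \sum_{p \in P_{sig},\; p \sqsubseteq S_j} \omega(p)
```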

FIG. 2 is a flowchart referred to in explaining the operation of the word string extraction apparatus of the present invention, and FIG. 3 is a conceptual diagram referred to in explaining a specific example. The flowchart in FIG. 2 also shows the processing procedure of the word string extraction program of the present invention.
The operation of the word string extraction apparatus shown in FIG. 1 is described in detail below with reference to FIGS. 2 and 3.

First, a sentence to be scored is input to the word string extraction apparatus. The word string extraction processing unit 11 performs morphological analysis (S21), narrows the result down by part of speech to select nouns and verbs (S22), and applies sequential pattern mining to the selected words only to generate word strings (S23).
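
Steps S21 and S22 can be sketched as follows. The tagged token list stands in for the output of a morphological analyzer; the part-of-speech labels are assumptions, since Table 1 is not reproduced in this text, and the function name is hypothetical.

```python
# Hypothetical morphological analysis output: (surface form, part of speech).
tagged = [("暗証", "noun"), ("番号", "noun"), ("を", "particle"),
          ("入力", "noun"), ("する", "verb"), ("。", "symbol")]

def select_content_words(tagged_tokens, keep=("noun", "verb")):
    """Keep only tokens whose part of speech is in `keep`, preserving sentence order."""
    return [surface for surface, pos in tagged_tokens if pos in keep]

print(select_content_words(tagged))  # ['暗証', '番号', '入力', 'する']
```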

Specifically, as shown in FIG. 3, when the sentence 「暗証番号を入力する。」 ("Enter the PIN.") is input, morphological analysis and narrowing down by part of speech yield the nouns and verbs 暗証, 番号, 入力, and する ("password", "number", "input", "do"). Applying sequential pattern mining to these generates the word strings "password-number-input-do", "password-number-input", "password-number-do", "password-input-do", "number-input-do", "password-number", "password-input", "password-do", "number-input", "number-do", "input-do", "password", "number", "input", and "do", which are passed to the word string recognition processing unit 12.
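
For a single sentence, the word strings listed above are exactly the non-empty, order-preserving subsequences of the selected words. The following minimal sketch enumerates them with `itertools.combinations` rather than with a PrefixSpan implementation; the function name and the English glosses are illustrative.

```python
from itertools import combinations

def word_strings(words):
    """Return all non-empty order-preserving subsequences, longest first."""
    result = []
    for length in range(len(words), 0, -1):
        for idx in combinations(range(len(words)), length):
            result.append(tuple(words[i] for i in idx))
    return result

strings = word_strings(["password", "number", "input", "do"])
print(len(strings))          # 15
print("-".join(strings[0]))  # password-number-input-do
```

A sentence of k selected words yields 2^k - 1 word strings, so a real system would rely on the minimum-support pruning of a sequential pattern miner rather than on exhaustive enumeration.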

The word string recognition processing unit 12 performs a χ² test for each word string extracted by the word string extraction processing unit 11 against the document group included in the predetermined domain in the document DB 10 and the remaining documents (S24). By comparing the result with the threshold α, it determines whether the word string is characteristic of the domain (S25).

In this example, applying equation (3) to the word strings above yields the χ² values 6.7, 5.5, 4.5, 3.5, 2.1, 5, 3.1, 3.3, 2.1, 1.1, 0.9, 5.8, 2.2, 1.2, and 0.2, respectively (S24). Each value is compared with the threshold α = 3.8415 to determine whether the word string is characteristic of the domain (S25). Here the five word strings "password-number-input-do", "password-number-input", "password-number-do", "password-number", and "password" are selected and extracted as word strings characteristic of the domain.
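
The threshold comparison in step S25 can be checked against the figures of this example. The sketch below pairs the fifteen word strings with the quoted χ² values in the order given and keeps those at or above the threshold; the variable names and English glosses are illustrative.

```python
from itertools import combinations

CRITICAL_VALUE = 3.8415  # chi-square critical value, 1 degree of freedom, 5% level

words = ["password", "number", "input", "do"]
# Same order as the example: longest word strings first.
strings = ["-".join(words[i] for i in idx)
           for n in range(len(words), 0, -1)
           for idx in combinations(range(len(words)), n)]
chi2 = [6.7, 5.5, 4.5, 3.5, 2.1, 5, 3.1, 3.3, 2.1, 1.1, 0.9, 5.8, 2.2, 1.2, 0.2]

p_sig = [s for s, x in zip(strings, chi2) if x >= CRITICAL_VALUE]
print(p_sig)
```

This reproduces the five selected word strings, confirming that the listed χ² values and the selection are mutually consistent.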

Next, the weight assigning unit 13 assigns a weight to each word string selected in this way by applying equation (4) (S26). In the example shown in FIG. 3, the extracted word strings are given the weights 15.1, 7.2, 6.1, 4.2, and 0.1, respectively.

The score determination processing unit 14 then applies equation (5) to calculate the sum of the weights and determine the sentence score (S27). In the example shown in FIG. 3, the score is 15.1 + 7.2 + 6.1 + 4.2 + 0.1 = 32.7.
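
Step S27 amounts to summing the weights of the domain-characteristic word strings found in the sentence. The sketch below uses the weights of this example, assuming they are listed in the same order as the five selected word strings; the function name and English glosses are illustrative.

```python
# Weights from the example, keyed by the five selected word strings (English glosses).
omega = {
    "password-number-input-do": 15.1,
    "password-number-input": 7.2,
    "password-number-do": 6.1,
    "password-number": 4.2,
    "password": 0.1,
}

def sentence_score(strings_in_sentence, omega):
    """Sum the weights of the domain-characteristic word strings found in the sentence."""
    return sum(omega[p] for p in strings_in_sentence if p in omega)

# "number-input" occurs in the sentence but was not selected, so it adds nothing.
score = sentence_score(list(omega) + ["number-input"], omega)
print(round(score, 1))  # 32.7
```

Compared with the conventional score of 13.4 for the same sentence, the word combinations now dominate the result.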

After the scores have been determined in this way, word strings with high scores are selected and used to create a summary or the like.

As described above, the present invention makes it possible to extract sentences that represent the characteristics of a document group included in a predetermined domain of the document DB 10 and to create a summary of the documents. To this end, a χ² test is performed against the document group included in the predetermined domain of the document DB 10 and the remaining documents, the resulting χ² values are thresholded, weights are assigned to the word strings that pass the threshold, scores are determined, and high-scoring sentences are extracted. In other words, instead of weighting the individual words produced by morphological analysis, the invention weights word strings, which enables score evaluation based on combinations of words.

As a result, high-scoring sentences can be extracted from multiple documents belonging to a certain domain, and highly reliable summaries and the like can be created. Search results and the like can therefore be summarized efficiently, reducing the burden on people reading documents.

The present invention can also be realized by recording the procedures executed by the word string extraction processing unit 11, the word string recognition processing unit 12, the weight assigning unit 13, and the score determination processing unit 14 shown in FIG. 1 on a computer-readable recording medium as a program, and having a computer system read and execute the program recorded on that medium. The computer system referred to here includes an OS and hardware such as peripheral devices.

FIG. 1 is a block diagram showing the internal configuration of an embodiment of the present invention.
FIG. 2 is a flowchart referred to in explaining the operation of the embodiment.
FIG. 3 is a conceptual diagram illustrating the operation of the embodiment.

Explanation of symbols

10 Document DB
11 Word string extraction processing unit
12 Word string recognition processing unit
13 Weight assigning unit
14 Score determination processing unit

Claims

A word string extraction method for extracting word strings representing the characteristics of a document group belonging to a predetermined domain in a document database, the method using a word string extraction processing function, a word string recognition processing function, a weight assigning function, and a score determination processing function, wherein,
when a document set belonging to a certain domain is given, the method executes:
a step in which the word string extraction processing function extracts word strings of length 1 or more from the document database;
a step in which the word string recognition processing function recognizes, from the extracted word strings, word strings characteristic of the domain;
a step in which the weight assigning function assigns a predetermined weight to each recognized word string; and
a step in which the score determination processing function determines a sentence score based on the weights of the word strings and extracts word strings having a high score.

The word string extraction method according to claim 1, wherein the step of recognizing the word strings includes
a sub-step of performing a chi-square test for each extracted word string against the document group included in the predetermined domain in the document database and the remaining documents, and comparing the result with a threshold to recognize word strings characteristic of the domain.

A word string extraction apparatus for extracting word strings representing the characteristics of a document group belonging to a predetermined domain in a document database, comprising:
a word string extraction processing unit that, when a document set belonging to a certain domain is given,
extracts word strings of length 1 or more from the document database;
a word string recognition processing unit that recognizes, from the extracted word strings, word strings characteristic of the domain;
a weight assigning unit that assigns a predetermined weight to each recognized word string; and
a score determination processing unit that determines a sentence score based on the weights of the word strings and extracts word strings having a high score.

A computer-readable recording medium on which is recorded a word string extraction program for extracting word strings representing the characteristics of a document group belonging to a predetermined domain in a document database, wherein,
when a document set belonging to a certain domain is given,
the program causes the computer to execute:
a word string extraction processing function that extracts word strings of length 1 or more from the document database;
a word string recognition processing function that recognizes, from the extracted word strings, word strings characteristic of the domain;
a weight assigning function that assigns a predetermined weight to each recognized word string; and
a score determination processing function that determines a sentence score based on the weights of the word strings and extracts sentences having a high score.