JP2016057652A

JP2016057652A - Word list generation device and word list generation method

Info

Publication number: JP2016057652A
Application number: JP2014180762A
Authority: JP
Inventors: 慶今沢; Kei Imazawa; 義輝勝村; Yoshiteru Katsumura; 隆介木村; Ryusuke Kimura
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2014-09-05
Filing date: 2014-09-05
Publication date: 2016-04-21

Abstract

PROBLEM TO BE SOLVED: To reduce the number of words that a user is asked to input from among the words missing from a document written in natural language.

SOLUTION: Important words extracted from other documents are clustered and stored in a database together with the causal relationships among them. When important words are extracted from a new document, each database important word that is absent from the new document's important words is handled as follows. If the upstream word of its causal relationship within its cluster is not among the new document's important words, the word is included in the word list presented to the user for input. If that upstream word is among the new document's important words, the word is not included in the word list. The word list is created in this way.

SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing technique for generating a list of words that are missing from a document written in natural language.

Paragraph [0024] of JP 2004-302506 A (Patent Document 1) states: "The text mining means 122 extracts necessary information from the text data converted by the speech recognition means 121. Text mining is sometimes used to aggregate multiple pieces of text data, but it can also extract, based on language processing, conceptual information from text that is likely to be useful for mining. With this function, user information (including name and contact details), the repair site, the target device, the failed part, the failure status, and other necessary information can be extracted from the text data. The text mining means 122 is preferable because it can extract the necessary data from words that the user 200 speaks freely. Furthermore, to improve the accuracy of text mining, the repair request reception system 101 may ask predetermined questions by voice so that answers are obtained for the necessary items. Specifically, one example is a method in which the voice transmission/reception unit 110 asks for the name, address, telephone number, target device, failed part, and failure status using synthesized speech or the like and obtains answers. Because these are answers to questions, the range of the answer content can be predicted, and the accuracy of text mining improves."

Patent Document 1: JP 2004-302506 A

In Patent Document 1, every necessary item that is not obtained directly must be asked about, even when it could be obtained from other answers.

However, if the co-occurrence and causal relationships between words are taken into account, some words should be derivable from answers that have already been given.

An object of the present invention is to collect the necessary information in a short time by reducing, among the information missing from a document written in natural language, the amount of information that the user is asked to input.

The present application includes a plurality of means for solving the above problem; one example is as follows.

A word list generation method in which a new document written in natural language is input to a computer and the computer is caused to generate, from the words missing from the new document, a word list requesting input from a user, wherein
a plurality of important words extracted from other documents written in natural language are clustered in advance and stored in a database together with the causal relationships between the important words in each cluster, and
the computer is caused to:
extract important words from the new document for which the word list is to be created;
include, in the word list presented to request input from the user, any important word extracted from the other documents that is not included in the important words extracted from the new document and whose causally upstream word within its cluster is not included in the upstream words of the new document; and
exclude, from the word list presented to request input from the user, any important word extracted from the other documents that is not included in the important words extracted from the new document and whose causally upstream word within its cluster is included in the upstream words of the new document,
thereby creating the word list.

According to the present invention, the necessary information can be collected in a short time by reducing, among the information missing from a document written in natural language, the amount of information that the user is asked to input.

FIG. 1 is a block diagram of the word list generation device.
FIG. 2 is a diagram showing the processing flow of the word list generation device.
FIG. 3 is a processing flow diagram of the important word selection step.
FIG. 4 is a diagram showing the data table of the DB creation documents.
FIG. 5 is a diagram showing the word list created from the DB creation documents of FIG. 4.
FIG. 6 is a diagram showing the topic estimation model.
FIG. 7 is a diagram showing a word distribution.
FIG. 8 is a diagram showing the processing flow of the co-occurrence/causal relationship analysis step.
FIG. 9 is a diagram showing the usage record data of each important word.
FIG. 10 is a diagram showing the word cluster information of the important words.
FIG. 11 is a diagram showing the causal relationships between words.
FIG. 12 is a diagram showing the processing flow of the word list creation step.
FIG. 13 is a diagram showing a screen image that presents the word list.

Embodiments of the present invention are described below, but the present invention is not limited to them.

The embodiments are described below with reference to the drawings.
[Basic block configuration]
FIG. 1 is a block diagram of the word list generation device.

The word list generation device of this embodiment includes a database unit 1510, a word list generation engine unit 1520 connected to the database unit 1510 via a network, and a user interface unit 1530 connected to the word list generation engine unit 1520 via a network.

The database unit 1510 includes a DB creation document DB 1511, an important word DB 1512, a co-occurrence/causal relationship DB 1513, and a new document DB 1514.

The DB creation document DB 1511 stores the documents written in natural language that are used to build the important word DB 1512 and the co-occurrence/causal relationship DB 1513.

The important word DB 1512 stores the important words extracted from the DB creation document DB 1511.

The co-occurrence/causal relationship DB 1513 stores the co-occurrence relationships and causal relationships between the important words stored in the important word DB 1512.

The new document DB 1514 stores the documents written in natural language for which word lists are to be generated.

The word list generation engine unit 1520 consists of a morphological analysis engine 1521, a topic estimation model construction engine 1522, a word distribution estimation engine 1523, a threshold setting engine 1524, an important word determination engine 1525, a word cluster creation engine 1526, a causal relationship analysis engine 1527, a candidate word extraction engine 1528, and an output word selection engine 1529.

The user interface unit 1530 consists of a new document input screen 1531 and a word list confirmation screen 1532.

These engines and the user interface are described below together with the processing flow.
[Flow description]
FIG. 2 is a processing flow diagram of the word list generation device of this embodiment.

The word list generation device of this embodiment operates on documents written in natural language and has a processing flow consisting of: a DB creation document acquisition step 101 of acquiring, as DB creation documents, other documents that are not targets of word list generation (for example, documents created in the past); an important word selection step 102 of selecting important words from the acquired DB creation documents; a co-occurrence/causal relationship extraction step 103 of clustering the selected important words by co-occurrence relationship and analyzing the causal relationships among the important words in each cluster; a new document acquisition step 104 of acquiring a new document for which a word list is to be generated; an important word selection step 105 of selecting important words from the acquired new document; a word list creation step 106 of generating the word list to be presented to the user from the important words clustered and stored in the database unit and the causal relationships within their clusters; and a word list presentation step 107 of presenting the created word list to the user.

In this processing flow, steps 101 to 103 are the DB creation steps and steps 104 to 107 are the word list creation steps.
[DB creation document acquisition step]
In the DB creation document acquisition step 101, DB creation documents are acquired from the DB creation document DB 1511.

FIG. 4 shows the data table of the DB creation document DB 1511. The DB creation document DB 1511 records a case number, an issue date and time, and a document for each entry. The new document DB described later has the same format. In this embodiment all documents are acquired, but the device may be configured to acquire only some of the documents as needed, using the issue date and time column or other preset conditions.
[Important word selection step]
Next, in the important word selection step 102, the DB creation documents acquired in the DB creation document acquisition step 101 are input and important words are selected.

In this embodiment, design documents are the target, and words that affect cost or man-hours are selected as important words. In manufacturing, design documents are either received from the customer or created by the manufacturer (for example, as product specifications) in response to the customer's requests. Of the words used in design documents, only a limited number affect additional cost or additional man-hours, and the co-occurrence and causal relationships between words are easier to extract than in other documents. Being able to generate automatically, by computer processing, a word list from which rough costs and man-hours can be grasped from these design documents is a significant advantage for manufacturers.
[Detailed flow of the important word selection step]
FIG. 3 shows the detailed flow of the important word selection step 102.

The important word selection step 102 has a processing flow consisting of a morphological analysis step 301, a topic estimation model construction step 302, a word distribution analysis step 303, a threshold setting step 304, and an important word determination step 305.
<Morphological analysis step>
In the morphological analysis step 301, the DB creation documents acquired in the DB creation document acquisition step 101 are input to the morphological analysis engine 1521. Because the morphological analysis engine 1521 is existing technology, its details are omitted; it outputs a word list in which each DB creation document is divided into words.

FIG. 5 shows the word list generated in the morphological analysis step 301 from the DB creation documents of FIG. 4. The DB creation document for case No. 1, 「機器Aと機器Bの配置が変更になりましたので、配線の長さを延長してください。」 ("The placement of device A and device B has changed, so please extend the length of the wiring."), is divided into the word list 「機器A」「と」「機器B」「の」「配置」「が」「変更」「に」「なりました」「ので」「、」「配線」「の」「長さ」「を」「延長」「して」「ください」「。」.
<Topic estimation model construction step>
In the next topic estimation model construction step 302, the word list obtained by dividing the documents into words in the morphological analysis step 301 is input to the topic estimation model construction engine 1522. The topic estimation model construction engine 1522 outputs a model for estimating which topic each DB creation document deals with. Because the topic estimation model construction engine 1522 is existing technology, it is not described in detail; usable models include Naive Bayes, Probabilistic Latent Semantic Indexing, and Latent Dirichlet Allocation. In this embodiment, a model using Latent Dirichlet Allocation is built, as shown in FIG. 6.

<Word distribution analysis step>
In the next word distribution analysis step 303, the word list output by the morphological analysis step 301, the DB creation documents, and the topic estimation model built in the topic estimation model construction step 302 are input to the word distribution estimation engine 1523. The word distribution estimation engine 1523 estimates the topic of each DB creation document based on the topic estimation model, analyzes the appearance frequency of each word for each topic, and outputs the word distribution (appearance frequencies). FIG. 7 shows an example of a word distribution: a histogram of the appearance frequency of words for topic A. The horizontal axis lists the words produced by the morphological analysis engine 1521, and the vertical axis shows the appearance frequency of each word.
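As a rough illustration of the word distribution analysis, the following Python sketch counts word frequencies per topic. It assumes that each document has already been divided into words and assigned a single topic label by the topic estimation model; the function name `word_distribution_by_topic` and the sample data are hypothetical and do not come from the embodiment.

```python
from collections import Counter, defaultdict


def word_distribution_by_topic(tokenized_docs, doc_topics):
    """Count how often each word appears in the documents of each topic.

    tokenized_docs: list of word lists (output of morphological analysis).
    doc_topics: one topic label per document (assumed to come from the
                topic estimation model).
    """
    distribution = defaultdict(Counter)
    for words, topic in zip(tokenized_docs, doc_topics):
        distribution[topic].update(words)
    return dict(distribution)


# Hypothetical sample data: three tokenized documents and their topics.
docs = [
    ["device", "placement", "change", "wiring", "extend"],
    ["wiring", "length", "extend"],
    ["paint", "color", "change"],
]
topics = ["A", "A", "B"]
dist = word_distribution_by_topic(docs, topics)
```

Each value in `dist` is a `Counter`, so the per-topic histogram of FIG. 7 corresponds to, for example, `dist["A"].most_common()`.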

<Threshold setting step>
In the next threshold setting step 304, the appearance frequency of each word analyzed in the word distribution analysis step 303 is input to the threshold setting engine 1524. The threshold setting engine 1524 outputs a threshold for determining whether each word is an important word. In this embodiment, the threshold is set, for each topic, so as to determine whether the appearance frequency of a word is markedly higher than that of the other words. Specifically, the threshold is the 75th percentile of the word appearance frequencies plus a constant multiple of the difference between the 75th percentile and the 25th percentile.
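The threshold rule can be written as threshold = Q3 + k (Q3 - Q1), where Q1 and Q3 are the 25th and 75th percentiles of the word frequencies. The sketch below is one way to compute it with Python's standard library. The value k = 1.5 and the "inclusive" percentile method are assumptions, since the embodiment specifies neither, and the function name is hypothetical.

```python
import statistics


def importance_threshold(frequencies, k=1.5):
    """Return Q3 + k * (Q3 - Q1) for a list of word frequencies.

    k is the "constant multiple"; 1.5 is an assumed value, not one
    given in the embodiment.
    """
    q1, _median, q3 = statistics.quantiles(
        frequencies, n=4, method="inclusive"
    )
    return q3 + k * (q3 - q1)


# Hypothetical appearance frequencies of nine words in one topic.
topic_a_frequencies = [1, 2, 3, 4, 5, 6, 7, 8, 50]
threshold = importance_threshold(topic_a_frequencies)
```

With this sample, Q1 = 3 and Q3 = 7, so the threshold is 7 + 1.5 × 4 = 13, and only the word that appears 50 times exceeds it.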

In this embodiment, the user can freely choose whether this threshold is set separately for each topic or a common threshold is set for all topics.

Furthermore, in this embodiment only nouns (excluding pronouns) and verbs are extracted as important words; other parts of speech are excluded because they carry little information. This embodiment targets design documents and selects words that affect cost or man-hours as important words, and the other parts of speech have almost no effect on cost or man-hours.

<Important word determination step>
In the next important word determination step 305, the threshold set in the threshold setting step 304 and the appearance frequency of each word analyzed in the word distribution analysis step 303 are input to the important word determination engine 1525. The important word determination engine 1525 outputs, as the important words, the list of words whose appearance frequency is greater than the threshold. These important words are stored in the important word DB 1512.
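The determination itself can be sketched as a simple filter. The following Python example keeps words whose frequency exceeds the threshold and whose part of speech is a noun (other than a pronoun) or a verb, as described for this embodiment. The tag names, the function name, and the sample data are hypothetical.

```python
def select_important_words(frequencies, pos_tags, threshold):
    """Return words whose frequency exceeds the threshold.

    Only nouns (other than pronouns) and verbs are kept, following the
    embodiment; the tag names used here are hypothetical.
    """
    allowed = {"noun", "verb"}
    return [
        word
        for word, count in frequencies.items()
        if count > threshold and pos_tags.get(word) in allowed
    ]


# Hypothetical frequencies and part-of-speech tags for one topic.
frequencies = {"wiring": 20, "extend": 15, "this": 30, "of": 40, "color": 3}
pos_tags = {
    "wiring": "noun",
    "extend": "verb",
    "this": "pronoun",
    "of": "particle",
    "color": "noun",
}
important = select_important_words(frequencies, pos_tags, threshold=13.0)
```

Here "this" and "of" are frequent but are dropped by the part-of-speech filter, and "color" is a noun but falls below the threshold.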

Next, the detailed flow of the co-occurrence/causal relationship extraction step 103 is described with reference to FIG. 8.

[Co-occurrence/causal relationship extraction step]
The co-occurrence/causal relationship extraction step 103 consists of a word cluster creation step 801 and a causal relationship analysis step 802.

<Word cluster creation step>
In the word cluster creation step 801, the DB creation documents in the DB creation document DB 1511 and the important words selected in the important word determination step 305 are input to the word cluster creation engine 1526. The word cluster creation engine 1526 extracts the usage record of those important words in the DB creation documents. It then analyzes the co-occurrence relationships of the important words selected for each topic in the important word determination step 305 and extracts words that have a certain co-occurrence relationship with one another as one cluster.

FIG. 9 shows an example of the usage record data of each important word in the DB creation documents. Each row of the table in FIG. 9 corresponds to one DB creation document, and each column corresponds to one important word. A value of 1 in the column of important word 1 in the row of DB creation document 1 means that important word 1 is used in DB creation document 1. Conversely, a value of 0 in the column of important word 2 in the row of DB creation document 1 means that important word 2 is not used in DB creation document 1.
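A table of this kind can be built directly from the tokenized documents. The following Python sketch shows one way to do so; the function name `usage_matrix` and the sample data are hypothetical.

```python
def usage_matrix(tokenized_docs, important_words):
    """Build a documents-by-words 0/1 table like the one in FIG. 9.

    Rows follow the order of tokenized_docs and columns follow the
    order of important_words.
    """
    matrix = []
    for doc in tokenized_docs:
        doc_words = set(doc)
        matrix.append([1 if w in doc_words else 0 for w in important_words])
    return matrix


# Hypothetical sample: two documents and three important words.
docs = [
    ["placement", "change", "wiring"],
    ["wiring", "length", "extend"],
]
important_words = ["placement", "wiring", "length"]
table = usage_matrix(docs, important_words)
```

In `table`, the first row reads 1, 1, 0 because the first document uses "placement" and "wiring" but not "length".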

FIG. 10 shows an example of the word cluster information of the important words. Each row corresponds to one important word, and the second column of each row gives its word cluster. In FIG. 10, the fact that important word 1 and important word 2 both have word cluster 1 means that they belong to the same word cluster. One example of a cluster division method is to apply a cluster analysis method to a matrix of the frequencies with which words co-occur. Cluster analysis methods include the k-means method and methods based on the EM algorithm.
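The matrix of co-occurrence frequencies mentioned here can be derived from a 0/1 usage table. The Python sketch below computes only that matrix; the clustering itself (for example, k-means applied to the rows) is not implemented. The function name and the sample data are hypothetical.

```python
def cooccurrence_matrix(usage):
    """Count, for each pair of words, the documents in which both appear.

    usage: documents-by-words 0/1 table (rows are documents).
    The diagonal holds the number of documents that use each word.
    """
    n_words = len(usage[0]) if usage else 0
    matrix = [[0] * n_words for _ in range(n_words)]
    for row in usage:
        used = [i for i, flag in enumerate(row) if flag]
        for i in used:
            for j in used:
                matrix[i][j] += 1
    return matrix


# Hypothetical usage table: words 0 and 1 always appear together,
# word 2 appears alone.
usage = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
cooc = cooccurrence_matrix(usage)
```

In this sample, the rows for words 0 and 1 are identical, which is the kind of similarity a cluster analysis method would use to place them in the same word cluster.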

<Causal relationship analysis step>
In the causal relationship analysis step 802, the word cluster information created in the word cluster creation step 801 and the usage record data of each word are input to the causal relationship analysis engine 1527. The causal relationship analysis engine 1527 evaluates the causal relationships between the words in each word cluster, determines which word of a pair is the upstream word, and stores the determined causal relationships between words in the co-occurrence/causal relationship DB 1513.

FIG. 11 shows an example of the causal relationships between words. Each row and each column of the table in FIG. 11 corresponds to an important word. A value of 1 in the column of important word 2 in the row of important word 1 means that there is a causal relationship in which important word 2 is also used when important word 1 is used (important word 1 is upstream and important word 2 is downstream). A value of 0 in the column of important word 1 in the row of important word 2 means that there is no causal relationship in which important word 1 is used when important word 2 is used (that is, no relationship with important word 2 upstream of important word 1).
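Under the row/column convention described for FIG. 11, the upstream words of a given word can be read from its column. The following Python sketch illustrates this lookup; the function name and the two-word sample table are hypothetical.

```python
def upstream_words(causal, words, target):
    """Return the words that are upstream of `target`.

    causal[i][j] == 1 means that words[i] is upstream of words[j],
    following the row/column convention described for FIG. 11.
    """
    j = words.index(target)
    return [words[i] for i in range(len(words)) if causal[i][j] == 1]


words = ["important word 1", "important word 2"]
causal = [
    [0, 1],  # important word 1 is upstream of important word 2
    [0, 0],  # no relationship from important word 2 to important word 1
]
up_of_2 = upstream_words(causal, words, "important word 2")
up_of_1 = upstream_words(causal, words, "important word 1")
```

Here `up_of_2` contains important word 1, while `up_of_1` is empty because nothing is upstream of important word 1.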

Another way to determine which word is upstream is to use the number of previously created documents that use each word. For example, suppose that word A is used in 100 past documents, word B is used in 50 past documents, and words A and B are used together in 10 documents. The probability that word B co-occurs when word A is used is then 10/100 = 10%, and the probability that word A co-occurs when word B is used is 10/50 = 20%. Because word A is more likely to co-occur when word B is used, it is determined that there is a causal relationship with word B upstream.
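The document-count rule can be sketched in a few lines of Python. The function name is hypothetical, both counts are assumed to be positive, and returning `None` on a tie is an assumption, since the embodiment does not say how ties are handled.

```python
def upstream_by_document_counts(count_a, count_b, count_both):
    """Decide which of two words is upstream from document counts.

    Returns "A", "B", or None when both conditional probabilities tie
    (the tie rule is an assumption; the embodiment does not cover it).
    """
    p_b_given_a = count_both / count_a  # B co-occurs when A is used
    p_a_given_b = count_both / count_b  # A co-occurs when B is used
    if p_a_given_b > p_b_given_a:
        return "B"
    if p_b_given_a > p_a_given_b:
        return "A"
    return None


# The worked example from the text: 100 documents use A, 50 use B,
# and 10 use both.
result = upstream_by_document_counts(100, 50, 10)
```

For the counts in the text, the probabilities are 0.1 and 0.2, so `result` is "B".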

[New document acquisition step and important word selection step]
The new document acquisition step 104 and the important word selection step 105 perform the same processing as the DB creation document acquisition step 101 and the important word selection step 102, differing only in that the target document is a new document. The user registers a new document in the new document DB using the new document input screen 1531.

[Word list creation step]
Next, FIG. 12 shows the detailed flow of the word list creation step 106.

In the word list creation step 106, candidate words are extracted from the important words in the important word DB 1512, the causal relationships between words in each word cluster in the co-occurrence/causal relationship DB 1513, and the important words in the new document selected in the important word selection step 105; the words of the word list to be presented to the user are then selected and output.

The word list creation step 106 consists of a candidate word extraction step 1201 and an output word selection step 1202.

<Candidate word extraction step>
In the candidate word extraction step 1201, the important words in the important word DB 1512, the causal relationships between words in each word cluster in the co-occurrence/causal relationship DB 1513, and the important words in the new document selected by the important word determination engine 1525 in the important word selection step 105 are input to the candidate word extraction engine 1528. Of the important words in the important word DB 1512, those not included in the important words of the new document are extracted as a first candidate word list. The candidate word extraction engine 1528 then refers to the causal relationships between words in the co-occurrence/causal relationship DB 1513 and generates a second candidate word list: a first candidate word is kept as it is if its upstream word is not included in the important words of the new document, and is excluded if its upstream word is included in the important words of the new document.
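The two-stage filtering can be sketched as follows in Python. The function name, the dictionary representation of upstream words, and the sample data are hypothetical. The sketch also assumes that a word is excluded when any one of its upstream words appears in the new document, which the embodiment does not state explicitly for words with several upstream words.

```python
def second_candidates(db_words, new_doc_words, upstream):
    """Filter database important words down to the second candidate list.

    db_words: important words stored in the database (ordered).
    new_doc_words: important words found in the new document.
    upstream: dict mapping a word to the words upstream of it in its
              cluster.
    """
    present = set(new_doc_words)
    # First candidate list: database words missing from the new document.
    first = [w for w in db_words if w not in present]
    # Assumption: a word is dropped when ANY of its upstream words is
    # present in the new document.
    return [w for w in first if not present.intersection(upstream.get(w, ()))]


# Hypothetical database contents and causal relationships.
db_words = ["placement", "wiring", "length", "paint", "color"]
upstream = {"wiring": ["placement"], "length": ["wiring"], "color": ["paint"]}
result = second_candidates(db_words, ["placement"], upstream)
```

In this sample, "wiring" is excluded because its upstream word "placement" already appears in the new document, while "length", "paint", and "color" remain as second candidates.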

<Output word selection step>
In the output word selection step 1202, the second candidate word list extracted in the candidate word extraction step 1201 and the causal relationships between words in the co-occurrence/causal relationship DB 1513 are input to the output word selection engine 1529. The output word selection engine 1529 refers to the causal relationships between words and outputs, as the final word list, the most upstream word of the category that contains each word of the second candidate word list. As a simpler method, the second candidate word list may be used as the final word list without change.
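One possible reading of this selection is sketched below in Python. It assumes that "category" refers to the word cluster and that the most upstream words of a cluster are those with no upstream word inside the same cluster; both are interpretations rather than statements from the embodiment. The function name and the sample data are hypothetical.

```python
def final_word_list(second_candidates, cluster_of, upstream):
    """Return the most upstream word(s) of each candidate's cluster.

    cluster_of: dict mapping each important word to its cluster id.
    upstream: dict mapping a word to the words directly upstream of it.
    """
    target_clusters = {cluster_of[w] for w in second_candidates}
    result = []
    for word, cluster in cluster_of.items():
        # A word is "most upstream" if no upstream word shares its cluster.
        has_upstream = any(
            cluster_of.get(u) == cluster for u in upstream.get(word, ())
        )
        if cluster in target_clusters and not has_upstream:
            result.append(word)
    return result


# Hypothetical clusters and causal relationships.
cluster_of = {"placement": 1, "wiring": 1, "length": 1, "paint": 2, "color": 2}
upstream = {"wiring": ["placement"], "length": ["wiring"], "color": ["paint"]}
final = final_word_list(["paint", "color"], cluster_of, upstream)
```

With this sample, both candidates belong to cluster 2, whose most upstream word is "paint", so the user would be asked about a single word instead of two.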

[Word list presentation step]
In the word list presentation step 107, the final word list created in the word list creation step 106 is presented to the user, and the user is asked to indicate whether each word needs to be supplemented. FIG. 13 shows the screen image presented to the user.

101 ... DB creation document acquisition step, 102 ... important word selection step, 103 ... co-occurrence/causal relationship extraction step, 104 ... new document acquisition step, 105 ... important word selection step, 106 ... word list creation step, 107 ... word list presentation step, 301 ... morphological analysis step, 302 ... topic estimation model construction step, 303 ... word distribution analysis step, 304 ... threshold setting step, 305 ... important word determination step, 801 ... word cluster creation step, 802 ... causal relationship analysis step, 1201 ... candidate word extraction step, 1202 ... output word selection step, 1510 ... database unit, 1511 ... DB creation document DB, 1512 ... important word DB, 1513 ... co-occurrence/causal relationship DB, 1520 ... word list generation engine unit, 1521 ... morphological analysis engine, 1522 ... topic estimation model construction engine, 1523 ... word distribution estimation engine, 1524 ... threshold setting engine, 1525 ... important word determination engine, 1526 ... word cluster creation engine, 1527 ... causal relationship analysis engine, 1528 ... candidate word extraction engine, 1529 ... output word selection engine, 1530 ... user interface, 1531 ... new document input screen, 1532 ... word list confirmation screen

Claims

A word list generation device that generates, from words lacking in a target document written in a natural language, a word list of words that a user is requested to input, the device comprising:
a database unit and a word list generation engine unit,
wherein the database unit stores a plurality of important words extracted from other documents written in a natural language, the important words being clustered and stored together with causal relationships between the words in each cluster, and
wherein the word list generation engine unit comprises:
an engine that extracts important words from a new document for which the word list is to be created; and
an engine that creates the word list by, for each word that is among the important words extracted from the other documents and is not included in the important words extracted from the new document, including the word in the word list presented to the user for input when the causal upstream word in the cluster containing the word is also not included in the important words of the new document, and
not including the word in the word list presented to the user for input when the causal upstream word in the cluster containing the word is included in the important words of the new document.

The word list generation device according to claim 1,
wherein the word list generation engine unit comprises:
an engine that extracts a plurality of important words from the other documents written in a natural language and stores them in the database unit;
an engine that extracts co-occurrence relationships between the plurality of important words, clusters the important words, and stores the result in the database unit; and
an engine that analyzes causal relationships between the clustered important words and stores them in the database unit.

The word list generation device according to claim 1 or 2,
wherein the other documents and the new document are design documents.

The word list generation device according to claim 2,
wherein the causal relationship between words in the cluster is determined by the number of documents using each word.

A word list generation method in which a new document written in a natural language is input to a computer and the computer is caused to generate, from words lacking in the new document, a word list of words that a user is requested to input, the method comprising:
clustering a plurality of important words extracted from other documents written in a natural language, and storing the important words in a database together with causal relationships between the important words in each cluster; and
causing the computer to:
extract important words from the new document for which the word list is to be created; and
create the word list by, for each word that is among the important words extracted from the other documents and is not included in the important words extracted from the new document, including the word in the word list presented to the user for input when the causal upstream word in the cluster containing the word is also not included in the important words of the new document, and
not including the word in the word list presented to the user for input when the causal upstream word in the cluster containing the word is included in the important words of the new document.

The word list generation method according to claim 5, the method comprising:
extracting a plurality of important words from the other documents written in a natural language and storing them in the database unit;
extracting co-occurrence relationships between the plurality of important words, clustering the important words, and storing the result in the database unit; and
analyzing causal relationships between the clustered important words and storing them in the database unit.

The word list generation method according to claim 5 or 6,
wherein the other documents and the new document are design documents.

The word list generation method according to claim 6,
wherein the causal relationship between words in the cluster is determined by the number of documents using each word.
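As a rough illustration of determining the causal relationship between words in a cluster from the number of documents using each word, the following Python sketch counts document usage and orders word pairs by that count. The claims state only that the document count is used. The direction chosen here (the word used in more documents is treated as upstream), the skipping of ties, the function names, and the sample data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def document_counts(documents: Iterable[Iterable[str]]) -> Counter:
    """Count, for each word, the number of documents that use it."""
    counts: Counter = Counter()
    for words in documents:
        counts.update(set(words))  # each document counts a word at most once
    return counts


def causal_pairs(cluster: Set[str], counts: Counter) -> List[Tuple[str, str]]:
    """Return (upstream, downstream) pairs for the words in one cluster.

    Assumption: the word used in more documents is treated as upstream.
    Pairs with equal counts are skipped.
    """
    pairs = []
    for a, b in combinations(sorted(cluster), 2):
        if counts[a] > counts[b]:
            pairs.append((a, b))
        elif counts[b] > counts[a]:
            pairs.append((b, a))
    return pairs


# Illustrative documents only; each document is given as its important words.
docs = [
    ["motor", "overheat"],
    ["motor"],
    ["motor", "vibration"],
    ["overheat"],
]
counts = document_counts(docs)
pairs = causal_pairs({"motor", "overheat", "vibration"}, counts)
```

The resulting pairs could then be stored with each cluster and looked up as the causal upstream word when the word list is created.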