JP2001084250A

JP2001084250A - Method and device for extracting knowledge from enormous document data and medium

Info

Publication number: JP2001084250A
Application number: JP23967499A
Authority: JP
Inventors: Yasushi Matsuzawa; 裕史松澤; Tsuyoshi Fukuda; 剛志福田; Tetsuya Nasukawa; 哲哉那須川; Toru Nagano; 徹長野; Masayuki Morohashi; 正幸諸橋
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1999-08-26
Filing date: 1999-08-26
Publication date: 2001-03-30
Anticipated expiration: 2019-08-26
Also published as: JP3353829B2

Abstract

PROBLEM TO BE SOLVED: To automatically extract documents that satisfy a pattern from an enormous number of documents, to extract useful knowledge, and to reduce the time required for a response, by generating a field-dependent dictionary from document data, generating syntax trees that take dependency (modification) relations into account by means of a language analysis device, and extracting and outputting frequently appearing patterns by means of a pattern extraction device. SOLUTION: A language feature analysis device generates an analysis-dependent dictionary. Because the language analysis device requires attributes suited to the data being analyzed, a field-dependent dictionary must be prepared. Words having specific attributes have to be prepared for each field. The language feature analysis device finds such words in the actual data and registers them in the field-dependent dictionary. A pattern extraction device uses the document data structurally analyzed by the language analysis device to obtain frequently appearing patterns, and retrieves the original documents whose syntax matches a pattern. A frequent pattern display device displays the detected frequent patterns together with the documents whose syntax trees match them.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a technique for automatically extracting patterns in order to extract knowledge from a large volume of documents in a specific field. In particular, it relates to a method, an apparatus, and a medium for extracting knowledge from enormous document data, in which useful knowledge is obtained by retrieving, from the large document collection, the documents that satisfy a specific extracted pattern.

[0002]

[Prior Art] With the development and spread of computers and network environments, enormous amounts of data have been digitized, accumulated, and made available for online reference. Data mining techniques have been actively researched and developed to put this data to effective use. Conventional data mining, however, targets only structured data that can be tabulated, consisting mainly of numerical values. Most data also contains a text portion (that is, document data), and because document data is essentially unstructured, it is difficult to tabulate, unlike numerically oriented structured data. Document data therefore basically has to be read item by item, which takes a great deal of effort. In other words, there is a limit to the amount of document data that can be analyzed by hand, so the enormous volume of document data that has been accumulated ends up being more than anyone can handle. Techniques for extracting knowledge from such unstructured text documents are called "text mining" and have recently attracted attention. Text mining is seen as the most promising technology because it can be applied not only to call center records and the tabulation of questionnaire results but also to the analysis of all kinds of unstructured documents, such as patent-related documents and business reports.

[0003] One way to analyze the contents of a large number of documents is to find documents with similar contents and sort them into categories. For example, a method currently used by web search sites and the like is to prepare categories in advance and have a person read each document, decide which category it falls under, and classify it. This work can also be done automatically, based on rules stating that a document containing specific keywords belongs to a certain category. For example, a document containing the keywords "ABS" and "airbag" can be judged to belong to the category "car". This approach is suitable for coarse classification of a large number of documents, but finer classification is difficult.

[0004] In call center operations, for example, there is a demand to improve the operation by analyzing what kinds of matters are most common in calls from customers. This can be achieved by roughly classifying the call records by hand, carefully reading the documents in each class, and counting the documents whose contents are nearly identical. For a call center that receives tens of thousands of inquiries every month, however, doing this by hand is extremely laborious and difficult in practice. Moreover, the accumulated documents concern a specific field, so the categories have to be very fine-grained, yet it is also very difficult to predict the contents and prepare the categories in advance. For example, instead of a simple category such as "car", a much finer classification such as "abnormal engine noise" is required. With such fine classification, the person doing the classifying must examine the contents of each document even more closely, and the amount of work becomes enormous. In addition, the criteria for assigning a category may differ from person to person, and even the same person may judge differently on different occasions, so objective data is hard to obtain.

[0005] In recent years, computer-based document classification methods (document clustering) have been developed, but these methods classify documents only roughly, based on the keywords that appear in them. When a specific field is targeted, a finer classification is required, and the conventional methods cannot provide it. A further problem is that, after clustering, a person cannot tell what kind of documents have been gathered into a cluster without reading them.

[0006] The conventional technique described above, which cuts words out of a large number of documents as keywords and extracts pairs of co-occurring words, is what data mining calls "association rule extraction". This method has the following problems. In a long document, the word that appears first and the word that appears last may be unrelated, yet they are counted as co-occurring. Because the dependency relations between words are ignored, two documents such as "when A did B, C did D" and "when A did D, C did B" are treated as the same if only co-occurrence is considered, even though their meanings differ. As a result, documents with the same content are often not extracted correctly.

[0007] One way to resolve these drawbacks is to extract only the documents in which specific words appear in a specific order. In data mining this is called "time-series pattern extraction". For example, it can extract only the documents in which the words appear in the order word A, word B, word C, word D. With this rule, however, the document "when A did B, C did D" can be extracted, but the document "C, when A did B, did D" cannot, because the word order differs even though the meaning is the same. In other words, to extract documents with the same content, it is necessary to consider not only the co-occurrence and order of the words but also the dependency relations between them.
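The contrast drawn in paragraphs [0006] and [0007] can be illustrated with a small sketch. It is not part of the patent: the function names are hypothetical, and the dependency pairs are assigned by hand purely for illustration (a real system would obtain them from a parser).

```python
def cooccurrence_key(words):
    """Order-free view of a document: just the set of keywords."""
    return frozenset(words)


def in_order(words, pattern):
    """True if the pattern words occur in `words` in this order (gaps allowed)."""
    remaining = iter(words)
    return all(p in remaining for p in pattern)


# Keyword sequences for the three example sentences.
s1 = ["A", "B", "C", "D"]  # "when A did B, C did D"
s2 = ["A", "D", "C", "B"]  # "when A did D, C did B"
s3 = ["C", "A", "B", "D"]  # "C, when A did B, did D"

# Hand-assigned dependency pairs (an assumption made for this sketch).
deps1 = frozenset({("A", "B"), ("C", "D"), ("B", "D")})
deps2 = frozenset({("A", "D"), ("C", "B"), ("D", "B")})
deps3 = frozenset({("A", "B"), ("C", "D"), ("B", "D")})

# Co-occurrence cannot tell s1 from s2 although their meanings differ.
same_by_cooccurrence = cooccurrence_key(s1) == cooccurrence_key(s2)
# Word order rejects s3 although it means the same as s1.
s3_found_by_order = in_order(s3, ["A", "B", "C", "D"])
# Dependency pairs group s1 with s3 and separate s2.
same_by_dependency = (deps1 == deps3, deps1 == deps2)
```

Under these assumptions, only the dependency view groups the first and third sentences together while keeping the second apart.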

[0008]

[Problems to Be Solved by the Invention] In view of the above, the present invention provides a method, an apparatus, and a medium for extracting knowledge from enormous document data, which achieve useful knowledge extraction by extracting specific patterns from a large number of documents and by automatically extracting the documents that satisfy those patterns.

[0009]

[Means for Solving the Problems] The present invention is directed to a method for extracting knowledge from a large amount of document data. The knowledge extraction method includes: a step of cutting words out of a document by morphological analysis, estimating the dependency relations between the words, and constructing a syntax tree from those dependency relations; a step of discovering, based on the constraints of a given pattern, frequent patterns that are contained in many of the constructed syntax trees; and a step of retrieving the documents that match an assignment to a discovered frequent pattern. The present invention also includes a computer-readable medium on which is recorded a program for causing a computer to execute the steps of the above method.

[0010] The present invention is further directed to an apparatus for extracting knowledge from a large amount of document data. The knowledge extraction apparatus comprises a language feature analysis device that registers vocabulary not contained in the basic dictionary in a field-dependent dictionary, a language analysis device that performs natural language analysis, a pattern extraction device that discovers data matching a specific pattern, and a frequent pattern display device that displays the extracted frequent patterns. Working from the document data, it uses a basic dictionary covering general fields, production rules for clause generation, production rules for syntax tree generation, and the field-dependent dictionary. With this configuration, knowledge can be suitably extracted from a large number of documents.

[0011]

[Embodiments of the Invention] The language feature analysis device creates a field-dependent dictionary from the document data in order to improve the accuracy of the language analysis device, the language analysis device creates syntax trees that take dependency relations into account, and the pattern extraction device extracts and outputs frequent patterns (that is, knowledge). The invention is described below in the form of an apparatus, but it goes without saying that the invention also covers a method and a program medium. The specific functions are as follows.
1. A function that cuts words out of a document by morphological analysis, estimates the dependency relations between the words, and constructs a syntax tree from those dependency relations.
2. A function that, among the many syntax trees constructed from a large number of documents, discovers frequent patterns contained in many of the trees, based on the constraints of a given pattern.
3. A function that outputs the documents whose syntax trees contain a discovered frequent pattern.

[0012] FIG. 1 schematically shows the process of cutting morphemes out of a sentence, extracting the dependency relations, and generating a syntax tree from them. Morphological analysis and dependency extraction applied to the sentence "when A does B, C does D" in FIG. 1 yield three binary relations: "A" with "does B", "C" with "does D", and "does B" with "does D". The direction of each arrow is determined by the dependency relation between the two words. From these relations, the syntax tree in the figure is generated. A syntax tree is represented as a directed graph (a graph whose branches have a direction). Each vertex of the directed graph (called a node) is labeled with a word cut out by morphological analysis (abbreviated as A to D in the figure). Each branch connecting two nodes (called an arc) has a direction, which, as stated above, is determined by the dependency relation between the words.

As shown in FIG. 2(a), a pattern here means a set of nodes in a syntax tree together with their positional relationship. The number of nodes, that is, of words, is arbitrary. A constraint can be imposed on each word (for example, that it is a verb or a technical term). The positional relationship may be restricted to a fixed one, or, if there are only a few words, it may cover every possible positional relationship. As an example of a pattern, suppose one syntax tree contains two words A and B. If the node labeled B can be reached from the node labeled A by following the directed graph of the syntax tree, and, as shown in FIG. 2(b), B lies within a certain distance, this is written A-*→B and can be used as a pattern. Similarly, if there are further words C and D, and A-*→B and A-*→C-*→D hold at the same time, this is a pattern consisting of four words and their positional relationship. Constraints can be imposed on this pattern as well, for example that A is a verb or a technical term. Discovering frequent patterns means finding those patterns of several words and their positional relationship that occur frequently.
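The directed-graph representation and the A-*→B test of paragraph [0012] can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: `build_tree`, `arc_distance`, and `matches` are hypothetical names, and the arc orientation (each arc points from a word to a word that depends on it) is assumed, since the text only says that the dependency relation decides the direction.

```python
from collections import deque


def build_tree(relations):
    """Build an adjacency list from (source, destination) arcs."""
    graph = {}
    for src, dst in relations:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    return graph


def arc_distance(graph, start, goal):
    """Fewest arcs needed to reach `goal` from `start`, or None if unreachable."""
    if start not in graph:
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def matches(graph, a, b, max_distance):
    """Pattern a-*->b: b is reachable from a within max_distance arcs."""
    dist = arc_distance(graph, a, b)
    return dist is not None and 0 < dist <= max_distance


# The three binary relations of the FIG. 1 sentence, with assumed orientation.
tree = build_tree([("B", "A"), ("D", "C"), ("D", "B")])
```

With this orientation, `matches(tree, "D", "A", 2)` holds because A is two arcs away from D, while `matches(tree, "D", "A", 1)` and `matches(tree, "A", "D", 3)` do not.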

[0013] When the documents are in Japanese or a similar language, a linear list can be built in addition to the syntax tree. A given pattern can be searched for in the linear list in the same way, and in that case processing is faster.

[0014] As for co-occurrence, the greater the distance between two words or phrases in a sentence, the weaker their relationship generally tends to be. The concept of distance is therefore introduced (for example, the number of branches, or arcs, traversed from one node to another in the syntax tree). If the distance is defined as 3, for instance, nodes whose words are so far apart that the distance is 4 or more are treated as having no co-occurrence relation. The distance is set to a value appropriate to the documents being analyzed. FIG. 3 shows the overall configuration of the present invention, and FIG. 4 is a flowchart of its processing. FIG. 5 shows the language analysis device in detail. Using the document data structurally analyzed by this device, the pattern extraction device obtains the frequently occurring patterns and retrieves the original documents whose syntax matches them. The frequent pattern display device displays the discovered frequent patterns and the documents whose syntax trees match them. The components of the invention are described below: 1. the language feature analysis device, 2. the language analysis device, 3. the pattern extraction device, and 4. the frequent pattern display device.

[0015] 1. Language feature analysis device

The language feature analysis device creates a field-dependent dictionary in order to improve the accuracy of the language analysis device. The dictionary adds vocabulary for the specific field that is not contained in a general dictionary and describes the attributes of that vocabulary. Entries are also created for vocabulary whose meaning or attributes differ from field to field. Because the language analysis device requires attributes suited to the data being analyzed, a field-dependent dictionary (for example, a dictionary for rewriting "装置 (19)" as "装置 (H/W)", where 装置 means "device") has to be prepared. For general words such as "装置" (device) and "良−い" (good), the entries prepared initially can be used with any data, but words with specific attributes, such as product names, have to be prepared for each field. The language feature analysis device finds these words in the actual data and registers them in the field-dependent dictionary, using the following procedure.

[0016]
A. Split each sentence into a sequence of part-of-speech-tagged words, using a conventional morphological analyzer and the basic dictionary.
B. Delete from the word sequence the items already registered in the field-dependent dictionary.
C. Compute the frequency of each word and sort the word sequence in descending order of frequency.
D. Find, in this word sequence, the words that correspond to the preset attributes and add them to the field-dependent dictionary.

If each entry in the field-dependent dictionary has the form "part-of-speech-tagged word sequence → word sequence tagged with part of speech or attribute", the required words and attributes can be obtained even when the morphological analyzer splits words incorrectly or assigns the wrong part of speech.

[0017] 2. Language analysis device

The language analysis device includes a morphological analysis device, a clause generation device, a dictionary application device, and a dependency analysis device, each of which is described below.

(1) Morphological analysis device

The input sentence is split into words t by conventional morphological analysis, and then a label l (a name corresponding to a part of speech or an attribute name) is attached to each word in the sequence using the basic dictionary. The distance d between words is also attached as a weight. In what follows, a morpheme is the triple w = [t, l, d]. In addition, a synonym dictionary is used to replace spelling variants and synonyms with a single unified notation.

[0018] (2) Clause generation device

Let w_1, w_2, ..., w_n be the words in the order in which they appear in a sentence (or in a specific context). Clauses are determined according to the production rules, starting from w_1. A clause boundary is placed where w_n is an attached word or wherever the clause can clearly be judged to end. If a clause ends at w_k, the next clause starts at w_{k+1}, and this continues until the end of the sentence. Each clause is expressed as a combination of an independent word and attached words, which become a node of the syntax tree and the arc leaving that node. For words and phrases that carry a prefix such as "反" or "非" (anti-, non-) or an auxiliary verb such as "ない" (not), the sign of the label is inverted.

[0019] (3) Dictionary application device

The words and labels in the word sequence are rewritten using the field-dependent dictionary. Where there is no corresponding attribute name, the part of speech remains as the label. In addition to the word, information such as the part of speech is attached to each node, and information such as the particle is attached to each arc.

[0020] For example, the sentence "装置が良くない訳ではない" ("it is not that the device is not good") produces the output shown below. In the morphological analyzer used here the weight is always 1, so the weight d is omitted from the display. The numbers indicate parts of speech, for example 19: noun, 75: case particle "が", 17: adjective stem, 42: inflectional ending of the continuative form of an adjective. Setting d to ∞ for the period (。) is one simple but effective way of weighting.

(1) Output of the morphological analysis device:
[装置,19] [が,75] [良−い,17] [く,42] [な−い,51] [い,43] [訳,94] [で,56] [は,85] [な−い,51] [い,43]

(2) Output of the clause generation device (each pair of parentheses encloses one clause):
([装置,19] [が,75]) ([良−い,17] [く,42] [な−い,51] [い,43]) ([訳,94] [で,56] [は,85]) ([な−い,51] [い,43])

(3) Output of the dictionary application device (評価 is the attribute "evaluation"):
([装置,H/W] [が,75]) ([良−い,評価] [く,42] [な−い,51] [い,43]) ([訳,94] [で,56] [は,85]) ([な−い,51] [い,43])

In this way, the input sentence is broken down clause by clause, and syntactic structure data in the form of a linear list is created. By further analyzing the dependency relations between the clauses, as described below, syntactic structure data in the form of a directed graph can be created.
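The three outputs of the worked example can be reproduced with a short sketch. It is only an illustration: the patent does not give its production rules, so the set `CLAUSE_FINAL` (labels after which a clause is assumed to end) is a hypothetical rule chosen solely because it reproduces this one example, and the function names are likewise assumptions.

```python
# Output (1) of the example: (word, part-of-speech code) pairs, weight omitted.
MORPHEMES = [
    ("装置", 19), ("が", 75),
    ("良−い", 17), ("く", 42), ("な−い", 51), ("い", 43),
    ("訳", 94), ("で", 56), ("は", 85),
    ("な−い", 51), ("い", 43),
]

# Hypothetical rule: a clause ends after any morpheme carrying one of these codes.
CLAUSE_FINAL = {75, 85, 43}

# Field-dependent dictionary entries taken from the example.
FIELD_DICT = {
    ("装置", 19): ("装置", "H/W"),
    ("良−い", 17): ("良−い", "評価"),
}


def make_clauses(morphemes, clause_final=CLAUSE_FINAL):
    """Group morphemes into clauses, closing a clause after a clause-final code."""
    clauses, current = [], []
    for morpheme in morphemes:
        current.append(morpheme)
        if morpheme[1] in clause_final:
            clauses.append(current)
            current = []
    if current:  # keep any trailing morphemes as a last clause
        clauses.append(current)
    return clauses


def apply_dictionary(clauses, field_dict=FIELD_DICT):
    """Rewrite words and labels that have a field-dependent entry."""
    return [[field_dict.get(m, m) for m in clause] for clause in clauses]


clauses = make_clauses(MORPHEMES)       # corresponds to output (2)
rewritten = apply_dictionary(clauses)   # corresponds to output (3)
```

Morphemes without a dictionary entry keep their part-of-speech code as the label, as paragraph [0019] describes.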

[0021] (4) Dependency generation device

A grammar rule consists of the combination {R_sd, R_si, R_dd, R_di, T}: the independent word (R_sd) and attached word (R_si) of the source node of the dependency, the independent word (R_dd) and attached word (R_di) of the destination node, and the nature of the dependency (T). The grammar rule is applied to a source node N_n and a destination node N_m (n m). If the rule matches, N_n and N_m are judged to be in a dependency relation, and a dependency from N_n to N_m is recorded. A node may have any number of dependencies, as long as the grammar rules are satisfied. Arcs can also be weighted according to the attached word and the nature of the dependency. The syntax tree is created by taking the extracted dependency relations as arcs and attaching the information extracted by the dictionary application device to each node.
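Applying rules of the form {R_sd, R_si, R_dd, R_di, T} to pairs of nodes can be sketched as follows. The sketch makes several assumptions that are not in the source: rule fields are matched against labels, `None` acts as a wildcard, and only pairs with n < m are tried (the relation between n and m is not legible in the text). The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    independent: str  # label of the clause's independent word
    attached: str     # label of its attached word ("" if none)


@dataclass(frozen=True)
class Rule:
    src_independent: Optional[str]  # R_sd (None = any, an assumption)
    src_attached: Optional[str]     # R_si
    dst_independent: Optional[str]  # R_dd
    dst_attached: Optional[str]     # R_di
    kind: str                       # T, the nature of the dependency

    def matches(self, src: Node, dst: Node) -> bool:
        wanted_vs_actual = [
            (self.src_independent, src.independent),
            (self.src_attached, src.attached),
            (self.dst_independent, dst.independent),
            (self.dst_attached, dst.attached),
        ]
        return all(want is None or want == got for want, got in wanted_vs_actual)


def find_dependencies(nodes, rules):
    """Return (n, m, T) for every node pair n < m that satisfies some rule."""
    arcs = []
    for n in range(len(nodes)):
        for m in range(n + 1, len(nodes)):  # assumed ordering n < m
            for rule in rules:
                if rule.matches(nodes[n], nodes[m]):
                    arcs.append((n, m, rule.kind))
    return arcs


nodes = [
    Node("noun", "ga"),
    Node("adjective", ""),
    Node("noun", "wa"),
    Node("adjective", ""),
]
rules = [Rule("noun", "ga", "adjective", None, "subject")]
arcs = find_dependencies(nodes, rules)
```

Because a node may carry any number of dependencies, the first node here receives an arc to each adjective node that satisfies the rule.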

[0022] 3. Pattern extraction device

The pattern extraction device includes a frequent pattern extraction device and a specific-pattern-matching document extraction device, each of which is described below.

[0023] (1) Frequent pattern extraction device

Consider a pattern consisting of four words (call them Va, Vb, Na, Nb) with the positional relationship Va-*→Vb-*→Nb, Va-*→Na, under the constraints that Va and Vb are verbs and Na and Nb are nouns. Given such a pattern, the frequent pattern extraction device searches the words in each syntax tree for sets of words (Va-Vb-Na-Nb) in which Va and Na, Vb and Nb, and Va and Vb are in dependency relations, Va and Vb are verbs, and Na and Nb are nouns, and it tallies these sets.

[0024] One concrete way of implementing this is as follows.

(1)
A. Analyze the syntax tree, find a verb node, examine the verb nodes that lie within a short distance of it, and obtain the verb-verb pairs that are in a verb-to-verb dependency relation. If there are several paths, the distance along the shortest route is used. For example, if following the directed graph from node Va leads to a node Vb within a given distance, the pair of Va and Vb qualifies. This is done for every verb node in the syntax tree. Suppose that Va-Vb and Vb-Vc are found.
B. As in A, analyze the syntax tree, examine the noun nodes that lie within a short distance of each verb node, and obtain the verb-noun pairs that are in a dependency relation. Suppose that the pairs Va-Na, Vb-Nb, and Vc-Nc are found.
C. From the verb-verb dependency pairs obtained in A and the verb-noun dependency pairs obtained in B, obtain sets of four words. For example, if Va-Vb is found in A and Va-Na and Vb-Nb are found in B, the four-word set (Va-Na-Vb-Nb) is tallied, as shown in FIG. 7. Likewise, (Vb-Nb-Vc-Nc) is tallied.

[0025] (2) Steps A, B, and C above are performed for all documents (syntax trees), and the combinations that occur frequently among the tallied four-word sets are output.

(3) Now consider extracting a frequent pattern with more elements. Take a pattern of six words (Va, Vb, Vc, Na, Nb, Nc) with the positional relationship Va-*→Vb-*→Vc-*→Nc, Va-*→Na, Vb-*→Nb, under the constraints that Va, Vb, and Vc are verbs and Na, Nb, and Nc are nouns. Given such a pattern, the procedure is similar: check whether the verb-verb pairs obtained in A contain the pairs Va-Vb and Vb-Vc (Va in a dependency relation with Vb, and Vb with Vc), and then use the verb-noun dependency pairs obtained in B to extract six-word sets as shown in FIG. 8.
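Steps A to C and the tally over all documents can be sketched as follows. This is one possible reading, not the patent's implementation: each syntax tree is assumed to be an adjacency list whose arcs run from verbs toward the words reachable from them, part-of-speech tags are plain strings, and the function names, the `min_count` threshold, and the rule that the two nouns must differ are assumptions of this sketch.

```python
from collections import Counter, deque
from itertools import product


def within(graph, start, max_distance):
    """Nodes reachable from start in 1..max_distance arcs (shortest path, BFS)."""
    found, queue, seen = {}, deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if dist == max_distance:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                found[nxt] = dist + 1
                queue.append((nxt, dist + 1))
    return found


def four_word_sets(graph, pos, max_distance):
    """Steps A-C for one syntax tree: sets (Va, Na, Vb, Nb)."""
    verb_verb, verb_noun = [], {}
    for verb in (w for w, tag in pos.items() if tag == "verb"):
        near = within(graph, verb, max_distance)
        verb_verb += [(verb, w) for w in near if pos.get(w) == "verb"]  # step A
        verb_noun[verb] = [w for w in near if pos.get(w) == "noun"]     # step B
    return {                                                            # step C
        (va, na, vb, nb)
        for va, vb in verb_verb
        for na, nb in product(verb_noun[va], verb_noun[vb])
        if na != nb
    }


def frequent_sets(documents, min_count, max_distance):
    """Tally the four-word sets over all trees and keep the frequent ones."""
    counts = Counter()
    for graph, pos in documents:
        counts.update(four_word_sets(graph, pos, max_distance))
    return [(s, n) for s, n in counts.most_common() if n >= min_count]


# The example of paragraph [0024]: Va-Vb, Vb-Vc, Va-Na, Vb-Nb, Vc-Nc.
graph1 = {"Va": ["Vb", "Na"], "Vb": ["Vc", "Nb"], "Vc": ["Nc"]}
pos1 = {"Va": "verb", "Vb": "verb", "Vc": "verb",
        "Na": "noun", "Nb": "noun", "Nc": "noun"}
graph2 = {"Va": ["Vb", "Na"], "Vb": ["Nb"]}
pos2 = {"Va": "verb", "Vb": "verb", "Na": "noun", "Nb": "noun"}

sets1 = four_word_sets(graph1, pos1, max_distance=1)
frequent = frequent_sets([(graph1, pos1), (graph2, pos2)],
                         min_count=2, max_distance=1)
```

The six-word case in (3) could be handled the same way by chaining two verb-verb pairs before attaching the nouns.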

[0026] (2) Specific pattern matching document extraction device
Documents satisfying a frequent pattern are extracted from the large document collection and output. This can be realized by checking whether the syntax analysis data (syntax tree data) of each document contains all the words and attributes that make up the specific pattern and, if it does, whether the required dependency relationships hold between those words.
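The two-stage check described here (first the words, then the dependencies) can be sketched in Python. The sketch makes simplifying assumptions: a syntax tree is represented only as a collection of (dependent, head) word pairs, attributes are ignored, and a dependency is taken to mean a direct arc. The function names and the sample documents are hypothetical.

```python
def matches_pattern(tree_edges, pattern_edges):
    """Return True if every word and every dependency required by the
    pattern is present in the document's syntax tree."""
    tree_edges = set(tree_edges)
    tree_words = {w for edge in tree_edges for w in edge}
    pattern_words = {w for edge in pattern_edges for w in edge}
    # First check: all pattern words occur in the document.
    if not pattern_words <= tree_words:
        return False
    # Second check: the words are linked by the required dependencies.
    return all(edge in tree_edges for edge in pattern_edges)

def matching_documents(documents, pattern_edges):
    """Return the ids of all documents whose tree matches the pattern."""
    return [doc_id for doc_id, edges in documents.items()
            if matches_pattern(edges, pattern_edges)]

documents = {
    "d1": [("power", "turn on"), ("turn on", "request"),
           ("floppy disk", "request")],
    "d2": [("power", "turn on"), ("floppy disk", "insert")],
    "d3": [("power", "request"), ("floppy disk", "turn on")],
}
pattern = [("power", "turn on"), ("floppy disk", "request")]
hits = matching_documents(documents, pattern)
```

Document `d3` contains all four words but links them differently, which is exactly the case that the dependency check is meant to reject.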

[0027] (3) Pattern extraction from a linear list
In the language analyzer, syntax analysis data having a linear list structure is constructed as the data fed to the dependency analyzer, and patterns can also be extracted from this data as follows.
(1) For a sequence w^* of morphemes w (the elements of the linear list), each carrying a weighted distance, a dependency search range is set between 0 and ∞. Each w is a triple of a word t, a label l representing a part of speech or an attribute, and a weighted distance d to the word immediately to its right (w = [t, l, d]). A search range of 0 means that only the word at the position where the search starts is examined; a value of 1 means that the preceding and following words are also candidates in the dependency search.
(2) A search pattern can be expressed as P = <p_1, p_2, ..., p_n>, where p_1, ..., p_n ∈ {[t, l]}. Each p_i (i = 1, 2, ..., n) is a pair of a word t and a label l representing a part of speech or an attribute, and P is the ordered sequence of p_1, p_2, ..., p_n. Here, p_i and the following p_(i+1) must lie within the dependency search range specified in (1). A pattern can also be written using regular expressions. The text [t, l, d]^* is searched for matches of the pattern P, and among the matching subsets of the linear list, the one whose weighted distance d = Σ(d_1, ..., d_n) is smallest is selected, where d_1, ..., d_n are the weighted distances of the words from the first to the last word matching the pattern.
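A minimal Python sketch of this search is given below. It rests on several assumptions that the text leaves open: the search range is counted in list positions, successive pattern elements are searched only to the right, "*" is the only wildcard supported (no regular expressions), and the weighted distance of a match is the sum of d over the elements from the first matched word up to, but not including, the last. The names `element_matches` and `best_match` and the sample list are hypothetical.

```python
def element_matches(element, p):
    """element is (t, l, d); p is (t, l), where "*" matches anything."""
    t, l, _ = element
    pt, pl = p
    return pt in ("*", t) and pl in ("*", l)

def best_match(words, pattern, search_range=None):
    """Return (total_distance, indices) for the match with the smallest
    weighted distance, or None. search_range=None means unbounded."""
    if not pattern:
        return None
    best = None

    def extend(indices):
        nonlocal best
        if len(indices) == len(pattern):
            # Sum the distances spanned by the match.
            total = sum(d for _, _, d in words[indices[0]:indices[-1]])
            if best is None or total < best[0]:
                best = (total, indices)
            return
        if not indices:
            candidates = range(len(words))
        else:
            last = indices[-1]
            stop = (len(words) if search_range is None
                    else min(len(words), last + search_range + 1))
            candidates = range(last + 1, stop)
        for i in candidates:
            if element_matches(words[i], pattern[len(indices)]):
                extend(indices + [i])

    extend([])
    return best

# Hypothetical linear list of [t, l, d] elements.
words = [("printer", "H/W", 5), ("slow", "evaluation", 1),
         ("device", "H/W", 1), ("good", "evaluation", 0)]
pattern = [("*", "H/W"), ("*", "evaluation")]
result = best_match(words, pattern)
```

Three candidate matches exist in this list, and the one spanning "device" and "good" is chosen because its weighted distance is the smallest, even though "printer" and "slow" are equally close in list position.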

[0028] (3) The extracted information is the set of words matching the pattern, taken from the input word sequence [t, l, d]^* (each word consisting of a name t, an attribute name l, and a distance d to the word on its right) for a given search range and search pattern. For example, take the sentence "It is not that the device is not good." From the linear list built from this sentence (see [0020]), the pattern P = <[*, H/W]> extracts the element [device, H/W] (distance omitted), which matches the attribute name "H/W". The pattern P = <[*, H/W|S/W], [*, evaluation]> searches the text for sets of elements matching the compound attribute [H/W]-[evaluation] or [S/W]-[evaluation]; in this example it extracts [device, H/W]-[good, evaluation].
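The two patterns in this example can be tried out with a small Python sketch that treats each pattern component as a regular expression, so that "H/W|S/W" works as an alternation. The linear list below is a hypothetical rendering of the example sentence (the list from [0020] is not reproduced here, and the "negation" label is an assumption); distances and the search range are left out for brevity.

```python
import re

def element_matches(element, p):
    """element = (word, label); p = (word_regex, label_regex).
    A bare "*" is treated as "match anything"."""
    return all(spec == "*" or re.fullmatch(spec, value) is not None
               for spec, value in zip(p, element))

def find_matches(words, pattern, start=0):
    """Return every tuple of elements that matches the pattern in order."""
    if not pattern:
        return [()]
    found = []
    for i in range(start, len(words)):
        if element_matches(words[i], pattern[0]):
            for rest in find_matches(words, pattern[1:], i + 1):
                found.append((words[i],) + rest)
    return found

# Hypothetical linear list for "It is not that the device is not good."
words = [("device", "H/W"), ("good", "evaluation"),
         ("not", "negation"), ("not", "negation")]

single = find_matches(words, [("*", "H/W")])
pair = find_matches(words, [("*", "H/W|S/W"), ("*", "evaluation")])
```

Using `re.fullmatch` rather than `re.search` keeps a label such as "H/W" from matching a longer label that merely contains it.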

[0029] 4. Frequent pattern display device
This device displays the frequent patterns found by the pattern extraction device together with the documents whose syntax trees match them.

[0030] The effectiveness of this method was confirmed by applying it to 90,000 sentences of call data created in actual call center operations. One specific example of the embodiment is described below. First, morphological analysis, a conventional technique, is applied to each document, and the dependency analyzer builds a syntax tree. As an example, take the simple sentence "When the power is turned on, a picture requesting a floppy disk appears." From this sentence, a syntax tree (directed graph) such as that in FIG. 9 is constructed. In this graph, the directed arcs represent dependency relationships between phrases. The box at the upper right of each node (phrase) indicates whether the word is a verb or a noun (N denotes a noun, V a verb).

[0031] There are 85 grammar rules for building this syntax tree, and they are simple ones; for example, if the phrase at a node is a verb in its attributive form, a dependency is established to a noun appearing after that node. In this example, all arc weights are equal to 1. In the directed graph, the number of branches (arcs) traversed from one node to another is defined as the distance. For example, "request" can be reached from "power" via two arcs, so the distance is 2. When there are multiple paths, the shortest path is used. As the knowledge to be extracted, only pairs at a distance of 3 or less are considered here. Keeping the distance reasonably short in this way makes it possible to exclude dependencies between words that are presumed to be unrelated. Finding verb-noun dependencies in the syntax tree above yields pairs of nearby phrases such as "appear"-"picture", "request"-"floppy disk", and "turn on"-"power".
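Because every arc has weight 1, the distance defined here is a shortest-path length that a breadth-first search can compute. The Python sketch below assumes a hypothetical set of arcs for the example sentence (FIG. 9 is not reproduced here), stored as a dictionary from each node to the nodes its arcs point to; with unequal arc weights, a weighted shortest-path algorithm would be needed instead.

```python
from collections import deque

def distance(graph, source, target):
    """Number of arcs on the shortest directed path from source to
    target, or None if target cannot be reached."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Hypothetical arcs, chosen for illustration only.
graph = {
    "power": ["turn on"],
    "turn on": ["request"],
    "floppy disk": ["request"],
    "request": ["picture"],
    "picture": ["appear"],
}

MAX_DISTANCE = 3  # only pairs within this distance are kept
d_power_request = distance(graph, "power", "request")
d_power_appear = distance(graph, "power", "appear")
keep = d_power_appear is not None and d_power_appear <= MAX_DISTANCE
```

With these arcs, "power" and "request" are two arcs apart and are kept, while "power" and "appear" are four arcs apart and fall outside the limit of 3.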

[0032] Further, finding verb-verb dependency pairs yields "request"-"turn on" and "appear"-"turn on". From the verb-verb and verb-noun pairs thus obtained, selecting those in the dependency relationships V1-V2, V1-N1, and V2-N2 yields four-word sets such as "power", "turn on", "floppy disk", "request" and "floppy disk", "request", "picture", "appear". The six-word set "power", "turn on", "floppy disk", "request", "picture", "appear" can also be extracted. By tallying the four-word and six-word sets extracted in this way, the documents in a large collection that use the same words in the same dependency structure can be counted.

[0033] Sets of four words (that is, knowledge) with the structure "noun 2"-"verb 2", "noun 1"-"verb 1", "verb 1"-"verb 2" were extracted from the call records of an actual call center. Knowledge consisting of the four words "additional H/W"-"remove" and "BIOS"-"return" was extracted. The sentences from which this knowledge was extracted include the following: "Remove the additional H/W, restore the BIOS, repartition with FDISK, and return to the factory state with the recovery CD"; "Even after removing the additional H/W, restoring the BIOS, and returning to the factory state with the recovery CD, the ISDN card cannot be used"; and "Even after removing all additional H/W and returning the BIOS to the factory settings with F5, the resume function item cannot be restored; judged to be a BIOS or H/W fault that requires investigation at the service center."

[0034] In addition, knowledge consisting of the four words "file"-"not found" and "message"-"appear" was extracted. The sentences from which this knowledge was extracted include the following: "A message appears saying that the file for the program file error cannot be found"; "The message 'or a required file cannot be found' has started to appear, and I want to get rid of the message"; and "Even when I enter \INSTALL at X, an error message to the effect of 'file not found' appears and I cannot install."

[0035] In addition, knowledge consisting of the six words "PC"-"display", "OS"-"return", and "method"-"not know" was extracted. The sentences from which this knowledge was extracted include the following: "PC model A: white characters are displayed on a black screen, and I don't know how to return from XX mode to the OS"; "PC model A: after selecting a game, a command prompt is displayed and I don't know how to return to the OS"; and "PC model A: after selecting the Japanese DOS game icon, 'Cで￥OS' is displayed in white characters on a black screen, and I don't know how to return to the OS."

[0036] Furthermore, knowledge consisting of the six words "power"-"turn on", "floppy disk"-"request", and "picture"-"appear" was extracted. The sentences from which this knowledge was extracted include the following: "When the power is turned on, a picture requesting a floppy disk appears"; "I tried to check the network settings, but when the power is turned on a picture requesting a floppy disk appears and the OS cannot start"; and "When the power is turned on, a picture requesting a floppy disk appears; the BIOS recognizes the hard disk."

[0037] Furthermore, knowledge consisting of the six words "Internet"-"connect", "dial tone"-"cannot hear", and "message"-"appear" was extracted. The sentences from which this knowledge was extracted include the following: "When I try to connect to the Internet on model A, the message 'No dial tone can be heard' appears and the connection fails"; "When I try to connect to the Internet, the message 'cannot hear a dial tone' appears and I cannot connect"; and "When I try to connect to a provider over the Internet on model A, the message 'No dial tone can be heard' appears."

[0038] The advantages of the knowledge extraction (frequent pattern discovery) method according to the present invention include the following.
(1) It can extract patterns that could not be obtained by applying conventional data mining of co-occurrence or ordering relationships using keywords alone. It also avoids finding patterns that the conventional techniques find erroneously.
(2) The extracted knowledge (frequent patterns) is easy for humans to understand and highly legible.
(3) Using the linear list in combination speeds up processing.

[0039]

[Effects of the Invention] The present invention makes it possible to extract, more appropriately and without error, knowledge that conventional data mining techniques either could not find or found erroneously. The extracted knowledge is also highly legible and easy for humans to understand. For example, a corporate call center can find documents with nearly identical content among a large number of documents and examine the content that appears most often. It can then reduce the number of inquiries by creating FAQs for the topics customers ask about most and posting them on the company website, and it can readily reduce the time needed to answer by informing operators of that content in advance.

[Brief description of the drawings]

FIG. 1 is a diagram showing the process of creating a syntax tree from natural language.

FIG. 2 is a diagram showing a pattern.

FIG. 3 is a diagram showing the overall configuration of the present invention.

FIG. 4 is a flowchart of the processing of the present invention.

FIG. 5 is a diagram showing details of the language analyzer.

FIG. 6 is a diagram showing the pattern extraction device.

FIG. 7 is a diagram showing an extracted set (pattern) of four words.

FIG. 8 is a diagram showing an extracted set (pattern) of six words.

FIG. 9 is a diagram showing an example of a pattern.

Continued from the front page
(51) Int. Cl.⁷, identification symbol, FI, theme code (reference): G06F 15/401 330Z
(72) Inventor: Hiroshi Matsuzawa (松澤裕史), c/o IBM Japan, Ltd., Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa
(72) Inventor: Takeshi Fukuda (福田剛志), c/o IBM Japan, Ltd., Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa
(72) Inventor: Tetsuya Nasukawa (那須川哲哉), c/o IBM Japan, Ltd., Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa
(72) Inventor: Tohru Nagano (長野徹), c/o IBM Japan, Ltd., Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa
(72) Inventor: Masayuki Morohashi (諸橋正幸), c/o Faculty of Management and Information Sciences, Tama University, 4-1-1 Hijirigaoka, Tama-shi, Tokyo
F-term (reference): 5B075 ND03 NK31 NK32 NK43 PP24 PR04 UU40 5B091 AA15 CA02 CA05 CC01 CC02 CC05

Claims

1. A method for extracting knowledge from a large amount of document data, comprising: a step of extracting words from each document by morphological analysis, estimating dependency relationships between the words, and constructing a syntax tree from the dependency relationships; a step of finding, based on pattern constraints, frequent patterns contained in many of the constructed syntax trees; and a step of searching for documents that match an assignment to the found frequent pattern.

2. The knowledge extraction method according to claim 1, wherein in the step of constructing the syntax tree, a linear list is constructed, and a frequent pattern is found using the constructed linear list.

3. The knowledge extraction method according to claim 2, wherein in the step of finding frequent patterns, knowledge is extracted by searching the linear list for a pattern in which a combination of a search range, words, and labels is described using regular expressions.

4. A knowledge extraction apparatus for extracting knowledge from a large amount of document data, comprising: a language feature analyzer that registers vocabulary not included in a basic dictionary in a field-dependent dictionary; a language analyzer that performs natural language analysis; a pattern extraction device that finds data matching a specific pattern based on pattern constraints; and a frequent pattern display device that displays the extracted frequent patterns, wherein knowledge is extracted from the document data with reference to a basic dictionary for general fields, generation rules for phrase generation processing, generation rules for syntax tree generation, and the field-dependent dictionary.

5. The knowledge extraction apparatus according to claim 4, wherein the language feature analyzer includes means for dividing an input document into a sequence of part-of-speech-tagged words using a morphological analysis dictionary, deleting already-registered words from the word sequence using the field-dependent dictionary, calculating the frequency of occurrence of the remaining words, sorting the words in order of frequency, and additionally registering them in the field-dependent dictionary.

6. The knowledge extraction apparatus according to claim 4, wherein the language analyzer includes a morphological analysis device, a phrase generation device, a dictionary application device, and a dependency analysis device, and includes means for generating syntax analysis data in the form of a linear list and a syntax tree that take distance, dependency, and labels into account according to phrase generation rules and syntax tree generation rules, and wherein the morphological analysis device includes means for dividing the input document into words by morphological analysis, attaching labels containing parts of speech or attributes, and unifying notation using a synonym dictionary.

7. The knowledge extraction apparatus according to claim 4, wherein the pattern extraction device includes a frequent pattern extraction device and a specific pattern matching document extraction device; the frequent pattern extraction device includes means for examining co-occurrence relationships in the syntax analysis data based on combinations of words, positional relationships between words, and labels, and extracting frequent patterns; and the specific pattern matching document extraction device includes means for extracting and outputting documents that match a frequent pattern by checking whether the syntax analysis data contains the words and attributes making up the specific pattern and whether dependency relationships hold between the phrases.

8. The knowledge extraction apparatus according to claim 4, wherein the frequent pattern display device includes display means for displaying the frequent patterns found by the pattern extraction device and the documents having syntax trees that match them.

9. A computer-readable medium storing a program for extracting knowledge from a large amount of document data, the program causing a computer to execute: a step of extracting words from each document by morphological analysis, estimating dependency relationships between the words, and constructing a syntax tree from the dependency relationships; a step of finding, based on pattern constraints, frequent patterns contained in many of the constructed syntax trees; and a step of searching for documents that match an assignment to the found frequent pattern.