JP2003296345A

JP2003296345A - Input template preparation knowledge extraction method from text db

Info

Publication number: JP2003296345A
Application number: JP2002103346A
Authority: JP
Inventors: Toru Hisamitsu; 徹久光; Masakazu Fujio; 正和藤尾; Yoshikazu Iketa; 嘉一井桁
Original assignee: Hitachi Ltd; Hitachi Medical Corp
Current assignee: Hitachi Ltd; Hitachi Healthcare Manufacturing Ltd
Priority date: 2002-04-05
Filing date: 2002-04-05
Publication date: 2003-10-17
Anticipated expiration: 2022-04-05
Also published as: JP4137490B2

Abstract

PROBLEM TO BE SOLVED: To drastically reduce the burden of manual template description and to extract fine-grained template knowledge that could not be extracted by ordinary data mining methods.

SOLUTION: Morphological analysis and dependency analysis are applied to the sentences in a DB, and candidates for typical description patterns, called generalized case expressions, are generated. Template candidates of appropriate granularity can then be generated automatically by analyzing the frequency of words and generalized case expressions in the finding fields of records whose judgment field satisfies a prescribed condition.

COPYRIGHT: (C)2004,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a DB that stores records, each containing a finding field, which holds a set of sentences describing facts, and a judgment field, which holds a set of sentences stating some judgment about those facts. Specifically, it relates to a technique for extracting templates that support input of the finding field for records whose judgment field satisfies a specific condition. DBs storing text data of this kind are fundamental to many kinds of work. To exploit the large volume of accumulated data as an intellectual asset, the formal consistency and homogeneity of the recorded content must be improved, and the importance of template-based means of supporting user input to achieve this has become widely recognized. To generate such templates, the present invention extracts knowledge that categorizes the content recorded in the DB and uses it to generate templates that guide user input.

[0002] An example of such a DB is the data known as "image interpretation reports", in which a physician images part of a patient's body with one of various imaging devices and records findings and a diagnosis for the resulting images. Hospitals that are well advanced in computerization have already accumulated large volumes of image interpretation reports, yet an input support system is strongly needed both to reduce physicians' workload and to improve the quality of report content. In these respects such a DB is a typical target for the present invention. The conventional techniques and the present invention are therefore described below using a DB of image interpretation reports as the example.

[0003]

[Prior Art] One conventional approach is to interview experts (physicians, in the case of image interpretation reports) to obtain the knowledge needed to create templates for an input support system. The problem with this approach is its extremely heavy human cost, which has been a bottleneck in system development.

[0004] As an alternative that reduces the human burden, one could apply general data mining techniques to derive "association rules". In data mining, each individual element appearing in the DB is called an item, a set of items treated as one unit of processing is called a transaction, and a set of any number of items is called an itemset. Given positive numbers α and β fixed in advance, a pair of mutually disjoint itemsets X and Y is written X⇒Y and called an association rule if the proportion of all transactions that contain X∪Y is at least α and the proportion of transactions containing X that also contain Y is at least β. The underlying idea is to define "X and Y are related" as "X and Y tend strongly to occur in the same transaction". The proportion of all transactions that contain X∪Y is called the support of X⇒Y, and the proportion of transactions containing X that also contain Y is called the confidence of X⇒Y. Methods have been developed for obtaining, for suitable α and β, the association rules whose support is at least α and whose confidence is at least β (Agrawal, A., Imielinski, T., and Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD Conferences on Management of Data, pp.94-105).
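The support and confidence definitions above can be made concrete with a small sketch. The function names and the toy basket data are illustrative assumptions, not part of the patent; the code only evaluates a given pair (X, Y) and does not implement the rule-mining algorithm of the cited paper.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], x: FrozenSet[str], y: FrozenSet[str]) -> float:
    """Proportion of all transactions that contain every item of X ∪ Y."""
    xy = x | y
    return sum(1 for t in transactions if xy <= t) / len(transactions)


def confidence(transactions: List[Transaction], x: FrozenSet[str], y: FrozenSet[str]) -> float:
    """Among transactions containing X, the proportion that also contain Y."""
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y <= t) / len(with_x)


def is_association_rule(transactions: List[Transaction], x: FrozenSet[str],
                        y: FrozenSet[str], alpha: float, beta: float) -> bool:
    """True if X => Y has support >= alpha and confidence >= beta."""
    if x & y:
        raise ValueError("X and Y must be disjoint")
    return (support(transactions, x, y) >= alpha
            and confidence(transactions, x, y) >= beta)


# Toy data (hypothetical): four supermarket baskets.
baskets = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread"}),
    frozenset({"milk"}),
]
X = frozenset({"bread"})
Y = frozenset({"butter"})
```

In this toy data X∪Y occurs in two of four baskets and X in three, so the rule passes with α = 0.5 and β = 0.6 but fails once β is raised to 0.7.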

[0005]

[Problems to Be Solved by the Invention] The basic framework above applies directly to typical cases such as "basket analysis", which examines a DB of supermarket sales records for "item group X and item group Y that tend strongly to sell together". It runs into problems, however, when the goal is, for example, to extract from a DB of image interpretation reports the knowledge needed to support the entry of findings. First, in a typical case such as supermarket basket analysis, each item is an individual product sold and a transaction is "what was purchased together", so the definitions are self-evident.

[0006] In the problem addressed here, by contrast, the situation is that when the judgment field says "gastric cancer", the finding field often contains something like "irregularity of the gastric mucosa is seen on UGI". Even granting that it is natural to take one record as the unit of transaction, it is not necessarily obvious from general data mining theory that the statement "gastric cancer" or the sentence "irregularity of the gastric mucosa is seen on UGI" should be treated as a unit item. (One could, for example, simply define each word as an item.) Moreover, even if sentences are taken as unit items as above, merely deriving many sentence-to-sentence pairs is insufficient for extracting the intended "knowledge for template generation". Extracting that knowledge requires grouping sentences that share the same content. To do so, morphological analysis and dependency analysis must be performed using a dictionary that records parts of speech, case slots (for verbs), semantic codes and the like; the dependency relations between words in each sentence must be recognized regardless of differences in phrasing and word order; and from these the principal dependency relations that characterize the finding fields satisfying the condition, that is, the typical description patterns mentioned earlier, must be extracted. Once the typical description patterns have been extracted, processing can move on to steps such as collecting the variations of the modifiers attached to the noun phrases appearing in them and presenting these as lists. The reason for going through typical description patterns first is that if individual sentences were treated as description patterns as they stand, sentences expressing essentially the same content would be treated as different because of differences in word order or minor modifiers, the frequency of each description pattern would become extremely low, and characteristic patterns would be hard to find. Clearly, none of this can be achieved within the conventional framework.

[0007] The problem to be solved by the present invention is to greatly reduce the burden of manual template description while extracting knowledge, in a form usable for describing input templates, that could not be extracted by ordinary data mining methods as explained above.

[0008]

[Means for Solving the Problems] The goal is to extract knowledge for creating input support templates for a DB that stores data consisting of a finding field and a corresponding judgment field. A list is therefore prepared of "words that express a judgment and for which template input of the associated findings is desired". This list may be specified by the user, or it may be a list of words automatically determined to be especially frequent in the judgment field. The set of records containing a word in this list (a disease name, in the case of image interpretation reports) is obtained, and typical description patterns are extracted that are specific enough to characterize the finding fields of those records yet free of unnecessary detail.

[0009] To this end, the content of the sentences in the above set of finding fields must first be recognized in a way that absorbs differences in phrasing and word order. To "ignore differences in phrasing and word order", each sentence is represented as a set of word dependencies centered on the sentence-final verb VP. That is, letting {C_1, …, C_VP(m)} be the case slots that VP can grammatically take, the noun phrase NP_i corresponding to each case slot C_i is identified, and the sentence is converted into the tuple of the verb and noun phrases {VP, NP_1, …, NP_V(m)}. When the noun phrase corresponding to C_i is a long one containing adjectives or adnominal modifiers, NP_i is taken to be its head with those modifiers removed, that is, the final noun (a simple or compound noun). The resulting {VP, NP_1, …, NP_V(m)} is called a case expression below. In practice a case may be omitted, in which case NP_i is replaced by one fixed symbol called the wildcard. Furthermore, from each {VP, NP_1, …, NP_V(m)} obtained above, variants are generated by replacing any number of the non-wildcard elements of {NP_1, …, NP_V(m)} with wildcards. Because an expression whose case-slot fillers are left unspecified expresses more general content, these are called generalized case expressions below.
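The generalization step can be sketched as enumerating every subset of the filled case slots and replacing that subset with the wildcard. The tuple layout `(verb, ((case, noun), ...))`, the `*` symbol, and the English glosses used as tokens are assumptions made for illustration only.

```python
from itertools import combinations
from typing import List, Tuple

WILDCARD = "*"  # assumed wildcard symbol

# (verb, ((case marker, head noun or WILDCARD), ...))
CaseExpression = Tuple[str, Tuple[Tuple[str, str], ...]]


def generalize(expr: CaseExpression) -> List[CaseExpression]:
    """Return expr plus every variant with some non-wildcard slots wildcarded."""
    verb, slots = expr
    concrete = [i for i, (_, noun) in enumerate(slots) if noun != WILDCARD]
    results = []
    for r in range(len(concrete) + 1):
        for chosen in combinations(concrete, r):
            new_slots = tuple(
                (case, WILDCARD if i in chosen else noun)
                for i, (case, noun) in enumerate(slots)
            )
            results.append((verb, new_slots))
    return results


# A case expression with two filled slots (ni-case and wo-case).
expr = ("observe", (("ni", "pyloric region"), ("wo", "irregularity")))
variants = generalize(expr)
```

A case expression with m filled slots yields 2^m generalized case expressions, so the two-slot example produces four, from the fully specified form down to the form with both slots wildcarded.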

[0010] All generalized case expressions are generated from every sentence in the finding fields of the DB, and the generalized case expressions, together with the words in the DB, are made the objects of frequency analysis. Among the generalized case expressions appearing in the finding fields of records whose judgment field satisfies a specific condition, those are extracted that are the most specific, that is, have the fewest wildcard noun phrases, among the ones satisfying a predetermined "high frequency" condition. This yields generalized case expressions corresponding to "phrasing that is specific enough to characterize the finding field but free of unnecessary detail". These can be regarded as the typical description patterns, mentioned earlier, of "the finding fields corresponding to judgment fields that satisfy a specific condition". Once typical description patterns have been found, for those that are selected, the wildcards are first restored to concrete noun phrases, which are presented to the user for selection. Then, for the noun phrases selected, their modifiers are collected and presented to the user for selection.
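The selection rule in this paragraph can be sketched as a two-stage filter: keep the expressions that meet the "high frequency" condition, then keep those with the fewest wildcards. The code below takes the condition to be a plain count threshold (`min_count`), which is an assumption; the text leaves the condition open, and a per-verb variant of the rule would be an equally plausible reading. The counts are invented for illustration.

```python
from collections import Counter
from typing import List, Tuple

WILDCARD = "*"  # assumed wildcard symbol
CaseExpression = Tuple[str, Tuple[Tuple[str, str], ...]]


def wildcard_count(expr: CaseExpression) -> int:
    """Number of case slots whose filler is the wildcard."""
    return sum(1 for _, noun in expr[1] if noun == WILDCARD)


def most_specific_frequent(counts: Counter, min_count: int) -> List[CaseExpression]:
    """Among expressions with count >= min_count, return those with the fewest wildcards."""
    frequent = [e for e, c in counts.items() if c >= min_count]
    if not frequent:
        return []
    fewest = min(wildcard_count(e) for e in frequent)
    return sorted(e for e in frequent if wildcard_count(e) == fewest)


# Hypothetical counts in the finding fields of records judged "gastric cancer".
counts = Counter({
    ("observe", (("ni", "*"), ("wo", "*"))): 40,
    ("observe", (("ni", "*"), ("wo", "irregularity"))): 25,
    ("observe", (("ni", "pyloric region"), ("wo", "*"))): 6,
    ("observe", (("ni", "pyloric region"), ("wo", "irregularity"))): 3,
})
patterns = most_specific_frequent(counts, min_count=20)
```

With a threshold of 20 the fully wildcarded form and the form keeping only "irregularity" both qualify, and the latter is returned because it has one wildcard rather than two.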

[0011] The above can be organized into the following steps.

(1) Recognition of the words in each sentence and extraction of the dependency relations between words.
(2) Extraction of case expressions centered on the sentence-final verb, and their generalization.
(3) Creation of an index file containing the words and the generalized case expressions.
(4) Preparation of the list of "words that express a judgment and for which template input of the associated findings is desired".
(5) Using the index file, extraction as typical description patterns of the generalized case expressions that occur with distinctively high frequency in the finding fields of records whose judgment field satisfies a specific condition (contains a given disease name, in the case of image interpretation reports).
(6) Presentation to the user of the generalized case expressions extracted in (5), and selection by the user of any number of them.
(7) For each generalized case expression selected in (6), restoration and presentation of the noun phrases that had been replaced by wildcards in it, and selection by the user of any number of them.
(8) For each noun phrase selected in (7), restoration and presentation of the modifiers that modified it in the original text, and selection by the user of any of them.

Step (4) differs in nature from the others. It is unnecessary when the user supplies the words, but when the target words are extracted automatically it needs the results of morphological analysis and the word frequency information, so it must be placed immediately before (5).

[0012]

[Embodiments of the Invention] FIG. 1 shows an example system configuration for implementing the knowledge extraction method described above. Reference numeral 101 denotes a storage device, which uses a hard disk or the like to store document data, the various program modules and so on. It also serves as the working area for the program modules.

[0013] In what follows, 1010 is a control module that manages the user interface and the exchange of data among the other modules; 1011 is a document data DB that stores the documents from which knowledge is to be extracted; and 1012 is a dictionary for morphological analysis and dependency analysis, which records the part of speech of each word, case slots (for verbs), semantic codes and so on. 1013 is a morphological analysis module, which identifies the words making up a document. For Japanese it performs word segmentation plus part-of-speech tagging; for English it performs processing such as reduction to base forms. The particular methods are not specified here. For both languages a variety of systems, commercial and research, are publicly available. Morphological analysis breaks a sentence such as 幽門部に粘膜の不整を認める ("irregularity of the mucosa is observed in the pyloric region") into the sequence of words that make it up, 幽門部/に/粘膜/の/不整/を/認める. 1014 is a dependency analysis module, which analyzes the dependency relations between the words in a sentence. The sequence 幽門部/に/粘膜/の/不整/を/認める is analyzed as [幽門部に⇒認める], [不整を⇒認める], [粘膜の⇒不整]. Together these implement step (1) described under Means for Solving the Problems. The analysis results of 1013 and 1014 are recorded in the shared data recording area 1026 and used by 1015, 1016 and 1017.

[0014] 1015 is a module that extracts "case expressions" from the output of 1014. Focusing on the sentence-final verb, it generates expressions such as {認める, 幽門部(ニ), 不整(ヲ)}, that is, the verb 認める ("observe") with 幽門部 ("pyloric region") in the ni-case and 不整 ("irregularity") in the wo-case. Here the wo-case filler of 認める is 粘膜の不整 ("irregularity of the mucosa"), but to retain the more general and essential part it is reduced to its head, 不整, and the modifier 粘膜の ("of the mucosa") is recorded in the shared data recording area 1027. 1016 takes the output of 1015 and generates the generalized case expressions {認める, 幽門部(ニ), 不整(ヲ)}, {認める, ＊(ニ), 不整(ヲ)}, {認める, 幽門部(ニ), ＊(ヲ)} and {認める, ＊(ニ), ＊(ヲ)}. All generalized case expressions are generated because it is not known in advance what level of detail is appropriate for a template describing the finding fields of records whose judgment field satisfies a specific condition. The strategy is therefore to consider expressions at every level of detail and to select, by computing a "frequency specificity" score, those whose occurrence is most skewed. What matters is that template input improves input efficiency, so there is no need to force coverage of every input: it suffices to assist input in the majority of cases and to leave what cannot be covered to free-text entry.
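As a rough illustration of what module 1015 does, the sketch below turns dependency triples into a case expression and sets aside the modifiers of the slot fillers. The triple format `(dependent, particle, head)`, the `case_slots` argument standing in for the case-slot information of dictionary 1012, and the English glosses used as tokens are all assumptions; a real system would take this input from a dependency parser.

```python
from typing import Dict, List, Sequence, Tuple

WILDCARD = "*"  # assumed wildcard symbol for omitted cases
Dependency = Tuple[str, str, str]  # (dependent word, particle, head word)


def extract_case_expression(final_verb: str, case_slots: Sequence[str],
                            deps: List[Dependency]):
    """Build (verb, ((case, head noun), ...)) and collect modifiers of the fillers."""
    fillers: Dict[str, str] = {}
    modifiers: Dict[str, List[str]] = {}
    for dependent, particle, head in deps:
        if head == final_verb and particle in case_slots:
            # The dependent fills one of the verb's case slots.
            fillers[particle] = dependent
        elif head != final_verb:
            # The dependent modifies some other word, e.g. a noun.
            modifiers.setdefault(head, []).append(f"{dependent} {particle}")
    # Omitted cases become wildcards.
    expression = (final_verb,
                  tuple((c, fillers.get(c, WILDCARD)) for c in case_slots))
    # Keep only the modifiers attached to nouns that fill a case slot.
    kept = {n: m for n, m in modifiers.items() if n in fillers.values()}
    return expression, kept


# Glossed version of the dependencies given for the example sentence.
deps = [
    ("pyloric region", "ni", "observe"),
    ("irregularity", "wo", "observe"),
    ("mucosa", "no", "irregularity"),
]
expression, modifiers = extract_case_expression("observe", ("ni", "wo"), deps)
```

Keeping the modifier ("mucosa no", i.e. "of the mucosa") separate from the head noun is what lets the later steps count the reduced pattern while still offering the modifier to the user afterwards.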

[0015] For example, {認める, ＊(ニ), ＊(ヲ)}, the verb 認める ("observe") with both case slots left as wildcards, appears in almost every finding field, so a template built on it would force the user to specify everything except the verb, making it less useful than a somewhat more detailed template. Likewise, a case pattern with every case filled, such as {認める, 幽門部(ニ), 不整(ヲ)} ("observe", "pyloric region", "irregularity"), is too detailed to serve as a typical pattern and is generally infrequent. Presenting low-frequency expressions as typical expressions in a template input menu is also undesirable, because many rarely used entries would be shown. What is suitable for a template is, in general, a generalized case expression of moderate generality whose specificity of occurrence lies between the two. For example, in the finding fields of records whose judgment field contains "gastric cancer", {認める, ＊(ニ), 不整(ヲ)} is such a generalized case expression of moderate generality. The frequency specificity of each generalized case expression is computed by the frequency specificity calculation module 1018 and recorded in 1026 together with the original case expression. Modules 1015 and 1016 implement step (2) described under Means for Solving the Problems.

[0016] The word and generalized case expression recording module 1017 uses the morphological analysis results recorded in 1026 and the generalized case expressions generated by 1016 to record how many times each word and each generalized case expression appears in which field of which record. In other words, it generates an index file. Module 1017 implements step (3) described under Means for Solving the Problems.
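One way to picture the index file is as an inverted index from each item (a word or a generalized case expression) to its occurrence count per record and field. The in-memory dictionary layout, the field names and the sample records below are assumptions for illustration; the patent does not describe the file format.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def build_index(records: Dict[int, Dict[str, List[Hashable]]]):
    """Map item -> {(record id, field name): occurrence count}."""
    index = defaultdict(lambda: defaultdict(int))
    for record_id, fields in records.items():
        for field_name, items in fields.items():
            for item in items:
                index[item][(record_id, field_name)] += 1
    return index


# A generalized case expression is indexed as an item alongside plain words.
pattern = ("observe", (("ni", "*"), ("wo", "irregularity")))

# Hypothetical records, already analyzed into items per field.
records = {
    1: {"finding": ["irregularity", "mucosa", "irregularity", pattern],
        "judgment": ["gastric cancer"]},
    2: {"finding": ["irregularity"],
        "judgment": ["esophageal cancer"]},
}
index = build_index(records)
```

Because counts are keyed by record and field, later steps can restrict frequency analysis to the finding fields of the records whose judgment field contains a given word.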

[0017] The frequency specificity calculation module 1018 determines whether a word or generalized case expression appears with distinctively high frequency in a set satisfying a specific condition. Several methods are conceivable. The simplest just returns the raw frequency, excluding expressions in which every case slot is filled. A somewhat more elaborate option is the probabilistic measure disclosed in "Method of selecting characteristic words using probability: P2000-354407". Let K be the frequency, in the whole document set, of the word or generalized case expression whose frequency specificity is to be measured; k its frequency in the set satisfying the specific condition; N the total frequency of words and generalized case expressions in the whole document set; and n the number of words in the set satisfying the specific condition. The weight of the item is then associated with p(N, K, n, k), "the probability that, when N balls include K marked balls and n balls are drawn at random, k or more of the drawn balls are marked" (see Equation 1), and the negated logarithm of this probability is taken as the frequency specificity of the word or generalized case expression. Unlike raw frequency, this measure is known to have the desirable properties of excluding high-frequency stop words such as する ("do") while not overrating low-frequency elements (Hisamitsu, T. and Niwa, Y. (2001) Topic-Word Selection by Combinatorial Probability, Proceedings of NLPRS2001, pp.289-296). p(N, K, n, k) is given by the following equation. Which measure is used is not specified here.
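The equation itself is not reproduced in this text. Read literally, the ball-drawing description is the upper tail of a hypergeometric distribution, so a reconstruction under that assumption (drawing without replacement, which the text implies but does not state) would be:

```latex
p(N, K, n, k) \;=\; \sum_{l=k}^{\min(n,\,K)}
  \frac{\dbinom{K}{l}\,\dbinom{N-K}{\,n-l\,}}{\dbinom{N}{n}},
\qquad
\text{frequency specificity} \;=\; -\log p(N, K, n, k).
```

As a check on the arithmetic, for N = 10, K = 3, n = 4, k = 2 the sum is (3·21 + 1·7)/210 = 70/210 = 1/3. The base of the logarithm is not given in the text; it only rescales the score and does not change the ranking.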

[0018]

[Equation 1]

The judgment field analysis module 1019 analyzes the judgment fields and, referring to the data of 1017 and using the specificity computed by 1018, extracts the characteristic words in the judgment fields. This may be fully automatic, or candidates may be presented to the user for selection. If the user has already listed the "words that express a judgment and for which template input of the associated findings is desired", the judgment field analysis module 1019 is not used and the list is recorded directly in the target word recording module 1020. This control is performed by the control module 1010. The target word recording module 1020 records the list of such words, whether obtained by 1019 or specified by the user. When the user does not supply the list manually, 1019 and 1020 implement step (4) described under Means for Solving the Problems.
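The automatic variant of this step can be sketched by scoring each judgment-field word with the negated log of the ball-drawing probability and keeping the top-ranked words. This assumes the hypergeometric reading of p(N, K, n, k), natural logarithms, and word lists that have already been produced by morphological analysis; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import List


def tail_probability(N: int, K: int, n: int, k: int) -> float:
    """P(at least k marked balls among n drawn from N balls, K of them marked)."""
    hits = sum(math.comb(K, l) * math.comb(N - K, n - l)
               for l in range(k, min(n, K) + 1))
    return hits / math.comb(N, n)


def frequency_specificity(N: int, K: int, n: int, k: int) -> float:
    """Negated natural log of the tail probability (log base is an assumption)."""
    return -math.log(tail_probability(N, K, n, k))


def characteristic_words(judgment_words: List[str], all_words: List[str],
                         top: int) -> List[str]:
    """Rank judgment-field words by specificity against the whole DB word list.

    all_words must include the judgment-field words themselves.
    """
    total = Counter(all_words)
    inside = Counter(judgment_words)
    N, n = sum(total.values()), sum(inside.values())
    scored = [(frequency_specificity(N, total[w], n, c), w)
              for w, c in inside.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in scored[:top]]


# Toy data: "do" is frequent everywhere, "cancer" only in judgment fields.
judgment = ["cancer", "cancer", "cancer", "do"]
finding = ["do", "do", "do", "do", "observe", "irregularity"]
targets = characteristic_words(judgment, judgment + finding, top=1)
```

In the toy data "do" is the most frequent word overall yet scores close to zero, while "cancer" is concentrated in the judgment fields and ranks first, which is the behavior the measure is meant to provide.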

[0019] The typical description pattern extraction module 1021 takes each word recorded in 1020 and extracts the records whose judgment field contains it. From the finding fields of those records, referring to the data of 1017 and using the frequency specificity computed by 1018, it extracts the generalized case expressions that occur with distinctively high frequency as typical description patterns, and records them in the shared data recording area 1026 together with the word from the target word recording module 1020 that gave rise to them. Module 1021 implements step (5) described under Means for Solving the Problems.

[0020] The typical description pattern presentation module 1022 presents the generalized case expressions recorded in the shared data recording area 1026 to the user, has the user select any number of them, and records the selected ones in the shared data recording area 1026. The method of presenting generalized case expressions to the user is not specified; one possibility is to show them, for each "word that expresses a judgment and for which template input of the associated findings is desired", in order of the verbs with the highest frequency specificity. Module 1022 implements step (6) described under Means for Solving the Problems.

[0021] The noun phrase presentation module 1023 presents the generalized case expressions recorded in the shared data recording area 1026 to the user, has the user select any number of them, and records the selected ones in the shared data recording area 1026. The method of presenting generalized case expressions to the user is not specified. Module 1023 implements step (7) described under Means for Solving the Problems.

[0022] For each generalized case expression selected by the user and recorded in the shared data recording area 1026, the noun phrase presentation module 1023 presents the nouns that had been replaced by wildcards, has the user select any number of them, and records the selected ones in the shared data recording area 1026. The method of presenting noun phrases to the user is not specified. Module 1023 implements step (7) described under Means for Solving the Problems.

[0023] For each generalized case expression selected by the user and each noun (or nouns) in its case slots, as recorded in the shared data recording area 1026, the modifier presentation module 1024 extracts the words and phrases that modify the noun from the recorded output of 1014, lists them, has the user select any number of them, and records the selected ones in the template recording DB 1027. Module 1024 implements step (8) described under Means for Solving the Problems.

[0024] FIG. 2 shows an example of multiple records whose judgment fields contain "gastric cancer", "esophageal cancer", "colorectal cancer" and so on. Reference numeral 201 indicates one record, the basic unit.

【００２５】FIG. 3 shows the appearance frequency and appearance-field information of the words and generalized case information extracted from the records shown in FIG. 2.
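The kind of table described for FIG. 3 could be built roughly as follows. This is a sketch under assumptions: each record is taken to be a dict mapping a field name to the list of items (words or generalized case expressions) already extracted from that field, and each item is counted once per record and field. Neither the record layout nor the counting rule is stated in this section.

```python
from collections import Counter


def count_items(records):
    """Count, per (item, field) pair, the number of records containing it."""
    counts = Counter()
    for record in records:
        for field, items in record.items():
            # Assumption: an item repeated within one field counts once.
            for item in set(items):
                counts[(item, field)] += 1
    return counts


# Two illustrative records with a finding field and a judgment field.
example_records = [
    {"finding": ["irregularity", "anterior wall"], "judgment": ["stomach cancer"]},
    {"finding": ["irregularity", "irregularity"], "judgment": ["stomach cancer"]},
]
example_counts = count_items(example_records)
```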

【００２６】FIG. 4 shows how, using the information shown in FIG. 3, the typical description patterns that form the basis of a template for entering "it is stomach cancer" in the judgment field are presented and the user is asked to select among them. Reference numeral 401 is the presentation window, and 402 is the selection subwindow.

【００２７】FIG. 5 shows how, assuming that "〜に (〜な)不整を認める" ("an irregularity (of some kind) is observed in 〜") was selected in FIG. 4, candidate nouns for the ni-case (に) slot are presented and the user is asked to select among them. Reference numeral 501 is the presentation window, and 502 is the selection subwindow.

【００２８】FIG. 6 shows how, assuming that "前壁に" ("in the anterior wall") was selected in FIG. 5, candidate modifier phrases for the template menu are presented from among the phrases that modify "不整" ("irregularity"), and the user is asked to select among them. Reference numeral 501 is the presentation window, and 502 is the selection subwindow. Here, 502 displays the modifier phrases that appear in all instances of "〜に (〜な)不整を認める", but the display can also be narrowed to those that appear in the pattern "前壁に (〜な)不整を認める" ("an irregularity (of some kind) is observed in the anterior wall").

【００２９】[0029]

【発明の効果】[Effect of the Invention] The knowledge extraction method described in the present invention makes it possible to generate templates without human cost, and with finer granularity than when a general data mining method is used.

[Brief description of drawings]

【図１】FIG. 1 is a system configuration diagram for realizing the knowledge extraction method of the present invention.

【図２】FIG. 2 is an example of a plurality of records.

【図３】FIG. 3 shows information on the words and generalized case information extracted from the records shown in FIG. 2.

【図４】FIG. 4 shows how the typical description patterns generated using the information shown in FIG. 3 are presented and the user is asked to select among them.

【図５】FIG. 5 shows how, for the typical description pattern selected in FIG. 4, candidates for the noun phrase that had been replaced by a wildcard are presented and the user is asked to select among them.

【図６】FIG. 6 shows how, for the template refined by selecting a noun phrase in FIG. 5, candidates for the menu of phrases modifying the noun phrase are presented and the user is asked to select among them.

[Explanation of symbols]

101: Storage device
1011: Document data DB
1012: Dictionary
1013: Morphological analysis module
1014: Dependency analysis module
1015: Case expression extraction module
1016: Generalized case expression generation module
1017: Word / generalized case expression recording module
1018: Frequent-occurrence specificity calculation module
1019: Judgment field analysis module
1020: Target word recording module
1021: Typical description pattern extraction module
1022: Typical description pattern presentation module
1023: Noun phrase presentation module
1024: Modifier presentation module
1025: Template record DB
1026: Shared data recording area
1027: Work area
201: One record, the unit of the DB
301: Words and generalized case information extracted from the records shown in 201
401: Typical description pattern presentation window
402: Typical description pattern selection subwindow
501: Presentation window for noun candidates to adopt in the template
502: Selection subwindow of 501
601: Presentation window for modifier phrase candidates to adopt in the template
602: Selection subwindow of 601

Continuation of front page
(72) Inventor: 藤尾正和 (Masakazu Fujio), c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo
(72) Inventor: 井桁嘉一 (Yoshikazu Iketa), c/o Hitachi Medical Corporation, 1-1-14 Uchikanda, Chiyoda-ku, Tokyo
F-terms (reference): 5B075 ND03 NK32 NR02 PP13 QM05 UU28; 5B091 AA15 CA02 CA05 CC04 CC16 DA06

Claims

[Claims]

1. A knowledge extraction method for a DB storing records, each containing a finding field that stores a set of sentences describing facts and a judgment field that stores a set of sentences stating some judgment about the described facts, characterized in that: records whose judgment field satisfies a specified condition are extracted; sentence description patterns that are specific and frequent, called typical description patterns, are extracted from the finding fields of the extracted set of records; the patterns are presented so that a user can select among them; and the selected typical description patterns are recorded in association with the condition on the judgment field that was used for record extraction.

2. The knowledge extraction method according to claim 1, wherein, for a verb phrase VP in a sentence appearing in the DB, the noun phrase NP_i corresponding to each case C_i among the cases {C_1, ..., C_VP(m)} that the VP takes is recognized; a set {VP, NP_1, ..., NP_VP(m)} of the recognized noun phrases and the VP, called a case expression, is generated; a set {VP, NP_1', ..., NP_VP(m)'}, called a generalized case expression, in which any number of the NP_i are replaced with a specific identical symbol called a wildcard, is generated; and the generalized case expression is used as a typical description pattern.

3. A knowledge extraction method characterized in that, when presenting generalized case expressions extracted as typical description patterns using the knowledge extraction method according to claim 2, the noun phrases appearing in the generalized case expressions that have the same verb phrase as an element are presented for each such verb phrase, and the user is allowed to select any of the noun phrases.

4. A knowledge extraction method characterized in that, when presenting generalized case expressions extracted as typical description patterns using the knowledge extraction method according to claim 2, phrases that modify the noun phrases appearing therein are extracted from the DB and presented, and the user is allowed to select any of the modifying phrases.
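As an informal illustration of the generalized case expression defined in claim 2 (not part of the claims), the sketch below enumerates every way of replacing any number of noun phrases with a wildcard. The `(verb, {case: noun})` representation, the `"*"` symbol, the inclusion of the zero-replacement case, and the function name `generalize` are assumptions made for this example.

```python
from itertools import combinations

WILDCARD = "*"  # hypothetical wildcard symbol


def generalize(verb, slots):
    """Return all generalized case expressions of one case expression.

    Every subset of the case slots (including the empty subset, by
    assumption) is replaced with the same wildcard symbol.
    """
    cases = list(slots)
    results = []
    for k in range(len(cases) + 1):
        for replaced in combinations(cases, k):
            new_slots = {c: (WILDCARD if c in replaced else slots[c]) for c in cases}
            results.append((verb, new_slots))
    return results


# A case expression with two slots yields 2**2 = 4 generalized forms.
example_generalized = generalize(
    "observe", {"ni": "anterior wall", "wo": "irregularity"}
)
```

A case expression with n case slots produces 2**n candidates in this sketch, which is why a frequency-based filter over the candidates is needed before anything is shown to the user.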