JP5772437B2 - Data structure extraction program and data structure extraction device

Info

Publication number: JP5772437B2
Application number: JP2011205729A
Authority: JP
Inventors: シャオジュン馬
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2011-09-21
Filing date: 2011-09-21
Publication date: 2015-09-02
Anticipated expiration: 2031-09-21
Also published as: JP2013069025A

Description

The present invention relates to a data structure extraction program and a data structure extraction device.

One conventional way to extract and output data structures is to apply a pattern extraction technique to a plurality of tree-structured data items, extract the frequently occurring tree structures as sample data structures, and calculate the similarity between the extracted sample data structures (see, for example, Non-Patent Documents 1 and 2). With these methods, data structures whose similarity is equal to or higher than a predetermined threshold can be output.

Non-Patent Document 1 discloses a method for comparing tree structures, each composed of a root node at the top, internal nodes branching from it, and leaf nodes at the bottom. The method compares the leaf node sequences of the two tree structures and calculates the minimum number of operations, such as insertion, deletion, and substitution of data, required to transform one leaf node sequence into the other (hereinafter, the "edit distance"). Tree structures whose edit distance is smaller than a predetermined value are regarded as similar.
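The edit distance between two leaf node sequences can be illustrated with the standard dynamic-programming (Levenshtein) recurrence. The following is a minimal sketch; the function name `edit_distance` and the leaf labels are hypothetical and are not taken from Non-Patent Document 1.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x by y, or match
        prev = curr
    return prev[-1]


# Hypothetical leaf node sequences of two trees (labels are illustrative only).
leaves_a = ["F1", "F2", "F3"]
leaves_b = ["F1", "F3", "F4"]
distance = edit_distance(leaves_a, leaves_b)
```

Two trees would then be treated as similar when `distance` falls below the predetermined value.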

Non-Patent Document 2 discloses a method that considers not only the leaf nodes but also the root nodes and internal nodes, and calculates the minimum number of operations, such as insertion, deletion, and substitution of data, required to transform one tree structure into the other (hereinafter, the "tree edit distance"). Although this calculation is more expensive than the method of Non-Patent Document 1, it takes nodes other than leaf nodes into account and therefore yields a more reliable measure of similarity between tree structures.

Non-Patent Document 1: Gusfield, Dan (1997). Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge, UK: Cambridge University Press. ISBN 0-521-58519-8.
Non-Patent Document 2: Bille, Philip. A survey on tree edit distance and related problems. Theoretical Computer Science, Volume 337, Issues 1-3, 9 June 2005. Elsevier Science Publishers Ltd., Essex, UK.

An object of the present invention is to provide a data structure extraction program and a data structure extraction device that, among a plurality of data structures that appear, use the frequencies of the common data and of the data other than the common data to determine which of the data structures are similar, and output them.

To achieve the above object, one aspect of the present invention provides the following data structure extraction program and data structure extraction device.

[1] A data structure extraction program that causes a computer to function as:
extraction means for extracting, from a plurality of supports, a plurality of types of patterns that appear at a frequency equal to or higher than a predetermined threshold;
setting means for setting the number of common supports as a second threshold when, among the sets of supports that have the patterns extracted by the extraction means using a first threshold, supports exist that are common to all of the sets of supports whose patterns are similar to one another, and none of those sets of supports is a set having no supports other than the common supports;
output determination means for determining, when, among the sets of supports that have the patterns extracted by the extraction means using the first threshold, supports exist that are common to all of the sets of supports whose patterns are similar to one another, and one of those sets of supports is a set having no supports other than the common supports, that the pattern of the set of supports having no supports other than the common supports is a pattern to be output; and
output means for outputting the pattern that the output determination means has determined to be a pattern to be output, and a pattern that the extraction means extracts, at a frequency equal to or higher than the second threshold, from the supports targeted by the setting means.

[2] A data structure extraction device comprising:
extraction means for extracting, from a plurality of supports, a plurality of types of patterns that appear at a frequency equal to or higher than a predetermined threshold;
setting means for setting the number of common supports as a second threshold when, among the sets of supports that have the patterns extracted by the extraction means using a first threshold, supports exist that are common to all of the sets of supports whose patterns are similar to one another, and none of those sets of supports is a set having no supports other than the common supports;
output determination means for determining, when, among the sets of supports that have the patterns extracted by the extraction means using the first threshold, supports exist that are common to all of the sets of supports whose patterns are similar to one another, and one of those sets of supports is a set having no supports other than the common supports, that the pattern of the set of supports having no supports other than the common supports is a pattern to be output; and
output means for outputting the pattern that the output determination means has determined to be a pattern to be output, and a pattern that the extraction means extracts, at a frequency equal to or higher than the second threshold, from the supports targeted by the setting means.

According to the invention of claim 1 or 2, data structures having similar characteristics can be extracted from a plurality of appearing data structures more accurately than when this configuration is not adopted.

FIG. 1 is a diagram showing an example of the configuration of a data structure extraction device according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing an example of the configuration of a pattern of DPC data.
FIG. 3 is a schematic diagram showing an example of the configuration of a pattern extracted from DPC data.
FIG. 4 is a flowchart showing the pattern extraction operation of the data structure extraction device.
FIG. 5 is a schematic diagram showing an example of the configuration of a support vector.
FIG. 6 is a diagram for explaining the support vector classification operation.
FIGS. 7(a) to 7(c) are schematic diagrams showing an example of the table mapped by the support vector classification means and of the output contents output by the output means 107.
FIGS. 8(a) to 8(c) are schematic diagrams showing an example of the table mapped by the support vector classification means and of the output contents output by the output means 107.
FIG. 9 is a flowchart showing an operation example of the data structure extraction device.
FIG. 10 is a flowchart showing an example of the pattern output operation of the data structure extraction device.
FIGS. 11(a) and 11(b) are schematic diagrams showing an example of patterns extracted by the data structure extraction device.

(Configuration of the data structure extraction device)
FIG. 1 is a diagram showing an example of the configuration of a data structure extraction device according to an embodiment of the present invention.

The data structure extraction device 1 includes a control unit 10, which is composed of a CPU or the like, controls each unit, and executes various programs, and a storage unit 11, which is composed of a storage medium such as an HDD (Hard Disk Drive) or flash memory and stores information. The device is used, for example, to analyze electronic data on patient clinical information and medical practice.

By executing a data structure extraction program 110 described later, the control unit 10 functions as pattern extraction means 100, support vector generation means 101, support vector classification means 102, common support calculation means 103, threshold setting means 104, pattern re-extraction means 105, output determination means 106, output means 107, and the like.

The pattern extraction means 100 is an example of the extraction means. From the plurality of data items to be compared (hereinafter, "supports") that are included in the DPC data 111, described later, and read from the storage unit 11 as the electronic data to be analyzed, it acquires a plurality of sample tree structures (hereinafter, "patterns"), each having a patient name as the root node, as an example of data structures. Here, a pattern is a common tree structure that a pattern extraction technique extracts because it appears in a plurality of tree-structured data items at a predetermined frequency (the first threshold).

Using a known pattern extraction technique, the pattern extraction means 100 extracts as a pattern the E files and F files, described later, that are common to a plurality of supports. Known techniques that can be used include, for example, sequential pattern mining algorithms such as PrefixSpan, BIDE, and CloSpan, as well as subtree mining.

The support vector generation means 101 vectorizes the set of supports for each pattern extracted by the pattern extraction means 100 and generates the support vectors described later.

The support vector classification means 102 calculates the correlation coefficients between the support vectors generated by the support vector generation means 101 and classifies the support vectors into a plurality of groups based on these correlation coefficients.

The common support calculation means 103 calculates the number of supports common to the patterns belonging to a group classified by the support vector classification means 102.

The threshold setting means 104 sets, as the second threshold, the number of supports in the set of common supports when a set of supports common to the patterns belonging to the same support vector group exists and each of the patterns also has supports other than the common supports.

The pattern re-extraction means 105 is another example of the extraction means. Based on the second threshold set by the threshold setting means 104, it extracts patterns that the pattern extraction means 100 did not extract. Instead of using the pattern re-extraction means 105, the pattern extraction means 100 may extract patterns based on the second threshold.

When a set of supports common to the patterns belonging to the same support vector group exists and one of the patterns has supports other than the common supports, the output determination means 106 determines that this pattern is not to be output.

The output means 107 removes, from the patterns extracted by the pattern extraction means 100, the patterns that the output determination means 106 has determined not to output, adds the patterns extracted by the pattern re-extraction means 105, and outputs the result as the pattern extraction result.

The storage unit 11 stores the data structure extraction program 110, which causes the control unit 10 to operate as the means described above, as well as DPC data 111, threshold information 112, and the like.

The DPC data 111 is an analyzable electronic data set of patient clinical information and medical practice in a nationally unified format. The patient clinical information includes, for example, basic patient information, disease names, surgical procedures, and various score and stage classifications. The medical practice information includes medical acts, pharmaceuticals, medical materials, dates performed, counts and quantities, clinical departments, wards, insurance types, and the like.

The DPC data 111 has, as its basic data, data called Form 1, E files, and F files. Form 1 contains the patient's clinical information, names of injuries and diseases, surgical procedures, adjuvant therapy, and the like. An E file contains information such as the date performed, count, clinical department, ward, and ordering physician. An F file contains the detailed contents of an E file, for example information on acts, drugs, materials, and quantities.

In the present embodiment, a support is a tree structure in which the patient is the root node, the date/time data and E files belonging to that patient are internal nodes, and the F files belonging to the E files are leaf nodes. Data structures that appear in the plurality of supports at a predetermined frequency or higher are acquired as patterns, the similarity between the acquired patterns is calculated, and sets of patterns are extracted based on the calculated similarity.
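One possible in-memory representation of such a support is sketched below. The class names `Support`, `DateNode`, and `EFile`, the helper `leaves`, and all field values are hypothetical illustrations; the embodiment does not prescribe a concrete representation.

```python
from dataclasses import dataclass, field


@dataclass
class EFile:
    """Internal node: one E file and the F files (leaf nodes) that belong to it."""
    code: str
    f_files: list = field(default_factory=list)


@dataclass
class DateNode:
    """Internal node: date/time data and the E files that belong to it."""
    date: str
    e_files: list = field(default_factory=list)


@dataclass
class Support:
    """Root node: one patient and the date/time data that belong to the patient."""
    patient: str
    dates: list = field(default_factory=list)


def leaves(support):
    """Return the F files (leaf nodes) of a support in left-to-right order."""
    return [f for d in support.dates for e in d.e_files for f in e.f_files]


# Illustrative support with one date, two E files and three F files.
support = Support(
    "patient-A",
    [DateNode("2011-09-21", [EFile("21a", ["20a", "20b"]), EFile("21b", ["20c"])])],
)
```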

The threshold information 112 stores the predetermined first threshold used by the pattern extraction means 100 and the second threshold set by the threshold setting means 104.

FIG. 2 is a schematic diagram showing an example of the configuration of the supports in the DPC data 111 from which the pattern extraction means 100 acquires patterns.

The plurality of supports 200a, 200b, ... acquired from the DPC data 111 each have date/time data 22 belonging to a patient, E files 21 belonging to the date/time data 22, and F files belonging to the E files 21, and form a tree structure.

FIG. 3 is a schematic diagram showing an example of the configuration of a pattern extracted from the DPC data 111.

Patterns 2a and 2b extracted from the DPC data 111 each have date/time data 22 belonging to a patient, E files 21 belonging to the date/time data 22, and F files belonging to the E files 21, and form a tree structure.

(Operation of the data structure extraction device)
An operation example of the data structure extraction device is described below with reference to the drawings, divided into (1) the pattern extraction operation, (2) the support classification operation, and (3) the pattern output operation.

FIG. 9 is a flowchart showing an operation example of the data structure extraction device.

(1) Pattern extraction operation
First, the pattern extraction means 100 acquires, from the DPC data 111 in the storage unit 11, the plurality of supports from which patterns are to be extracted (S1).

FIG. 4 is a flowchart showing the pattern extraction operation of the data structure extraction device.

To simplify the description, the following explains a case in which six supports 200A to 200F, shown in simplified form in FIG. 4, have been acquired.

The pattern extraction means 100 extracts a pattern 2_1 containing E files 21a and 21b from supports 200A to 200D, and a pattern 2_2 containing E file 21c and F file 20c from supports 200B, 200E, and 200F.
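The extraction step above can be approximated with plain set containment. This is a deliberately simplified sketch, not the subtree or sequential pattern mining named in the embodiment: each support is flattened into a set of node labels, the candidate patterns are given rather than mined, and the threshold value 3 is an assumption. Node contents not stated in the text are also assumed.

```python
# Supports flattened to sets of node labels (contents are illustrative assumptions).
supports = {
    "200A": {"21a", "21b"},
    "200B": {"21a", "21b", "21c", "20c"},
    "200C": {"21a", "21b"},
    "200D": {"21a", "21b"},
    "200E": {"21c", "20c"},
    "200F": {"21c", "20c"},
}

# Candidate patterns; a real miner would enumerate these itself.
candidates = {"2_1": {"21a", "21b"}, "2_2": {"21c", "20c"}}


def supporting(pattern_nodes, supports):
    """Names of the supports that contain every node of the pattern."""
    return sorted(name for name, nodes in supports.items() if pattern_nodes <= nodes)


def extract(candidates, supports, threshold):
    """Keep the candidates whose number of supporting supports reaches the threshold."""
    result = {}
    for pattern_id, nodes in candidates.items():
        members = supporting(nodes, supports)
        if len(members) >= threshold:
            result[pattern_id] = members
    return result


extracted = extract(candidates, supports, threshold=3)
```

With these inputs, pattern 2_1 is supported by 200A to 200D and pattern 2_2 by 200B, 200E, and 200F, matching the description.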

(2) Support classification operation
Next, the support vector generation means 101 vectorizes the set of supports for each of the patterns 2_1 and 2_2 extracted by the pattern extraction means 100 and generates support vectors (S4).

FIG. 5 is a schematic diagram showing an example of the configuration of a support vector.

The support vector SV1 is the support vector for pattern 2_1; its components are listed in the order of supports 200A, 200B, and so on. For example, support 200A contains pattern 2_1, so its vector component is "1", and support 200F does not contain pattern 2_1, so its component is "0". The support vector SV2 is written in the same way.

Next, the support vector classification means 102 calculates the inner product as the correlation coefficient between the support vectors (S4). The inner product of the support vectors SV1 and SV2 is 1/√6.
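The construction of the support vectors can be sketched as follows. The helper names `support_vector` and `inner_product` are hypothetical. The normalization behind the stated value 1/√6 is not spelled out in the text, so the sketch reports only the unnormalized inner product, which simply counts the supports shared by the two patterns.

```python
support_ids = ["200A", "200B", "200C", "200D", "200E", "200F"]

# Supports that contain each pattern, as described for patterns 2_1 and 2_2.
pattern_supports = {
    "2_1": {"200A", "200B", "200C", "200D"},
    "2_2": {"200B", "200E", "200F"},
}


def support_vector(pattern_id):
    """One component per support: 1 if the support contains the pattern, else 0."""
    members = pattern_supports[pattern_id]
    return [1 if s in members else 0 for s in support_ids]


def inner_product(u, v):
    """Unnormalized inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


sv1 = support_vector("2_1")       # components for 200A ... 200F
sv2 = support_vector("2_2")
shared = inner_product(sv1, sv2)  # number of supports containing both patterns
```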

Next, the support vector classification means 102 clusters (classifies) the support vectors as described below (S5). In the example described below, it is assumed that seven patterns have been extracted in the preceding steps, that support vectors SV1 to SV7 have been generated for the seven patterns respectively, and that the inner products between the support vectors SV1 to SV7 have been calculated.

FIG. 6 is a diagram for explaining the support vector classification operation.

As shown in FIG. 6, the support vector classification means 102 generates a matrix of the inner products. A larger inner product indicates that the patterns corresponding to the support vectors are more similar. The support vector classification means 102 therefore clusters the support vectors according to the inner product values, classifying them into cluster 1, consisting of support vectors SV1 to SV5, and cluster 2, consisting of support vectors SV6 and SV7.
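The text does not name a clustering algorithm, so the following sketch swaps in a simple one as an assumption: two support vectors are linked when their inner product reaches a cut-off, and each connected group forms a cluster. The matrix values (0.9 and 0.1) and the cut-off 0.5 are made up for illustration and are not the values of FIG. 6.

```python
def cluster(similarity, threshold):
    """Group indices into connected components of the graph whose edges join
    pairs with similarity >= threshold."""
    n = len(similarity)
    labels = [None] * n
    clusters = []
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = len(clusters)
        group, stack = [], [start]
        while stack:
            i = stack.pop()
            group.append(i)
            for j in range(n):
                if labels[j] is None and similarity[i][j] >= threshold:
                    labels[j] = len(clusters)
                    stack.append(j)
        clusters.append(sorted(group))
    return clusters


# Hypothetical 7x7 inner-product matrix: high within SV1..SV5 and within SV6..SV7.
first = range(0, 5)
similarity = [
    [0.9 if (i in first) == (j in first) else 0.1 for j in range(7)]
    for i in range(7)
]

names = [f"SV{i + 1}" for i in range(7)]
clusters = [[names[i] for i in group] for group in cluster(similarity, threshold=0.5)]
```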

Next, the support vector classification means 102 maps the classified support groups, pattern by pattern, onto a table. The mapping results are shown below.

(3) Pattern output operation
FIG. 10 is a flowchart showing an example of the pattern output operation of the data structure extraction device. FIGS. 7(a) to 7(c) are schematic diagrams showing an example of the table mapped by the support vector classification means and of the output contents output by the output means 107.

First, the common support calculation means 103 calculates the number of common supports among the clustered support groups (S11). In the example mapped as shown in FIG. 7(a), the supports of "support group 1" can be drawn schematically as in FIG. 7(b); the number of supports common to patterns 1 to 5, here "8", is calculated as the number of common supports.

Next, when each pattern includes supports other than the common supports, as in "support group 1" (S12; Yes), the threshold setting means 104 stores the number of common supports in the threshold information 112 as the second threshold (S13).

Next, the pattern re-extraction means 105 re-extracts "pattern 8", which corresponds to the common supports of "support group 1", under the condition that the frequency is lower than the first threshold and equal to or higher than the second threshold (S14).
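Steps S11 to S13 can be sketched with set operations. The function names and the support identifiers (`s1` to `s8`, `extra1` to `extra5`) are hypothetical; only the count of 8 common supports and the five patterns of "support group 1" come from the description.

```python
def common_supports(group):
    """Supports shared by every pattern in the group (S11)."""
    return set.intersection(*group.values())


def second_threshold(group):
    """Return the number of common supports when every pattern also has supports
    other than the common ones (S12; Yes), otherwise None."""
    common = common_supports(group)
    if not common:
        return None
    if any(members == common for members in group.values()):
        return None  # some pattern has only the common supports (S12; No)
    return len(common)


# Hypothetical "support group 1": 8 common supports plus one extra per pattern.
common = {f"s{i}" for i in range(1, 9)}
group1 = {f"pattern {k}": common | {f"extra{k}"} for k in range(1, 6)}

threshold2 = second_threshold(group1)
```

Re-extraction (S14) would then rerun the extraction with `threshold2` in place of the first threshold.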

Next, the above operation is repeated for "support group 2" (S16). "Support group 2" is a case in which not every pattern includes supports other than the common supports (S12; No).

FIGS. 8(a) to 8(c) are schematic diagrams showing an example of the table mapped by the support vector classification means and of the output contents output by the output means 107.

In the example mapped as shown in FIG. 8(a), the supports of "support group 2" can be drawn schematically as in FIG. 8(b). "Pattern 6", which includes supports other than the common supports, can be judged less meaningful (less important) than "pattern 7", in which the common supports are dominant, so the output determination means 106 determines that "pattern 6" is not to be output (S15).
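The output determination for such a group can be sketched as follows. The function name `patterns_to_output` and the support identifiers are hypothetical; the rule implemented is the one described above, namely that when some pattern is supported only by the common supports, the patterns that also have other supports are withheld.

```python
def patterns_to_output(group):
    """Return the patterns of a group that should be output (S15)."""
    common = set.intersection(*group.values())
    only_common = [p for p, members in group.items() if members == common]
    if common and only_common:
        # Patterns with supports beyond the common ones are judged less important.
        return only_common
    return list(group)


# Hypothetical "support group 2": pattern 7 has only the common supports,
# pattern 6 has two additional ones.
group2 = {
    "pattern 6": {"s1", "s2", "s3", "s4"},
    "pattern 7": {"s1", "s2"},
}

to_output = patterns_to_output(group2)
```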

Next, for "support group 1", for which it was determined in step S12 that each pattern includes supports other than the common supports, the output means 107 outputs "pattern 1" to "pattern 5" extracted by the pattern extraction means 100, together with "pattern 8" re-extracted by the pattern re-extraction means 105, as the output contents 107a of "support group 1", as shown in FIG. 7(c) (S17).

For "support group 2", for which it was determined in step S12 that not every pattern includes supports other than the common supports, the output means 107 excludes "pattern 6", which the output determination means 106 determined not to output, and outputs "pattern 7" as the output contents 107b of "support group 2", as shown in FIG. 8(c) (S17).

[Example]
FIGS. 11(a) and 11(b) are schematic diagrams showing an example of patterns extracted by the data structure extraction device 1.

As shown in FIG. 11(a), consider a case in which "4" patient supports have only pattern 2_3, which contains E files 21g and 21h and F file 20g; "8" patient supports have pattern 2_4, which contains E files 21g and 21h and F files 20g and 20h; and "4" patient supports have only pattern 2_5, which contains E file 21g and F files 20g and 20h.

In this case, if the first threshold is "12", pattern 2_4 contains patterns 2_3 and 2_5, so the support count of pattern 2_3 is "12", the sum of the supports having only pattern 2_3 and those having pattern 2_4, and the support count of pattern 2_5 is likewise "12", the sum of the supports having only pattern 2_5 and those having pattern 2_4. The pattern extraction means 100 therefore extracts patterns 2_3 and 2_5 and does not extract pattern 2_4, whose support count is "8".

However, pattern 2_4 is a highly important pattern, with more patients (8) than have only pattern 2_3 (4) or only pattern 2_5 (4), so it is a pattern that should be extracted. Here, the threshold setting means 104 determines that common supports exist for patterns 2_3 and 2_5 extracted by the pattern extraction means 100 and sets the second threshold to "8", the number of common supports.

As a result, the pattern re-extraction means 105 extracts the highly important pattern 2_4, which the pattern extraction means 100 did not extract.
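The numbers of FIG. 11(a) can be checked with a short sketch. Patients are represented by integer identifiers, which is an assumption made for illustration: patients 0 to 3 have only pattern 2_3, patients 4 to 11 have pattern 2_4 (and therefore also 2_3 and 2_5), and patients 12 to 15 have only pattern 2_5.

```python
# Supports (patient ids) of each pattern in the FIG. 11(a) scenario.
supports = {
    "2_3": set(range(0, 12)),   # 4 patients with only 2_3 + 8 with 2_4
    "2_4": set(range(4, 12)),   # 8 patients
    "2_5": set(range(4, 16)),   # 8 patients with 2_4 + 4 with only 2_5
}
first_threshold = 12

# First pass: patterns whose support count reaches the first threshold.
extracted = {p: s for p, s in supports.items() if len(s) >= first_threshold}

# Common supports of the extracted patterns; every extracted pattern also has
# other supports, so their number becomes the second threshold.
common = set.intersection(*extracted.values())
has_only_common = any(s == common for s in extracted.values())
second_threshold = None if has_only_common else len(common)

# Second pass: patterns below the first threshold but at or above the second.
re_extracted = [] if second_threshold is None else [
    p for p, s in supports.items()
    if second_threshold <= len(s) < first_threshold
]
```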

Next, as shown in FIG. 11(b), consider a case in which "2" patient supports have only pattern 2_3, which contains E files 21g and 21h and F file 20g, and "10" patient supports have pattern 2_4, which contains E files 21g and 21h and F files 20g and 20h.

In this case, if the first threshold is "9", pattern 2_4 contains pattern 2_3, so the support count of pattern 2_3 is "12", the sum of the supports having only pattern 2_3 and those having pattern 2_4, and the support count of pattern 2_4 is "10". The pattern extraction means 100 therefore extracts patterns 2_3 and 2_4.

However, pattern 2_3 is a less important pattern, held on its own by fewer patients (2) than pattern 2_4 (10), so it is a pattern that should not be extracted. The output determination means 106 therefore determines that pattern 2_3 is a pattern that should not be output.
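The FIG. 11(b) case can be checked in the same way. As before, the integer patient identifiers are an assumption: patients 0 and 1 have only pattern 2_3, and patients 2 to 11 have pattern 2_4 (and therefore also 2_3).

```python
# Supports (patient ids) of each pattern in the FIG. 11(b) scenario.
supports = {
    "2_3": set(range(0, 12)),   # 2 patients with only 2_3 + 10 with 2_4
    "2_4": set(range(2, 12)),   # 10 patients
}
first_threshold = 9

# Both patterns reach the first threshold and are extracted.
extracted = {p: s for p, s in supports.items() if len(s) >= first_threshold}

# Pattern 2_4 is supported only by the common supports, so patterns that also
# have other supports (here 2_3) are withheld from the output.
common = set.intersection(*extracted.values())
only_common = [p for p, s in extracted.items() if s == common]
suppressed = [p for p in extracted if only_common and p not in only_common]
output = [p for p in extracted if p not in suppressed]
```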

[Other embodiments]
The present invention is not limited to the above embodiment, and various modifications are possible without departing from the spirit of the invention. For example, the present invention is not applicable only to the DPC data 111; it can be applied in the same way to any set of data having an arbitrary data structure, of which a tree structure is a typical example.

The data structure extraction program 110 used in the above embodiment may be read into the storage unit of the device from a storage medium such as a CD-ROM, or may be downloaded into the storage unit of the device from a server device or the like connected to a network such as the Internet. Some or all of the means 100 to 107 used in the above embodiment may be implemented in hardware such as an ASIC.

Description of symbols
1 Data structure extraction device
10 Control unit
11 Storage unit
20 F file
21 E file
22 Date/time data
100 Pattern extraction means
101 Support vector generation means
102 Support vector classification means
103 Common support calculation means
104 Threshold setting means
105 Pattern re-extraction means
106 Output determination means
107 Output means
110 Data structure extraction program
111 DPC data
112 Threshold information

Claims

A data structure extraction program that causes a computer to function as:
extraction means for extracting, from a plurality of supports, a plurality of types of patterns that appear at a frequency equal to or higher than a predetermined threshold;
setting means for setting the number of common supports as a second threshold when, among the sets of supports that have the patterns extracted by the extraction means using a first threshold, supports exist that are common to all of the sets of supports whose patterns are similar to one another, and none of those sets of supports is a set having no supports other than the common supports;
output determination means for determining, when, among the sets of supports that have the patterns extracted by the extraction means using the first threshold, supports exist that are common to all of the sets of supports whose patterns are similar to one another, and one of those sets of supports is a set having no supports other than the common supports, that the pattern of the set of supports having no supports other than the common supports is a pattern to be output; and
output means for outputting the pattern that the output determination means has determined to be a pattern to be output, and a pattern that the extraction means extracts, at a frequency equal to or higher than the second threshold, from the supports targeted by the setting means.

A data structure extraction device comprising:
extraction means for extracting, from a plurality of supports, a plurality of types of patterns that appear at a frequency equal to or higher than a predetermined threshold;
setting means for setting the number of common supports as a second threshold when, among the sets of supports that have the patterns extracted by the extraction means using a first threshold, supports exist that are common to all of the sets of supports whose patterns are similar to one another, and none of those sets of supports is a set having no supports other than the common supports;
output determination means for determining, when, among the sets of supports that have the patterns extracted by the extraction means using the first threshold, supports exist that are common to all of the sets of supports whose patterns are similar to one another, and one of those sets of supports is a set having no supports other than the common supports, that the pattern of the set of supports having no supports other than the common supports is a pattern to be output; and
output means for outputting the pattern that the output determination means has determined to be a pattern to be output, and a pattern that the extraction means extracts, at a frequency equal to or higher than the second threshold, from the supports targeted by the setting means.