JP6270216B2

JP6270216B2 - Clustering apparatus, method and program

Info

Publication number: JP6270216B2
Application number: JP2014195107A
Authority: JP
Inventors: 圭介小川; 真幸橋本; 一則松本
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2014-09-25
Filing date: 2014-09-25
Publication date: 2018-01-31
Anticipated expiration: 2034-09-25
Also published as: JP2016066269A

Description

The present invention relates to a clustering apparatus, method, and program capable of performing clustering suited to predicting the risk of a specific disease from medical data.

There are cases where it is desirable to cluster subjects on the basis of medical data. As represented by Patent Document 1 and Patent Document 2, health management systems and the like are becoming widespread. Such health management systems often give health advice to users; however, as shown in Patent Document 3, giving advice after classifying users on the basis of their actual health data is more likely to lead to behavior change and makes more effective advice possible.

Patent Document 1: JP 2013-085626 A
Patent Document 2: JP 2010-264088 A
Patent Document 3: JP 2010-170534 A

In recent years, high-accuracy classification techniques represented by latent topic analysis (in particular, latent Dirichlet allocation: LDA) have attracted attention. Using LDA, it is also possible to classify people with similar health factors and to extract high-risk people whose health is likely to deteriorate in the future.

In general, unsupervised learning groups together items with similar features in the obtained data. Topic classification, of which LDA is representative, performs this grouping by representing each data item as a mixture of multiple topics learned without supervision. When medical data is the target, each topic is expressed as a mixture of occurrence probabilities of words that describe health conditions.

In general, when medical data such as receipt data (itemized medical claim information) and health checkup data are categorized in order to predict the risk of a specific disease such as diabetes, unsupervised learning also handles unnecessary features (features unrelated to the specific disease), and accuracy therefore decreases.

FIG. 1 is a diagram showing an example that schematically illustrates this problem. It shows two results, R10 and R20, obtained by categorizing the same medical data. For the case where the disease risk of diabetes is to be predicted from the medical data, result R10 is an undesirable example in which unsupervised learning has surfaced unnecessary features, and result R20 is a desirable example suited to diabetes prediction.

That is, in result R10 the patients in the medical data have been categorized (that is, classified) into cluster C11 of healthy subjects, cluster C12 of fracture patients, cluster C13 of cold patients, and cluster C14 of diabetes patients. Because "fracture" and "cold" are essentially independent of "diabetes", the categorization of result R10 is unsuitable for predicting the risk of "diabetes". The appearance of the "unnecessary features" "fracture" and "cold" means that "diabetes" itself cannot be analyzed precisely.

Result R20, on the other hand, categorizes the patients in the medical data appropriately with respect to predicting the disease risk of "diabetes". They are categorized into cluster C21 of healthy subjects, cluster C22 of prediabetic group A, cluster C23 of prediabetic group B, and cluster C24 of diabetes patients. Result R20 is categorized entirely from the viewpoint of diabetes, and "unnecessary features" such as "fracture" and "cold" do not appear as they do in result R10, so the disease risk of diabetes can be examined appropriately.

In view of the above problem of the prior art, an object of the present invention is to provide a clustering apparatus, method, and program capable of performing clustering suited to predicting the risk of a specific disease.

To achieve the above object, the present invention is a clustering apparatus comprising: a weight calculation unit that accepts from a user a designation of a disease name to be analyzed and an onset period of that disease, analyzes medical data for weight calculation, which is a word list of the results of evaluating each subject's health at each of multiple ages, and calculates, for each word, a weight representing the degree of relevance to the disease developing after the onset period has elapsed; and a clustering unit that receives, for each of a series of subjects to be analyzed, a feature vector describing that subject's health condition in bag-of-words form over words related to health evaluation, applies the weight calculated for each word to the corresponding word frequency of the feature vector, and then clusters the series of subjects by latent topic analysis.

The present invention is also a clustering apparatus comprising: a weight calculation unit that accepts from a user a designation of a disease name to be analyzed and an onset period of that disease, analyzes medical data for weight calculation, which is a word list of the results of evaluating each subject's health at each of multiple ages, and calculates, for each word, a weight representing the degree of relevance to the disease developing after the onset period has elapsed; and a clustering unit that receives, for each of a series of subjects to be analyzed, a feature vector describing that subject's health condition in bag-of-words form over words related to health evaluation, applies latent topic analysis to matrix data D listing the feature vectors of the series of subjects so as to decompose the matrix data into the product of a subject-topic relation matrix θ and a topic-word relation matrix Φ, obtains as approximate values, without performing latent topic analysis on W*D, the modified subject-topic relation matrix θ' and the modified topic-word relation matrix Φ' that would be obtained by applying latent topic analysis to matrix data W*D, in which the weight W calculated for each word is applied to the matrix data D, and takes the modified relation matrix θ' thus approximated as the clustering result for the series of subjects. In obtaining the approximate values, the clustering unit decomposes the weight W into a product of a plurality of transformations, each determined to be an identity transformation plus a minute transformation, and applies the plurality of transformations successively as perturbations to the relation matrix θ and the relation matrix Φ, thereby obtaining the modified relation matrix θ' and the modified relation matrix Φ' as approximate values.

The present invention is also a clustering method comprising: a weight calculation step of accepting from a user a designation of a disease name to be analyzed and an onset period of that disease, analyzing medical data for weight calculation, which is a word list of the results of evaluating each subject's health at each of multiple ages, and calculating, for each word, a weight representing the degree of relevance to the disease developing after the onset period has elapsed; and a clustering step of receiving, for each of a series of subjects to be analyzed, a feature vector describing that subject's health condition in bag-of-words form over words related to health evaluation, applying the weight calculated for each word to the corresponding word frequency of the feature vector, and then clustering the series of subjects by latent topic analysis.

The present invention is also a clustering method comprising: a weight calculation step of accepting from a user a designation of a disease name to be analyzed and an onset period of that disease, analyzing medical data for weight calculation, which is a word list of the results of evaluating each subject's health at each of multiple ages, and calculating, for each word, a weight representing the degree of relevance to the disease developing after the onset period has elapsed; and a clustering step of receiving, for each of a series of subjects to be analyzed, a feature vector describing that subject's health condition in bag-of-words form over words related to health evaluation, applying latent topic analysis to matrix data D listing the feature vectors of the series of subjects so as to decompose the matrix data into the product of a subject-topic relation matrix θ and a topic-word relation matrix Φ, obtaining as approximate values, without performing latent topic analysis on W*D, the modified subject-topic relation matrix θ' and the modified topic-word relation matrix Φ' that would be obtained by applying latent topic analysis to matrix data W*D, in which the weight W calculated for each word is applied to the matrix data D, and taking the modified relation matrix θ' thus approximated as the clustering result for the series of subjects. In obtaining the approximate values, the clustering step decomposes the weight W into a product of a plurality of transformations, each determined to be an identity transformation plus a minute transformation, and applies the plurality of transformations successively as perturbations to the relation matrix θ and the relation matrix Φ, thereby obtaining the modified relation matrix θ' and the modified relation matrix Φ' as approximate values.

The present invention is further a program that causes a computer to function as the clustering apparatus.

According to the present invention, a weight representing the degree of relevance to the disease developing after the onset period is calculated for each word, and these word weights are taken into account when clustering the series of subjects to be analyzed, who are given in bag-of-words form. A clustering result suited to predicting the risk of a specific disease can therefore be obtained.

FIG. 1 is a diagram showing an example that schematically illustrates the problem that, when the risk of a specific disease is to be predicted, categorization by unsupervised learning handles irrelevant features and accuracy therefore decreases.
FIG. 2 is a functional block diagram of a clustering apparatus according to one embodiment.
FIG. 3 is a diagram showing a schematic example of the medical data read by the weight calculation unit.
FIG. 4 is a diagram showing an example in which a weight as a co-occurrence rate is calculated for each word.
FIG. 5 is a diagram for explaining the ages at which co-occurrence or non-co-occurrence is examined in the medical data.
FIG. 6 is a diagram for explaining clustering by latent topic analysis.
FIG. 7 is a diagram for explaining the correction calculation in the second embodiment.

FIG. 2 is a functional block diagram of a clustering apparatus according to one embodiment. The clustering apparatus 10 includes a weight calculation unit 11, an analysis target documenting unit 12, and a clustering unit 13. An overview of each unit follows.

The weight calculation unit 11 reads medical data covering all patients and ages for weight calculation, accepts from the user performing the data analysis a designation of the disease name to be analyzed and the onset period, obtains a weight for each word by the processing detailed later, and passes the weights to the clustering unit 13. Using this per-word weight information, the clustering unit 13 can extract with high accuracy the patients at risk of developing the designated disease after the designated onset period has elapsed (for example, one year later when an onset period of one year is designated), and the risk factors can be analyzed in detail by referring to the clustering result.

FIG. 3 is a diagram showing a schematic example of the medical data read by the weight calculation unit 11. In the example of FIG. 3, the medical data as a whole concerns patients A to F, and medical data is given for each patient at each age: for patient A at each age from 40 to 43, for patient B at each age from 42 to 45, for patient C at each age from 41 to 43, for patient D at each age from 50 to 53, for patient E at each age from 51 to 53, and for patient F at each age from 52 to 55.

Receipt data (itemized medical claim information) and/or health checkup data can be used as the concrete content of the medical data given for each patient at each age. Receipt data records, for each patient, the diagnosed disease names and symptoms as well as the prescribed drugs and their amounts, together with dates and times. Health checkup data records, for each patient, each checkup item such as blood pressure and body weight and its measured value, together with the date and time of the examination.

The medical data is therefore, for each patient at each age, a list of specific words reflecting that patient's health condition. Depending on the type of word, a numerical value is attached. For example, a word naming a drug prescribed in the receipt data is associated with the prescribed amount, and each blood test item in the health checkup data is associated with its measured value.

The medical data is also structured so that, when a patient developed a specific disease at a specific age, that fact can be found by consulting the data. In the example of FIG. 3, patient A developed hyperlipidemia at age 43, patient D developed fatty liver at age 51, and patient F developed diabetic nephropathy at age 52. So that such information can be obtained, the medical data preferably includes receipt data, which contains disease diagnosis information; however, receipt data need not be included as long as the onset of a specific disease at a specific age can be determined.
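The layout described above can be pictured as a nested mapping from patient to age to word list. The following is a minimal sketch of one possible in-memory representation, not a format specified in the text. The names `MedicalData`, `medical_data`, and `onset_age` are hypothetical, and every word other than the three onsets stated for FIG. 3 is made up for illustration.

```python
from typing import Dict, Optional

# patient -> age -> {word: attached numeric value, or None if no value is attached}
MedicalData = Dict[str, Dict[int, Dict[str, Optional[float]]]]

medical_data: MedicalData = {
    "A": {
        42: {"drug_a": 10.0, "blood_pressure": 135.0},  # hypothetical entries
        43: {"hyperlipidemia": None},                   # onset stated for FIG. 3
    },
    "D": {
        50: {"body_weight": 82.0},                      # hypothetical entry
        51: {"fatty liver": None},                      # onset stated for FIG. 3
    },
    "F": {
        52: {"diabetic nephropathy": None},             # onset stated for FIG. 3
    },
}


def onset_age(data: MedicalData, patient: str, disease: str) -> Optional[int]:
    """Return the earliest age at which `disease` appears as a word, or None."""
    ages = [age for age, words in data.get(patient, {}).items() if disease in words]
    return min(ages) if ages else None
```

With this layout, `onset_age(medical_data, "A", "hyperlipidemia")` looks up the onset by checking for the disease word at each age, which mirrors how the text later determines onset from the presence of a word.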

Concrete examples of the items recorded in the medical data are the entries in column C2 of FIG. 4, described later.

The analysis target documenting unit 12 reads the medical data of the series of patients whose health is to be analyzed, documents each patient's medical data, and passes the result to the clustering unit 13.

The medical data read by the analysis target documenting unit 12 is of the same kind as that read by the weight calculation unit 11 and includes receipt data and/or health checkup data. The unit may read data for patients other than those whose data was read by the weight calculation unit 11, or the analysis target may include some or all of the patients whose data was read by the weight calculation unit 11.

However, whereas the data read by the weight calculation unit 11 spans a series of ages for each patient, as illustrated in FIG. 3, the data read by the analysis target documenting unit 12 consists of only one age for each patient. In the example of FIG. 3, if the last entry shown for each of patients A to F is the medical data at that patient's current age (for example, in 2014), the analysis target documenting unit 12 can read the medical data of each patient at that current age (the medical data of A at age 43, B at age 45, ..., and F at age 55).
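Selecting the single-age record for each patient can be sketched as follows. This is an illustration only: it assumes that the "current age" is simply the highest age present for each patient, and the function name `latest_records` and the placeholder words `w1` to `w3` are hypothetical.

```python
def latest_records(data):
    """Keep, for each patient, only the word record at the highest age present."""
    return {
        patient: by_age[max(by_age)]
        for patient, by_age in data.items()
        if by_age  # skip patients with no records at all
    }


# Ages follow the FIG. 3 example for patients A and B; the words are placeholders.
history = {
    "A": {40: {"w1"}, 43: {"w2"}},
    "B": {42: {"w1"}, 45: {"w3"}},
}
current = latest_records(history)
```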

The analysis target documenting unit 12 documents this single-age medical data of each patient. The documenting corresponds to preprocessing that converts the data in advance into a form that the downstream clustering unit 13 can cluster by latent topic analysis.

Specifically, the analysis target documenting unit 12 documents the data in the well-known bag-of-words form, that is, as an m-dimensional vector of word frequencies indicating how many times each of m words i_1, i_2, ..., i_m describing health conditions is used. The words can be drawn from those appearing in the medical data.

That is, for receipt data, each word describing a recorded health condition, health-related item, or health-related matter, such as a diagnosed disease name or symptom or a prescribed drug, is used as is as the word for an element of the bag-of-words word frequency vector. Likewise for health checkup data, each word naming a checkup item is used as is as a word in the bag of words. When a recorded item consists of a phrase, a predetermined word extraction filter may be applied to extract and use only the words.

As for the frequency, when a numerical value is attached to a word, as with the drug amounts and blood test items mentioned above, the attached value can be converted into a frequency value according to a conversion rule set in advance for each word. Alternatively, regardless of whether a value is attached to a word, the data may be documented by setting the frequency to 1 if the word is present in the medical data and to 0 if it is not.
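The two ways of assigning frequencies described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `to_feature_vector`, the vocabulary, and the conversion rule for `drug_a` are all hypothetical, since the text does not give concrete conversion rules.

```python
def to_feature_vector(record, vocabulary, rules=None):
    """Turn one patient's word record into a bag-of-words frequency vector.

    record:     {word: attached numeric value or None}
    vocabulary: ordered list of the m words i_1 ... i_m
    rules:      optional {word: function mapping the attached value to a frequency}
    """
    rules = rules or {}
    vector = []
    for word in vocabulary:
        if word not in record:
            vector.append(0.0)  # word absent from the medical data
            continue
        value = record[word]
        rule = rules.get(word)
        if rule is not None and value is not None:
            vector.append(float(rule(value)))  # per-word conversion rule
        else:
            vector.append(1.0)  # presence only
    return vector


vocabulary = ["drug_a", "hba1c", "fracture"]
record = {"drug_a": 20.0, "hba1c": 6.8}
rules = {"drug_a": lambda amount: round(amount / 10)}  # hypothetical rule
vector = to_feature_vector(record, vocabulary, rules)
```

Here `drug_a` is converted by its rule, `hba1c` has no rule and falls back to presence, and `fracture` is absent, so different words can use different assignment methods within the same vector.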

For this documenting by the analysis target documenting unit 12, including the assignment of frequencies, the techniques of the applicant's Japanese Patent Application No. 2013-159323 (numerical data analysis apparatus and program), No. 2013-163207 (numerical data analysis apparatus and program), and No. 2013-217817 (numerical data documenting apparatus and program) may be used. Different ones of the frequency assignment methods described above may also be used for different words of the bag-of-words word frequency vector.

The clustering unit 13 obtains a clustering result for the patients by applying latent topic analysis such as LDA to the word frequency vector of each patient to be analyzed (which corresponds to that patient's feature vector) sent from the analysis target documenting unit 12.
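To make the clustering step concrete, the sketch below runs LDA by collapsed Gibbs sampling on small integer count vectors and assigns each patient to the topic with the largest proportion. It is an illustration, not the inference procedure of the text, which does not say how LDA is fitted. The function names, the hyperparameters `alpha` and `beta`, the iteration count, and the argmax assignment are all assumptions.

```python
import random


def lda_gibbs(docs, n_topics, n_words, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on integer count vectors.

    docs: one count vector of length n_words per patient.
    Returns theta, the per-patient topic proportions.
    """
    rng = random.Random(seed)
    # Expand each count vector into a list of word tokens.
    tokens = [[w for w, c in enumerate(doc) for _ in range(c)] for doc in docs]
    # Random initial topic for every token, plus the count tables.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in tokens]
    n_dk = [[0] * n_topics for _ in docs]            # patient-topic counts
    n_kw = [[0] * n_words for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                             # tokens per topic
    for d, doc in enumerate(tokens):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token, resample its topic, then add it back.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                probs = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=probs)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return [
        [(n_dk[d][t] + alpha) / (len(tokens[d]) + n_topics * alpha) for t in range(n_topics)]
        for d in range(len(docs))
    ]


def assign_clusters(theta):
    """Assign each patient to the topic with the largest proportion."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


docs = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]
theta = lda_gibbs(docs, n_topics=2, n_words=4)
clusters = assign_clusters(theta)
```

The sampler needs integer counts, so real-valued frequencies would have to be rounded or handled by a different inference method; that choice is outside what the text specifies.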

In this embodiment in particular, the feature vectors of the patients to be analyzed are not clustered as they are. Instead, the clustering unit 13 clusters them after multiplying in the per-word weights obtained by the weight calculation unit 11, which yields a clustering result specialized to the question of whether a specific disease may develop within a specific period.

That is, let V[A] = (f_1[A], f_2[A], ..., f_m[A]) be the feature vector obtained from the analysis target documenting unit 12 that gives, for a patient A, the frequency of each of the predetermined m words i_1, i_2, ..., i_m. The clustering unit 13 does not cluster V[A] as it is; it applies latent topic analysis such as LDA to V'[A] = (w_1*f_1[A], w_2*f_2[A], ..., w_m*f_m[A]), obtained by multiplying in the weights w_1, w_2, ..., w_m of the words i_1, i_2, ..., i_m obtained by the weight calculation unit 11.

As described later in the detailed explanation of the weight calculation unit 11, the weights w_1, w_2, ..., w_m of the words i_1, i_2, ..., i_m are obtained as values that increase the word frequency of words related to the specific disease developing within the specific period and decrease the word frequency of unrelated words. Clustering each patient A by the weighted feature vector V'[A] therefore yields a clustering result specialized to the question of whether the specific disease may develop within the specific period.
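The weighting of V[A] into V'[A] is an element-wise product of the weight vector and the frequency vector. A minimal sketch follows; the function name `apply_weights` and the example numbers are hypothetical.

```python
def apply_weights(frequencies, weights):
    """Return V'[A] = (w_1*f_1[A], ..., w_m*f_m[A]) for one patient."""
    if len(frequencies) != len(weights):
        raise ValueError("frequencies and weights must have the same length m")
    return [w * f for w, f in zip(weights, frequencies)]


v_a = [2.0, 1.0, 1.0]  # hypothetical frequencies f_1[A], f_2[A], f_3[A]
w = [0.5, 0.25, 0.0]   # hypothetical weights w_1, w_2, w_3
v_prime = apply_weights(v_a, w)
```

A weight of 0 removes a word from the clustering input entirely, while a larger weight keeps more of that word's frequency relative to the others.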

In the example of FIG. 1, using the weights w_1, w_2, ..., w_m decreases the frequency of words related to "fracture" and "cold", which appear in result R10 as factors unrelated to the "diabetes" under analysis, and at the same time increases the frequency of words related to "diabetes". By converting the feature vectors to be clustered in advance from the original data V[A] into V'[A], which strongly reflects the features related to "diabetes", a clustering result specialized to "diabetes", as shown in result R20, can be obtained.

Next, the weight calculation unit 11, which obtains the weights w_1, w_2, ..., w_m of the words i_1, i_2, ..., i_m as described above, is explained in detail.

As an example, consider the case where the user performing the data analysis indicates, as the designation input to the weight calculation unit 11, that the goal is to predict who will develop diabetes one year later. That is, the user designates "diabetes" as the disease name to be analyzed and "one year later" as the onset period.

In this case, the weights to be obtained are ones that preferentially retain, among the words in the medical data of each patient at each age given as input to the weight calculation unit 11 as illustrated in FIG. 2, the words related to cases in which "diabetes" developed "one year later".

Concretely, the weights are calculated by examining the "co-occurrence" relation between the medical data from one year earlier (at age n-1) and a diabetes diagnosis in the medical data one year later (at age n). That is, when drug name "a" is recorded in patient A's medical data from one year earlier (age n-1) and "diabetes" is recorded as a diagnosis in the medical data one year later (age n), a "co-occurrence" has occurred and the two are considered related. For the vector v[a] = (relevance, non-relevance) = (co-occurrence count, non-co-occurrence count) for drug name "a", 1 is added to the relevance element as patient A's contribution to the co-occurrence count. Similarly, when drug name "a" is recorded in patient B's medical data from one year earlier (age m-1) but "diabetes" is not recorded as a diagnosis in the medical data one year later (age m), 1 is added to the non-relevance element of v[a] as patient B's contribution to the non-co-occurrence count.

Here, "adding 1" to the relevance or non-relevance as a contribution means the following. The vector v[a] for drug name "a" is initialized to (relevance, non-relevance) = (0, 0), every patient is examined to determine whether that patient falls under relevance (co-occurrence) or non-relevance (non-co-occurrence), and the final value of v[a] is obtained by tallying the total number of patients falling under each. "Adding 1" refers to this tallying.

In this way, for every word i that forms an element of the feature vector obtained by the analysis target documenting unit 12 (drug name "a" served as an example of such a word i above), each patient in the medical data input to the weight calculation unit 11 is examined to determine whether the patient is a co-occurrence case ("diabetes" developed at age n and word i is present in the age n-1 data) or a non-co-occurrence case ("diabetes" did not develop at age n and word i is present in the age n-1 data), and the total numbers of co-occurring and non-co-occurring patients are obtained. Whether "diabetes" developed at age n can be determined by whether the word "diabetes" is present in the age n data. When the user designates a specific disease other than "diabetes", the determination is likewise made by the presence or absence of the word for the designated disease.

The medical data may also contain patients who developed "diabetes" at age n but whose age n-1 data does not contain word i, and patients who did not develop "diabetes" at age n and whose age n-1 data does not contain word i. Such patients are counted toward neither the co-occurrence total nor the non-co-occurrence total for word i.

From these totals, the vector for word i is formed as v[i] = (relevance[i], non-relevance[i]) = (total number of co-occurring patients, total number of non-co-occurring patients). From this vector v[i], the weight w_i for word i can be obtained by the following equation (1).

As is clear from equation (1), the weight w_i represents, for word i, the co-occurrence rate with developing the specific disease within the specific period (here, developing diabetes in one year).
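The counting rule above can be sketched in a few lines of Python. Equation (1) itself is not reproduced in this text, so the sketch assumes the co-occurrence rate has the form w_i = relevance[i] / (relevance[i] + non-relevance[i]). The record layout (`prev` for the age n-1 words, `curr` for the age n words) and the sample words are hypothetical.

```python
from collections import Counter

def cooccurrence_weights(patients, disease):
    """Tally relevance / non-relevance per word and derive weights.

    patients: list of dicts with 'prev' (words at age n-1) and
              'curr' (words at age n). This layout is an assumption.
    """
    relevance = Counter()      # co-occurrence counts per word
    non_relevance = Counter()  # non-co-occurrence counts per word
    for p in patients:
        onset = disease in p["curr"]
        # Only words present at age n-1 contribute; a patient whose
        # age n-1 data lacks a word adds nothing for that word.
        for word in p["prev"]:
            if onset:
                relevance[word] += 1
            else:
                non_relevance[word] += 1
    # Assumed form of equation (1): the co-occurrence rate.
    weights = {
        w: relevance[w] / (relevance[w] + non_relevance[w])
        for w in set(relevance) | set(non_relevance)
    }
    return relevance, non_relevance, weights

patients = [
    {"prev": {"agent A", "agent B"}, "curr": {"diabetes", "agent A"}},
    {"prev": {"agent A"}, "curr": {"agent A"}},
    {"prev": {"agent B"}, "curr": {"diabetes"}},
]
relevance, non_relevance, weights = cooccurrence_weights(patients, "diabetes")
```

With this toy data, "agent A" co-occurs once and fails to co-occur once, while "agent B" co-occurs twice.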

FIG. 4 shows, in table form, an actual example in which the weight, as a co-occurrence rate, was calculated for each word. Column C1 gives the row number used for reference in this description. Column C2 gives the entry in the medical data corresponding to each word i. Column C3 gives relevance[i] in equation (1), that is, the number of patients who developed "diabetes" one year later and whose medical data one year earlier contains word i (the co-occurrence count). Column C4 gives non-relevance[i] in equation (1), that is, the number of patients who did not develop "diabetes" one year later and whose medical data one year earlier contains word i (the non-co-occurrence count). Column C5 gives the co-occurrence rate (weight w_i) calculated by equation (1) from the relevance[i] and non-relevance[i] in columns C3 and C4. FIG. 4 is sorted in descending order of co-occurrence rate and shows only part of the medical data.

Column C2 also illustrates the kinds of entries recorded in each patient's medical data at each age. Examples of words i that are disease names indicating a diagnosis are "type 2 diabetic nephropathy, stage 1" in row 16, "esophageal hernia" in row 30, "type 2 diabetes with renal complications" in row 31, and "severe infection" in row 33. Besides "diabetes", names such as these can also be given to the weight calculation unit 11 as the specific disease; the analysis user simply specifies the disease name to be analyzed.

In column C2, rows 20 and 34 show words i relating to prescriptions and procedures, and rows 22 and 25 to 27 show words i relating to instruments used in procedures. The remaining rows show the name and amount of a drug used. For example, row 1 reads "Agent A 0.3 mg", indicating that 0.3 mg of "Agent A" was used. Only "Agent A" may be used as the word i.

When the weight calculation unit 11 analyzes all the input medical data to determine, for each word i and each patient, co-occurrence or non-co-occurrence with the specific disease developing within the specific period, there is freedom in choosing which ages in each patient's medical data to examine (the age n at which the onset of diabetes is checked, and the age n-1 one year earlier). Various embodiments are possible here. FIG. 5 shows an example of part of the medical data input to the weight calculation unit 11 and is used to explain the embodiments concerning which ages are examined. As above, the example assumes the goal is to predict who will develop diabetes one year later. Patients A to C in FIG. 5 are unrelated to patients A to C in the example of FIG. 3.

In FIG. 5, the top row shows, as a time axis, the calendar year in which each patient's medical data was acquired at each age. For patient A, medical data for ages 41 to 45 was acquired in the years 2010 to 2014. For patient B, medical data for ages 47 to 50 was acquired in the years 2011 to 2014. For patient C, medical data for ages 45 to 49 was acquired in the years 2010 to 2014. As indicated by the burst-shaped marks, patient A developed diabetes at age 45 (2014), patient B developed diabetes at age 48 (2012), and patient C has not developed diabetes.

In one embodiment of setting the target ages for the co-occurrence or non-co-occurrence determination, as enclosed by frame F1, the period n at which the onset of diabetes is checked and the period n-1 one year earlier, at which the co-occurrence or non-co-occurrence of each word i is examined, may be set by calendar year so that they are common to all examinees.

In the example of frame F1, each patient's medical data as of 2014 is first checked for diabetes, and then each word i in the medical data as of 2013, one year earlier, is examined. A word i in the data of a patient who has diabetes as of 2014 is counted as co-occurrence (that patient's contribution to word i is counted as co-occurrence). A word i in the data of a patient who does not have diabetes as of 2014 is counted as non-co-occurrence (that patient's contribution to word i is counted as non-co-occurrence).

For patients A to C in FIG. 5, the words i in the 2013 medical data of patients A and B are counted as co-occurrence, and the words i in the 2013 medical data of patient C are counted as non-co-occurrence. Patient B first developed diabetes in 2012; the patient is counted as co-occurrence on the premise that the condition has continued through 2014 and that this can be determined from the 2014 medical data.
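The calendar-aligned setting of frame F1 can be sketched as follows, assuming each patient's records are keyed by calendar year. The function name, the `lag` parameter, and the drug words are hypothetical; only the onset years of patients A to C follow FIG. 5.

```python
def classify_by_calendar_year(records, disease, year_n, lag=1):
    """Label each patient using the same calendar years for everyone.

    records: {patient: {year: set of words}} (assumed layout).
    Returns {patient: (label, words at year_n - lag)}.
    """
    result = {}
    for patient, by_year in records.items():
        # Patients lacking either year cannot be judged in this setting.
        if year_n not in by_year or (year_n - lag) not in by_year:
            continue
        onset = disease in by_year[year_n]
        label = "co-occurrence" if onset else "non-co-occurrence"
        result[patient] = (label, by_year[year_n - lag])
    return result

records = {
    "A": {2013: {"agent X"}, 2014: {"diabetes", "agent X"}},
    "B": {2012: {"diabetes"}, 2013: {"diabetes", "agent Y"},
          2014: {"diabetes", "agent Y"}},
    "C": {2013: {"agent X"}, 2014: {"agent X"}},
}
labels = classify_by_calendar_year(records, "diabetes", year_n=2014)
```

Patient B is labelled as co-occurrence because the word "diabetes" is still present in the 2014 record, matching the premise stated above.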

When common periods n and n-1 are examined for all patients by calendar year, the period n (and n-1) may be set by a predetermined rule, such as taking the present as period n, or the analysis user may specify period n together with the disease name and onset period given to the weight calculation unit 11.

In another embodiment of setting the target ages, the weight calculation unit 11 first examines, for each patient, the medical data over the entire age range for which data exists, to check whether diabetes has developed. For a patient with onset, the age of first onset is taken as age n, and all words i in the age n-1 medical data are counted as co-occurrence. Conversely, for a patient without onset, any age for which medical data exists may be taken as age n, and all words i in the age n-1 medical data are counted as non-co-occurrence.

In the example of FIG. 5, for patient A, as shown by range R1, n = 45 is set as the age of first onset of diabetes, and the words i in the medical data for n-1 = 44 are counted as co-occurrence. For patient B, as shown by range R2, n = 48 is set as the age of first onset, and the words i in the medical data for n-1 = 47 are counted as co-occurrence. For patient C, ages n and n-1 are set arbitrarily, for example as range R3 or R4, and the words i in the age n-1 medical data are counted as non-co-occurrence.
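A minimal sketch of the per-patient setting, assuming records keyed by age. The source leaves the choice of n arbitrary for patients without onset; taking the latest recorded age here is an assumption made only for illustration, as are the function names and drug words.

```python
def first_onset_age(by_age, disease):
    """Return the earliest age whose record contains the disease word."""
    ages = sorted(age for age, words in by_age.items() if disease in words)
    return ages[0] if ages else None

def words_before_onset(by_age, disease, lag=1):
    """Pick age n per patient and return (label, n, words at age n - lag)."""
    n = first_onset_age(by_age, disease)
    if n is not None:
        label = "co-occurrence"
    else:
        label = "non-co-occurrence"
        n = max(by_age)  # assumption: any recorded age would do
    return label, n, by_age.get(n - lag, set())

patient_a = {44: {"agent X"}, 45: {"diabetes"}}
patient_b = {47: {"agent Y"}, 48: {"diabetes"}, 49: {"diabetes"},
             50: {"diabetes"}}
patient_c = {47: {"agent X"}, 48: {"agent Z"}, 49: {"agent X"}}

result_a = words_before_onset(patient_a, "diabetes")
result_b = words_before_onset(patient_b, "diabetes")
result_c = words_before_onset(patient_c, "diabetes")
```

For patients A and B this reproduces ranges R1 (n = 45) and R2 (n = 48) from FIG. 5.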

The above has described how the weight calculation unit 11 calculates weights so that the clustering unit 13 can perform clustering specialized for whether diabetes, as an example of a specific disease, develops one year later. Given a specifying input from the analysis user, weights for clustering specialized for whether the specific disease develops two years later, three years later, or in general N years later can be calculated in the same way.

That is, for onset one year later, co-occurrence or non-co-occurrence was determined for each patient from the medical data for periods n and n-1, using whether the specific disease has developed at period n and the words i in the period n-1 medical data, and the weight of each word i was obtained by equation (1).

In exactly the same way, for onset N years later, co-occurrence or non-co-occurrence is determined for each patient from the medical data for periods n and n-N, using whether the specific disease has developed at period n and the words i in the period n-N medical data, and the weight of each word i is obtained by equation (1). The periods n and n-N can be set for each patient in the same manner as described with reference to FIG. 5.

Given a specifying input from the analysis user, the weight calculation unit 11 can also obtain weights for the clustering unit 13 to perform clustering specialized for whether the specific disease develops within N years. In this case, for each k (k = 1, 2, ..., N) from one year later to N years later, the weights w_1(k), w_2(k), ..., w_m(k) of the words i_1, i_2, ..., i_m for developing the specific disease k years later are calculated, and the average over the years is taken as in equation (2). This gives the weights w_1(within N years), w_2(within N years), ..., w_m(within N years) for clustering specialized for whether the specific disease develops within N years.
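The averaging described for equation (2) can be sketched as below, assuming the per-lag weights w_i(k) have already been computed and are held as one dictionary per k. Treating a word that is absent for some k as having weight 0 for that k is an assumption of this sketch, not something the text specifies.

```python
def within_n_years_weights(weights_by_lag):
    """Average per-word weights over the lags k = 1..N.

    weights_by_lag: list of {word: w_i(k)} dicts, one per k (assumed layout).
    """
    n_years = len(weights_by_lag)
    words = set().union(*weights_by_lag)
    return {
        # Assumption: a word missing for some k contributes 0 for that k.
        word: sum(w.get(word, 0.0) for w in weights_by_lag) / n_years
        for word in words
    }

weights_by_lag = [
    {"agent A": 0.2, "agent B": 0.6},  # k = 1
    {"agent A": 0.4, "agent B": 0.0},  # k = 2
]
averaged = within_n_years_weights(weights_by_lag)
```

The same helper could serve for the multi-disease case, with one dictionary per disease instead of one per lag.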

Given a specifying input from the analysis user, the weight calculation unit 11 can also obtain weights for the clustering unit 13 to perform clustering specialized for whether any one (at least one) of a plurality of specific diseases develops N years later. In this case, the weight for developing each individual disease N years later is obtained, and, much as in equation (2) above, the per-disease weights are averaged over the diseases.

Further, when the specific disease is one that is not completely cured and whose treatment continues over multiple years, the weight calculation unit 11 can, given a specifying input from the analysis user, obtain weights for the clustering unit 13 to perform clustering more specialized for the specific disease developing for the first time after the specific period (rather than having already developed).

For example, diabetes is not completely cured, so treatment continues over multiple years. Because particular antidiabetic drugs are then used continuously, the words representing those drugs naturally tend to co-occur in many patients who have already developed diabetes. If the method described above is applied as is, the relevance[i] of a word i representing such an antidiabetic drug becomes large, and so does the weight w_i.

However, giving a large weight to words that represent drugs used continuously after the onset of the specific disease lowers the clustering accuracy when the aim is to cluster, with high accuracy, patients who develop the specific disease for the first time and to analyze their risk factors.

The weights may therefore be calculated so as to eliminate the influence of words, such as antidiabetic drugs, that appear continuously after the onset of the specific disease. To eliminate this influence, the same-age relevance, relevance[i]_(same age), is calculated for each word i and subtracted from the weight calculated by the method described above. Specifically, this is done as follows.

As an illustrative example, suppose the user's analysis specification has two parts: first, that the specific disease develops one year later, and second, that this is the first onset of the specific disease. The method described so far considered only the first part: for each patient, it checked whether the specific disease had developed at age n and examined each word i at age n-1 to determine co-occurrence or non-co-occurrence. To take the second part into account as well, each patient is additionally checked for whether the specific disease (the same disease whose onset at age n was checked) had already developed at age n-1, and, for each word i, the number of patients who had already developed the specific disease at age n-1 (the number of prior-onset patients) is obtained as relevance[i]_(same age).

The weight w_i is then calculated by the following equation (3), which further subtracts relevance[i]_(same age) in equation (1) above. When equation (3) is used, relevance[i] in the underlying equation (1) is calculated under the embodiment that does not take into account whether the specific disease was already present at age n-1.
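Equation (3) is not reproduced in this text, so the following is only one plausible reading: relevance[i]_(same age) is subtracted from the numerator of the co-occurrence rate assumed for equation (1). Both that form and the sample counts are assumptions made for illustration.

```python
def first_onset_weight(relevance, non_relevance, relevance_same_age):
    """One assumed form of equation (3) for a single word i.

    relevance:          patients with word i at n-1 and the disease at n
    non_relevance:      patients with word i at n-1 and no disease at n
    relevance_same_age: patients who already had the disease at n-1
    """
    total = relevance + non_relevance
    if total == 0:
        return 0.0  # word never observed at n-1; no evidence either way
    # Assumption: prior-onset patients are removed from the numerator only.
    return (relevance - relevance_same_age) / total

# Hypothetical counts for an antidiabetic drug used mainly after onset.
plain_rate = 40 / (40 + 10)                  # assumed equation (1) value
corrected = first_onset_weight(40, 10, 38)   # assumed equation (3) value
```

Under this reading, a drug whose co-occurrences come almost entirely from patients who were already diabetic drops from a rate of 0.8 to 0.04.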

Whether the specific disease has already developed at age n-1 can be determined by the presence or absence of the word for the specific disease in the age n-1 medical data, in exactly the same way as the age n medical data was examined for that word. The onset of the specific disease at the other ages described below can be determined likewise.

The description above dealt with the specific disease developing for the first time one year later. To handle, by user specification, the specific disease developing for the first time within N years, proceed as follows. Check whether the specific disease has developed at age n, examine each word i at age n-N to obtain relevance[i] and non-relevance[i], take the total number of patients who had already developed the specific disease at age n-N as relevance[i]_(same age), and obtain the weight from equation (3) above.

Similarly, to handle, by user specification, the specific disease developing for the first time N years "later" rather than "within" N years, the total number of patients who had already developed the specific disease at any of ages n-N, n-N+1, n-N+2, ..., n-1 is obtained as relevance[i]_(same age), and the weight is obtained from equation (3) above.

Next, as another embodiment of the operation of the clustering apparatus 10, an embodiment (referred to as the second embodiment) that can reduce the amount of computation compared with the embodiment described so far (referred to as the first embodiment) is described.

In the first embodiment, the analysis user specifies to the weight calculation unit 11 the specific disease, the onset period, and any other conditions for which risk factors are to be analyzed with high accuracy, and per-word weights specialized for those conditions are calculated. The clustering unit 13 multiplies the documented medical data of the patients under analysis by the weights calculated from the user-specified conditions and then performs latent topic analysis such as LDA, obtaining a clustering result specialized for those conditions. In the first embodiment, therefore, the clustering unit 13 has to perform latent topic analysis such as LDA, which carries a substantial computational cost, separately for each user-specified condition. If there are many kinds of user-specified conditions, that costly computation has to be run that many times.

In the second embodiment, by contrast, the latent topic analysis such as LDA in the clustering unit 13 is performed only once, as a common computation that does not depend on the user-specified conditions (of which there can be many kinds), which reduces the amount of computation. The resulting clustering result is then corrected using the weights calculated for each user-specified condition to obtain a clustering result specialized for that condition. Because the correction requires only simple calculations, the second embodiment reduces the amount of computation, unlike the first embodiment, which applies LDA separately many times. Specifically, this is done as follows.

First, in the second embodiment, the clustering by latent topic analysis such as LDA, which the clustering unit 13 performs independently of the user-specified conditions, operates directly on the patients' feature vectors obtained from the analysis target documenting unit 12. That is, whereas the first embodiment clustered V'[A], the feature vector V[A] of patient A multiplied by the weights, the second embodiment clusters V[A] as it is.

FIG. 6 illustrates the result of clustering by latent topic analysis in the clustering unit 13 and what it means. FIG. 6 is described first, as a prerequisite for the later explanation of how the second embodiment corrects the clustering result using the weights.

In FIG. 6, D is the entire data to be analyzed, obtained from the analysis target documenting unit 12. It is a matrix whose rows are the feature vectors V[A] of the individual patients A. Each row of the matrix D is a "document", a bag-of-words representation of one patient's health state, and the elements along the column direction are the frequencies of the words i_1, i_2, ..., i_m.

Latent topic analysis such as LDA treats a document (here, the document corresponding to each patient) as a bag of words, that is, as words and their frequencies of occurrence, and estimates the topics of the document. For example, an "economy" topic would produce words such as "stock price" and "increased revenue and profit", and a "sports" topic would produce words such as "baseball" and "soccer".

This amounts to decomposing the observed bag-of-words representation, that is, the matrix D relating words i (columns) to documents u (rows), into the product of the matrix θ relating documents u (rows) to topics k (columns) and the matrix Φ relating topics k (rows) to words i (columns), as shown in FIG. 6. LDA estimates the topics k, and the computation in the clustering unit 13 yields the decomposition "D=θ×Φ" shown in FIG. 6.

Topic models typified by LDA thus assume that each document has its own topic proportions, and that each word is generated by first selecting a topic according to those proportions and then drawing the word at a rate specific to that topic.

The clustering result for each patient is expressed as the matrix θ relating documents (patients) u to topics k, but how it is interpreted in practice can take various forms.

For example, each of the K topics k_i may be taken to correspond to a cluster, with each patient belonging to the cluster of the topic whose value in the patient's topic proportions (k_1, k_2, ..., k_K) is largest. Alternatively, each of the K topics k_i may be taken to correspond to a cluster, with each patient split across the clusters in fractional head counts (decimals rather than integers such as 1 or 0) given by the topic proportions (k_1, k_2, ..., k_K).
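The two interpretations can be sketched as follows, assuming θ is available as a list of per-patient topic proportions. The function names and the sample proportions are hypothetical.

```python
def hard_assignments(theta):
    """Assign each patient to the topic with the largest proportion."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]

def soft_cluster_sizes(theta):
    """Sum topic proportions per cluster, giving fractional head counts."""
    n_topics = len(theta[0])
    return [sum(row[k] for row in theta) for k in range(n_topics)]

# Rows are patients, columns are topics; each row sums to 1.
theta = [
    [0.5, 0.25, 0.25],
    [0.125, 0.125, 0.75],
    [0.75, 0.25, 0.0],
]
clusters = hard_assignments(theta)
sizes = soft_cluster_sizes(theta)
```

In the soft reading the cluster sizes still add up to the number of patients, but each patient is shared among clusters.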

With the details of clustering by latent topic analysis shown in FIG. 6 as a premise, the use of the weights by the clustering unit 13 in the second embodiment is now described. The calculation of the per-word weights by the weight calculation unit 11 in response to the user's specification of a specific disease and so on is exactly the same as in the first embodiment.

The clustering unit 13 takes the decomposition "D=θ×Φ" of the entire data D, which is composed of the unweighted feature vectors V[A] of the patients A, and applies a simple correction calculation (described below) using the weights w_1, w_2, ..., w_m of the words i_1, i_2, ..., i_m obtained by the weight calculation unit 11 according to the user's specifying input. This corrects the matrix "θ", which represents the clustering result, to "θ'" and the matrix "Φ" to "Φ'", giving the final clustering result.

To explain the correction calculation, the decomposition "D=θ×Φ" of the entire data D is first written out element by element as in the following equation (4). The matrix D has n rows and m columns, where n is the number of patients and m is the number of words. The matrix θ has n rows and K columns, where n is again the number of patients and K is the number of topics. The matrix Φ has K rows and m columns, where K is again the number of topics and m is the number of words.

FIG. 7 lists the main stages of the correction calculation. First, as shown in column [10], equation (10) is identical to the element-wise equation (4) above and expresses the decomposition of the unweighted data D as "D=θ×Φ". As the arrow indicates, the strategy of the correction calculation is to start from equation (10) and reach, by repeated successive approximation, the decomposition "W*D=θ'×Φ'" of the weighted data W*D shown in equation (20). The calculations at each step of the successive approximation are shown in order in columns [1], [2], ..., [i], ..., [M], and the relation that finally links equation (10) and equation (20) in column [10] is shown in column [20].

In equation (20) and elsewhere, "*" denotes the Hadamard product. The weight matrix W is formed by taking the weights w_1, w_2, ..., w_m of the words i_1, i_2, ..., i_m obtained by the weight calculation unit 11 as a row vector of weights common to all patients and stacking it n times, once for each patient. Writing the right-hand side of equation (20) element by element, following the element-wise form of equation (4), gives the following equation (5).
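A small NumPy sketch of how the weight matrix W and the weighted data W*D could be formed, assuming the weights are given as a length-m sequence. The helper name and the numbers are hypothetical.

```python
import numpy as np

def weight_matrix(weights, n_patients):
    """Stack the per-word weight row vector once per patient (n x m)."""
    row = np.asarray(weights, dtype=float)
    return np.tile(row, (n_patients, 1))

# Two patients (rows) and three words (columns); counts are made up.
D = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0]])
W = weight_matrix([0.5, 1.0, 0.0], n_patients=D.shape[0])

# The Hadamard product is plain elementwise multiplication in NumPy.
weighted = W * D
```

Because every row of W is the same, `D * np.asarray(weights)` would give the same result through broadcasting; the explicit matrix is kept here only to mirror the notation of equation (20).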

The steps of the successive calculation are as follows; the calculations shown in columns [1], [2], ..., [i], ..., [M] are executed in this order. First, in column [1], equations (1-1) to (1-4) are calculated in this order.

Equation (1-1) is an approximate transformation of the starting point, equation (10). Let W₁ be a matrix of the same size (n rows and m columns) as the weight matrix W in equations (5) and (20), all of whose elements are sufficiently close to 1, and let "D₁=W₁*D" be its Hadamard product with equation (10). The approximate result of decomposing D₁ by latent topic analysis such as LDA (hereinafter "LDA etc.") is treated as a small change from "D=θ×Φ", for which the decomposition by LDA etc. has already been obtained: the "Φ" part is assumed unchanged and only the "θ" part is assumed to change to "θ₁", giving "D₁≈θ₁×Φ". The symbol "≈" means "approximately equal". In FIG. 7 this symbol is drawn as an ordinary "=" with a tilde "~" above it, with the same meaning.

Under the approximation of Equation (1-1), the approximate value "θ₁" is obtained, as shown in Equation (1-2), by multiplying "D₁" from the right by the inverse Φ^-1 of Φ, that is, "θ₁ = D₁×Φ^-1". Here the inverse Φ^-1 may be computed as the well-known pseudo-inverse (Moore-Penrose pseudo-inverse) of the matrix Φ.

Equations (1-3) and (1-4) then carry the results of Equations (1-1) and (1-2) forward and apply the same approximate calculation to "Φ". In Equation (1-3), let W₂ be a matrix of the same size as the weight matrix W in Equations (5) and (20) (n rows by m columns) whose elements are all sufficiently close to 1, and let "D₂ = W₂*D₁" be its Hadamard product with the matrix "D₁" defined in Equation (1-1). The result of decomposing D₂ by latent topic analysis such as LDA is approximated as a small change from the decomposition "D₁ ≈ θ₁×Φ" already obtained as an approximate value in Equations (1-1) and (1-2): the "θ₁" factor is left unchanged and only the "Φ" factor changes to "Φ₁", so that "D₂ ≈ θ₁×Φ₁".

Under the approximation of Equation (1-3), the approximate value "Φ₁" is obtained, as shown in Equation (1-4), by multiplying "D₂" from the left by the inverse θ₁^-1 of θ₁, that is, "Φ₁ = θ₁^-1×D₂". Here the inverse θ₁^-1 may be computed as the well-known pseudo-inverse (Moore-Penrose pseudo-inverse) of the matrix θ₁.

Thus, in the first-stage calculation shown in column [1], applying the approximation to the starting point, Equation (10) "D = θ×Φ", yields the approximate decomposition "D₂ ≈ θ₁×Φ₁" for the data matrix "D₂", a slightly changed version of the data matrix "D", without actually performing a matrix decomposition by LDA or the like. From the second stage onward, exactly the same approximate calculation as in the first stage is simply repeated. The calculation of an arbitrary i-th stage (i = 2, 3, …, M), shown in column [i], is therefore described next.

When the calculation of the (i-1)-th stage has been completed (that is, when the i-th stage begins), the approximate decomposition "D_{2(i-1)} ≈ θ_{i-1}×Φ_{i-1}" has been obtained. In Equation (i-1) of column [i], "θ_i", a slightly changed version of "θ_{i-1}", is introduced: "D_{2(i-1)}" is multiplied, as a Hadamard product, by an n-row, m-column matrix "W_{2i-1}" whose elements are all sufficiently close to 1, giving "D_{2i-1} = W_{2i-1}*D_{2(i-1)}", whose approximate decomposition is taken to be "D_{2i-1} ≈ θ_i×Φ_{i-1}". The specific value of the slightly changed "θ_i" is then obtained, as shown in Equation (i-2), by multiplying by the pseudo-inverse "Φ_{i-1}^-1" of "Φ_{i-1}", that is, "θ_i = D_{2i-1}×Φ_{i-1}^-1".

Similarly, in Equation (i-3), "Φ_i", a slightly changed version of the "Φ_{i-1}" of Equations (i-1) and (i-2), is introduced: "D_{2i-1}" is multiplied, as a Hadamard product, by an n-row, m-column matrix "W_{2i}" whose elements are all sufficiently close to 1, giving "D_{2i} = W_{2i}*D_{2i-1}", whose approximate decomposition is taken to be "D_{2i} ≈ θ_i×Φ_i". The specific value of the slightly changed "Φ_i" is then obtained, as shown in Equation (i-4), by multiplying by the pseudo-inverse "θ_i^-1" of "θ_i", that is, "Φ_i = θ_i^-1×D_{2i}".
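Because FIG. 7 is not reproduced in this text, the four relations of stage i described above may be written out as follows. This is a restatement of the prose rather than a copy of the figure; the symbol ∘ for the Hadamard product and the superscript + for the Moore-Penrose pseudo-inverse are notation assumed here.

```latex
\begin{align*}
D_{2i-1} &= W_{2i-1} \circ D_{2(i-1)} \approx \theta_i \, \Phi_{i-1} && \text{(i-1)} \\
\theta_i &= D_{2i-1} \, \Phi_{i-1}^{+} && \text{(i-2)} \\
D_{2i} &= W_{2i} \circ D_{2i-1} \approx \theta_i \, \Phi_i && \text{(i-3)} \\
\Phi_i &= \theta_i^{+} \, D_{2i} && \text{(i-4)}
\end{align*}
```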

The general i-th stage has been described above. The same description also applies to the first stage (i = 1) if the symbols are set as "D = D₀", "θ = θ₀" and "Φ = Φ₀".

Repeating the above approximate calculation for stages i = 1, 2, …, M yields the approximate decomposition "D_{2M} ≈ θ_M×Φ_M", shown as Equation (M-3) in column [M] of FIG. 7. The decomposition "W*D = θ'×Φ'" of Equation (20) that is ultimately sought is obtained from the result of these M stages as "θ' = θ_M" and "Φ' = Φ_M", shown as Equation (13) in column [20]. The justification is given in column [20] as Equations (11) and (12). Specifically, Equation (11) states the condition under which this result can be obtained, and Equation (12) is the basis on which the decomposition follows under the condition of Equation (11).

That is, as shown in Equation (11), the matrices W₁, W₂, W₃, W₄, …, W_{2i-1}, W_{2i}, …, W_{2M-1}, W_{2M}, each with all elements close to 1, that are used in stages i = 1, 2, …, M of the repeated approximate calculation are prepared so that the Hadamard product of all of them equals the weight matrix W in Equation (20). Substituting Equation (11) and Equations (1-3), (2-3), …, (i-3), …, (M-3) of columns [1], [2], …, [i], …, [M] into Equation (20) therefore yields Equation (12), so that "θ' = θ_M" and "Φ' = Φ_M", as shown in Equation (13), can be adopted as the approximate decomposition.
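The sequential calculation of columns [1] to [M] and the condition of Equation (11) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `sequential_perturbation`, the use of NumPy's `numpy.linalg.pinv` as the Moore-Penrose pseudo-inverse, and the random example data are all introduced here for illustration only.

```python
import numpy as np

def sequential_perturbation(D, phi, steps):
    """Apply W_1, W_2, ... in turn, alternately updating theta and phi.

    D     : n x m data matrix with a known decomposition D = theta x phi
    phi   : k x m topic-word matrix from that decomposition
    steps : non-empty list of n x m matrices whose Hadamard product equals W
    """
    if not steps:
        raise ValueError("at least one step matrix is required")
    D_k, theta = D, None
    for k, W_k in enumerate(steps, start=1):
        D_k = W_k * D_k  # Hadamard product with the next near-identity matrix
        if k % 2 == 1:
            # Odd step: hold phi fixed, update theta (Equations (i-1), (i-2)).
            theta = D_k @ np.linalg.pinv(phi)
        else:
            # Even step: hold theta fixed, update phi (Equations (i-3), (i-4)).
            phi = np.linalg.pinv(theta) @ D_k
    return theta, phi, D_k

# Hypothetical example: n = 6 subjects, k = 2 topics, m = 4 words.
rng = np.random.default_rng(0)
n, k, m = 6, 2, 4
theta0, phi0 = rng.random((n, k)), rng.random((k, m))
D = theta0 @ phi0
W = np.tile([1.2, 0.9, 1.1, 0.8], (n, 1))  # common row of word weights
M = 5
steps = [W ** (1.0 / (2 * M))] * (2 * M)    # Hadamard product of all steps is W
theta_M, phi_M, D_2M = sequential_perturbation(D, phi0, steps)
```

After the last step the accumulated matrix `D_2M` equals W*D exactly, which corresponds to the condition of Equation (11); `theta_M` and `phi_M` are the approximate factors adopted as θ' and Φ'. How close their product is to W*D depends on the data and on how near each step matrix is to 1, and the sketch makes no claim about that accuracy.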

The specific values of the matrices W₁, W₂, W₃, W₄, …, W_{2i-1}, W_{2i}, …, W_{2M-1}, W_{2M} used in the stages, whose elements all being close to 1 is what makes the approximate calculation possible, and the number of repetitions 2M may be set by any predetermined method, provided that Equation (11) is satisfied and every element of each matrix is judged to be close to 1 under a predetermined criterion.

For example, as shown in Equation (6) below, the matrices W₁, W₂, W₃, W₄, …, W_{2i-1}, W_{2i}, …, W_{2M-1}, W_{2M} can all be set equal to one another, each being the "2M-th root" of the weight matrix W.

As shown in Equation (6), the "2M-th root" of the weight matrix W is defined as the 2M-th root in the sense of the Hadamard product, that is, the matrix in which each element w_i (i = 1, 2, …, m) is replaced by its 2M-th root w_i^(1/2M). The number of repetitions 2M may be set to a number for which each 2M-th-root element w_i^(1/2M) (i = 1, 2, …, m) is judged to be sufficiently close to 1 under a predetermined criterion.
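One way to choose 2M under this definition can be sketched as follows. The function name `hadamard_root_steps`, the search for the smallest M, and the tolerance of 0.1 are assumptions made for illustration; the tolerance merely borrows the ±0.1 figure that the description uses as an example criterion for a minute transformation, and positive weights are assumed so that the root is well defined.

```python
import numpy as np

def hadamard_root_steps(W, tol=0.1):
    """Return 2M equal matrices W**(1/(2M)) whose Hadamard product is W.

    M is the smallest positive integer for which every element of the
    element-wise 2M-th root lies within `tol` of 1 (assumed criterion).
    """
    W = np.asarray(W, dtype=float)
    if np.any(W <= 0):
        raise ValueError("weights must be positive to take element-wise roots")
    M = 1
    while np.max(np.abs(W ** (1.0 / (2 * M)) - 1.0)) > tol:
        M += 1
    root = W ** (1.0 / (2 * M))
    return [root.copy() for _ in range(2 * M)], M

# Hypothetical weights for m = 3 words, repeated for n = 2 subjects.
W = np.tile([0.5, 2.0, 1.0], (2, 1))
steps, M = hadamard_root_steps(W)
```

Multiplying the returned matrices together element by element recovers W, so the list satisfies the condition of Equation (11).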

As described above, in the second embodiment the weight matrix W is decomposed into a product (Hadamard product) of a plurality of matrices W₁, W₂, W₃, W₄, …, W_{2i-1}, W_{2i}, …, W_{2M-1}, W_{2M}, each of which can be judged to be the identity transformation, obtained by setting every element to 1, plus a minute transformation. Starting from "θ" and "Φ" of the decomposition "D = θ×Φ" obtained initially without applying the weight W, each of the decomposed matrices is applied successively as a perturbation. In this way "θ'" and "Φ'" of the weighted decomposition "W*D = θ'×Φ'" are obtained as approximate values without actually performing a computationally expensive calculation such as LDA. The stages of the successive perturbation are as shown in columns [1] to [M] of FIG. 7.

The "minute transformation" added to the identity transformation introduces a small change, as a difference, relative to the identity transformation. For example, when the number of words is m = 3, the identity transformation is defined by the weight (1, 1, 1), whose elements are all 1; adding the minute transformation (+0.01, -0.02, +0.03) to it gives the weight (1.01, 0.98, 1.03). Each of the decomposed matrices W₁, W₂, W₃, W₄, …, W_{2i-1}, W_{2i}, …, W_{2M-1}, W_{2M} is built from such weights (in the same way as the weight matrix W, by stacking a common row vector such as (1.01, 0.98, 1.03) n times, n being the total number of patients), which makes the successive perturbation calculation possible. Whether a transformation is minute may be judged by whether each of its elements is close to 0 under a predetermined criterion. For example, if a transformation is judged to be minute when its values lie within ±0.1, the transformation (+0.01, -0.02, +0.03) of this example is a minute transformation.
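The example above can be checked with a short sketch. The function name `is_minute_transformation` and the choice of n = 4 subjects are hypothetical; the threshold of 0.1 and the vector (+0.01, -0.02, +0.03) are the values given in the example.

```python
import numpy as np

def is_minute_transformation(delta, threshold=0.1):
    """Judge a transformation minute if every element lies within +/- threshold of 0."""
    return bool(np.all(np.abs(np.asarray(delta, dtype=float)) <= threshold))

# m = 3 words: identity weights (1, 1, 1) plus the minute transformation.
delta = np.array([0.01, -0.02, 0.03])
weights = np.ones(3) + delta

# Stack the common row vector for n = 4 subjects (n is illustrative only).
W_step = np.tile(weights, (4, 1))
```

Here `weights` is (1.01, 0.98, 1.03), and `is_minute_transformation(delta)` evaluates to `True` under the ±0.1 criterion.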

Supplementary items (1) to (4) concerning the present invention are described below.

(1) It has been described that, for the medical data of the series of patients to be analyzed, the analysis target documenting unit 12 converts (documents) each patient's data into a feature vector relating to that patient's health state, in bag-of-words form over words related to health evaluation. If this conversion has already been applied to the medical data of the series of patients to be analyzed, the analysis target documenting unit 12 may be omitted.

(2) Those from whom medical data is acquired have been described as "patients". However, as long as similar data can be acquired, the present invention is applicable in exactly the same way to "subjects" in general, and not only to "patients" in the sense of persons treated by a doctor.

(3) The present invention can also be provided as a program that causes a computer to function as the clustering apparatus 10. The computer can be configured with well-known hardware such as a CPU (central processing unit), memory and various interfaces, and the CPU that reads and executes the program functions as each unit of the clustering apparatus 10.

(4) In the description of FIG. 7 for the second embodiment, the number of repetitions was taken to be the even number 2M. As is clear from the purport of that description, however, it may also be an odd number, 2M+1 or 2M-1. In the former case (2M+1), the matrices W₁, W₂, W₃, W₄, …, W_{2i-1}, W_{2i}, …, W_{2M-1}, W_{2M}, W_{2M+1}, all with elements close to 1, are used, and "θ' = θ_{M+1}" and "Φ' = Φ_M" are adopted as the approximate decomposition. In the latter case (2M-1), the matrices W₁, W₂, W₃, W₄, …, W_{2i-1}, W_{2i}, …, W_{2M-2}, W_{2M-1}, all with elements close to 1, are used, and "θ' = θ_M" and "Φ' = Φ_{M-1}" are adopted as the approximate decomposition. Likewise, in the description of FIG. 7 the approximate values were obtained in the order "θ" → "Φ", as in "θ₁" → "Φ₁" → "θ₂" → "Φ₂" → …, but they may instead be obtained in the order "Φ" → "θ". The order ("θ" → "Φ" or "Φ" → "θ") may also be determined separately for each stage i.

10 ... clustering apparatus, 11 ... weight calculation unit, 12 ... analysis target documenting unit, 13 ... clustering unit

Claims

A clustering apparatus comprising:
a weight calculation unit that accepts, from a user, a designation of a disease name to be analyzed and an onset period of the disease, analyzes medical data for weight calculation in which the results of health evaluations of each subject over a plurality of time periods are given as a word list for each time period, and calculates, for each word, as a weight, the degree of relevance to developing the disease after the onset period; and
a clustering unit that receives, for each of a series of subjects to be analyzed, a feature vector relating to the health state of the subject in bag-of-words form over words related to health evaluation, applies the weight calculated for each word to each word frequency of the feature vector, and then clusters the series of subjects to be analyzed by latent topic analysis.

A clustering apparatus comprising:
a weight calculation unit that accepts, from a user, a designation of a disease name to be analyzed and an onset period of the disease, analyzes medical data for weight calculation in which the results of health evaluations of each subject over a plurality of time periods are given as a word list for each time period, and calculates, for each word, as a weight, the degree of relevance to developing the disease after the onset period; and
a clustering unit that receives, for each of a series of subjects to be analyzed, a feature vector relating to the health state of the subject in bag-of-words form over words related to health evaluation,
decomposes matrix data D, which lists the feature vectors of the series of subjects to be analyzed, into the product of a subject-topic relation matrix θ and a topic-word relation matrix Φ by performing latent topic analysis on the matrix data D,
obtains, as approximate values and without performing latent topic analysis, the corrected subject-topic relation matrix θ' and the corrected topic-word relation matrix Φ' that would be obtained by performing latent topic analysis on the matrix data W*D, in which the weight W calculated for each word is applied to the matrix data D, and
takes the corrected relation matrix θ' obtained as the approximate value as the clustering result for the series of subjects to be analyzed,
wherein, in obtaining the approximate values, the clustering unit decomposes the weight W into a product of a plurality of transformations each judged to be the identity transformation plus a minute transformation, and applies the plurality of transformations successively as perturbations to the relation matrix θ and the relation matrix Φ, thereby obtaining the corrected relation matrix θ' and the corrected relation matrix Φ' as approximate values.

The clustering apparatus according to claim 1, wherein the weight calculation unit, in analyzing the medical data for weight calculation:
for a subject for whom the disease name is present in a specific time period, counts the subject as contributing to co-occurrence for each word present in the time period that precedes the specific time period by the onset period;
for a subject for whom the disease name is not present in a specific time period, counts the subject as contributing to non-co-occurrence for each word present in the time period that precedes the specific time period by the onset period; and
calculates the weight for each word as the ratio of the total counted co-occurrences to the total of the counted co-occurrences and non-co-occurrences.

The clustering apparatus according to claim 3, wherein the weight calculation unit, in analyzing the medical data for weight calculation, further:
for a subject for whom the disease name is present in the specific time period, counts the subject as contributing to co-occurrence for each word present in the time period that precedes the specific time period by the onset period and, if the disease name is already present for that subject in the time period that precedes the specific time period by the onset period, counts the subject as having already developed the disease; and
calculates the weight for each word as the ratio of the total counted co-occurrences, less the total counted as having already developed the disease, to the total of the counted co-occurrences and non-co-occurrences.

A clustering method executed by a computer, comprising:
a weight calculation step of accepting, from a user, a designation of a disease name to be analyzed and an onset period of the disease, analyzing medical data for weight calculation in which the results of health evaluations of each subject over a plurality of time periods are given as a word list for each time period, and calculating, for each word, as a weight, the degree of relevance to developing the disease after the onset period; and
a clustering step of receiving, for each of a series of subjects to be analyzed, a feature vector relating to the health state of the subject in bag-of-words form over words related to health evaluation, applying the weight calculated for each word to each word frequency of the feature vector, and then clustering the series of subjects to be analyzed by latent topic analysis.

A clustering method executed by a computer, comprising:
a weight calculation step of accepting, from a user, a designation of a disease name to be analyzed and an onset period of the disease, analyzing medical data for weight calculation in which the results of health evaluations of each subject over a plurality of time periods are given as a word list for each time period, and calculating, for each word, as a weight, the degree of relevance to developing the disease after the onset period; and
a clustering step of receiving, for each of a series of subjects to be analyzed, a feature vector relating to the health state of the subject in bag-of-words form over words related to health evaluation,
decomposing matrix data D, which lists the feature vectors of the series of subjects to be analyzed, into the product of a subject-topic relation matrix θ and a topic-word relation matrix Φ by performing latent topic analysis on the matrix data D,
obtaining, as approximate values and without performing latent topic analysis, the corrected subject-topic relation matrix θ' and the corrected topic-word relation matrix Φ' that would be obtained by performing latent topic analysis on the matrix data W*D, in which the weight W calculated for each word is applied to the matrix data D, and
taking the corrected relation matrix θ' obtained as the approximate value as the clustering result for the series of subjects to be analyzed,
wherein, in the clustering step, in obtaining the approximate values, the weight W is decomposed into a product of a plurality of transformations each judged to be the identity transformation plus a minute transformation, and the plurality of transformations are applied successively as perturbations to the relation matrix θ and the relation matrix Φ, whereby the corrected relation matrix θ' and the corrected relation matrix Φ' are obtained as approximate values.

A program for causing a computer to function as the clustering apparatus according to any one of claims 1 to 4.