JP2017027510A

JP2017027510A - Selection device

Info

Publication number: JP2017027510A
Application number: JP2015147855A
Authority: JP
Inventors: Keisuke Ogawa (小川 圭介); Masayuki Hashimoto (橋本 真幸); Kazunori Matsumoto (松本 一則)
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2015-07-27
Filing date: 2015-07-27
Publication date: 2017-02-02
Anticipated expiration: 2035-07-27
Also published as: JP6474132B2

Abstract

PROBLEM TO BE SOLVED: To provide a selection device that takes as input medical data that may contain missing values, noise, and the like, and automatically selects data usable as a model for predicting medical expenses and making other health-related predictions with high accuracy.

SOLUTION: A clustering unit 2 clusters health status data given in bag-of-words form. A feature vector generation unit 3 generates a feature vector for each individual data item from the topic ratios in the clustering result. A threshold determination unit 4 determines, for each element of the feature vector, a threshold at which the element is judged to correlate with a predetermined prediction target related to health status. A data distinction unit 5 determines that each individual data item having at least one feature vector element at or above its threshold belongs to a predictable group, which is output by the selection device 10, and that every other data item belongs to an unpredictable group.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a selection device, and in particular to a selection device that takes as input medical data that may contain missing values, noise, and the like, and that can select, without requiring manual extraction of predictive factors, data usable as a model for predicting medical expenses and making other health-related predictions with high accuracy.

As shown in Patent Documents 1 and 2 and elsewhere, there are cases where one wishes to predict medical expenses in order to provide health guidance or to alert the individuals concerned, with the aim of reducing medical expenses.

Patent documents: JP 2014-225176 A; JP 2014-225175 A

When predicting medical expenses in this way, insurance claim information (itemized statements of medical fees) is often used as the input data. However, the feature dimensionality of such statements is extremely large, which makes it difficult to apply general prediction methods such as regression as they are.

For this reason, disease names and the like are commonly extracted by hand and used as features. However, it cannot be ruled out that latent predictors of medical expenses exist in the data that was discarded by hand. That is, manual extraction can overlook predictors of medical expenses and thereby lower prediction accuracy. Moreover, even if nothing were overlooked, manual extraction entails the labor of analyzing an enormous number of claims by hand and deciding, for each of an enormous number of disease names and the like, whether it should be extracted.

In addition, a claim is receipt information generated when a person visits a medical institution, and whether a person in a given condition goes to a medical institution is left to that person's free will. Consequently, there will always be a group whose future medical expenses cannot be predicted from their current claims. In other words, the claim information used for prediction can itself be regarded as containing many missing values and much noise, so prediction accuracy is lowered regardless of whether any predictor of medical expenses has been overlooked.

In view of the above problems of the prior art, an object of the present invention is to provide a selection device that takes as input medical data that may contain missing values, noise, and the like, and that can select, without requiring manual extraction of predictive factors, data usable as a model for predicting medical expenses and making other health-related predictions with high accuracy.

To achieve the above object, the present invention is a selection device comprising: a clustering unit that treats health status data, given in bag-of-words form for a set of subjects over a set of ages, as a collection of individual data items, one per subject and age, and clusters them by latent topic analysis to obtain a clustering result; a feature vector generation unit that generates a feature vector for each individual data item from the topic ratios of that data item in the clustering result; a threshold determination unit that determines, for each element of the feature vector, a threshold at which the element is judged to correlate with a predetermined prediction target related to health status; and a data distinction unit that determines each individual data item having at least one feature vector element at or above its threshold to belong to a predictable group, and each data item having no such element to belong to an unpredictable group. The device outputs the individual data items determined by the data distinction unit to belong to the predictable group.

According to the present invention, even if the input health status data contains missing values, noise, and the like, the data is first clustered by latent topic analysis, and only those individual data items for which some element of the topic ratios is at or above a threshold judged to indicate predictive power are then extracted automatically and output as belonging to the predictable group.
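The selection rule stated above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `is_predictable`, the feature values, and the thresholds are all hypothetical.

```python
def is_predictable(feature_vector, thresholds):
    """Return True if at least one element is at or above its per-topic threshold."""
    return any(value >= th for value, th in zip(feature_vector, thresholds))


# Hypothetical feature vectors keyed by (subject, age); values are illustrative only.
features = {
    ("A", 40): [0.1, 3.2, 0.0],
    ("A", 41): [0.2, 0.4, 0.1],
    ("B", 52): [5.0, 0.0, 0.3],
}
thresholds = [4.0, 3.0, 2.0]  # one assumed threshold per topic

# Data items in the predictable group are what the device would output.
predictable = {k: v for k, v in features.items() if is_predictable(v, thresholds)}
unpredictable = {k: v for k, v in features.items() if k not in predictable}
```

A data item needs only one qualifying element to be selected, so the rule is an "any" test rather than an "all" test.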

FIG. 1 is a functional block diagram of a selection device according to one embodiment.
FIG. 2 shows a schematic example of the entire input medical data.
FIG. 3 shows the matrix decomposition obtained by clustering with latent topic analysis.
FIG. 4 shows the cross-tabulation table created and referenced by the threshold determination unit.
FIG. 5 shows, for each value taken when the threshold determination unit sweeps the threshold, the distribution of the explanatory variable and the objective variable in the first data group.
FIG. 6 shows a schematic example of separating the first and second data groups by setting a threshold, taking the data shown in [1] of FIG. 5 as the full data.
FIG. 7 shows a schematic example of the simple regression performed on the data in [2] of FIG. 5, and a schematic example of the region containing the data for which prediction hits and misses are counted.
FIG. 8 shows the cross-tabulation table created and referenced by the feature reduction unit.
FIG. 9 schematically illustrates the iterative processing.
FIG. 10 is a generalization of the cross-tabulation tables of FIGS. 4, 8, and others.
FIG. 11 is a table of probabilities under the dependent model corresponding to the cross-tabulation table of FIG. 10.
FIG. 12 is a table of probabilities under the independent model corresponding to the cross-tabulation table of FIG. 10.

FIG. 1 is a functional block diagram of a selection device according to one embodiment. The selection device 10 includes a documentation unit 1, a clustering unit 2, a feature vector generation unit 3, a threshold determination unit 4, a data distinction unit 5, a model generation unit 6, a feature reduction unit 7, and a prediction unit 8.

In FIG. 1, units 2 to 7, shown together as the repetition unit 20, perform iterative processing by passing data among themselves in the loop structure illustrated. The data generated by this iterative processing is described in detail later with reference to FIG. 9. The processing performed by each of units 2 to 7 is the same in every iteration: in the I-th iteration (I = 1, 2, …, N, where N is the number of loop iterations), each unit receives its I-th input from the preceding unit and passes its I-th output to the following unit.

The processing of each unit in FIG. 1 is described in detail below, with each pass of the loop referred to as the I-th iteration (I = 1, 2, …, N) using the counter variable I introduced above.

The documentation unit 1 reads the entire medical data that serves as input for clustering by the selection device 10 (specifically, by its clustering unit 2) and for prediction based on the clustering result. It generates documented medical data D(X, n) for each subject X at each age n in the data and outputs it to the clustering unit 2.

Documentation into the medical data D(X, n) means conversion into the well-known bag-of-words format, that is, a document vector whose elements are the frequencies (occurrence counts) of predetermined words. The data D(X, n) is thus a vector reflecting the health status of subject X at age n. This documentation is performed as preprocessing that enables clustering in the downstream clustering unit 2. The details are as follows.

First, the input medical data as a whole evaluates the health status of a set of subjects over a set of periods. Concretely, it may consist of, for example, results of health checkups conducted under a health insurance association or the like, results of physician interviews, insurance claims (itemized statements of medical fees), or a combination of these.

A predetermined set of m words i₁, i₂, …, i_m that represent health status and that appear, or are known to be able to appear, in the medical data is prepared in advance. By analyzing the text of subject X's medical data at age n, the documentation unit 1 can generate the vector D(X, n) representing health status as the frequency vector of the words i₁, i₂, …, i_m.

For example, if a word i_b corresponding to the name of a particular disease in interview data or the like appears in subject X's medical data at age n, the element of D(X, n) for i_b can be set to 1, and to 0 if it does not appear. A word i_b such as the name of a prescribed drug in claim data can likewise be set to 1 or 0 according to whether it appears. If the word i_b appears several times in the interview data or the like, the element may be set to the number of occurrences, or, as with the numerically evaluated items described below, to the value of a predetermined function applied to that number.

For items evaluated numerically, such as body weight and blood test results in health checkup data, a predetermined word is prepared for each item, and the frequency of that word is computed from the measured value by a predetermined rule (such as a predetermined function) and used as the element of D(X, n). For this conversion from measured values to word frequencies, the applicant's JP 2015-32013 A (title of invention: numerical data analysis device and program), Japanese Patent Application No. 2013-163207 (numerical data analysis device and program), and Japanese Patent Application No. 2013-217817 (numerical data analysis device and program) may be used.

Besides numerical (quantitative) data of this kind, qualitative data (for example, whether a smoking habit is reported on a questionnaire) can likewise be converted by a predetermined rule into the frequency of a corresponding word and used as an element of D(X, n).

As described above, each of the words i₁, i₂, …, i_m corresponds to one health status evaluation item in the input medical data. These items include not only direct evaluations but also items that reflect health status indirectly, such as drug names in claim data. The documentation unit 1 generates the document vector D(X, n) by applying predetermined rules to the evaluation results for subject X at age n; a separate rule can be prepared for each of the words i₁, i₂, …, i_m.
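The documentation step can be sketched as follows for the simplest rules described above (presence as 1 or 0, or raw occurrence counts). The vocabulary, the tokenized record, and the function name `document_vector` are assumptions for illustration; the rules for numerical items in the cited applications are not reproduced here.

```python
from collections import Counter

# Assumed vocabulary standing in for the words i_1, ..., i_m.
VOCABULARY = ["diabetes", "cold", "insulin", "smoker"]


def document_vector(tokens, vocabulary=VOCABULARY, binary=False):
    """Build D(X, n) as word counts, or as 1/0 presence flags when binary=True."""
    counts = Counter(tokens)
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]


# Hypothetical tokens extracted from one subject's records at one age.
record = ["cold", "diabetes", "insulin", "diabetes", "headache"]

count_vec = document_vector(record)                # occurrence counts
binary_vec = document_vector(record, binary=True)  # presence flags
```

Words outside the prepared vocabulary, such as "headache" in this record, are simply ignored.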

FIG. 2 shows a schematic example of the entire medical data input to the documentation unit 1. As the example shows, the input data is assumed to contain gaps. To build a prediction model for medical expenses and the like accurately, it would be desirable to have data covering a long period, such as several decades, for each subject. In practice, however, as in the example of FIG. 2, often only data covering a short period of a few years is available.

Furthermore, even within the few years of data that exist for each subject, the data is assumed to contain missing values, noise, and the like, since, as noted above, each subject's health status has not necessarily been evaluated sufficiently. As explained below, the present invention makes it possible to obtain, from input data containing such missing values and noise, a clustering result that serves as a model capable of predicting medical expenses and the like with high accuracy.

In the example of FIG. 2, data exists for subject A at ages 40 to 43, so the documentation unit 1 outputs four data items from A's medical data: D(A,40), D(A,41), D(A,42), and D(A,43). For other subjects such as G and D, data is likewise output for each age at which medical data exists.

The clustering unit 2 clusters the data D(X,n) output by the documentation unit 1 for the set of subjects X and the set of ages n, and outputs the clustering result to the feature vector generation unit 3. Clustering by latent topic analysis can be used for this purpose, for example LDA (latent Dirichlet allocation), which is based on a latent topic model and is disclosed in Non-Patent Document 1 below, among others.
[Non-Patent Document 1] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.

Note in particular that, even for the same subject X, data D(X,n) and D(X,m) for different ages n and m (m ≠ n) are clustered as separate data items. For example, the data D(A,40), D(A,41), D(A,42), and D(A,43) for subject A at four different ages in the example of FIG. 2 are clustered as four distinct data items. Data D(X,n) and D(Y,m) for different subjects X and Y (where the ages n and m may be equal or different) are of course also clustered as separate data items.
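As a small illustration of this point, the corpus passed to clustering can be keyed by (subject, age) pairs, so that one subject contributes as many documents as there are ages with data. Only subject A's ages 40 to 43 come from the text; the ages for G and D below are invented for the sketch.

```python
# Subject -> ages at which medical data exists.
# A's ages follow the FIG. 2 example; G's and D's ages are hypothetical.
ages_with_data = {"A": [40, 41, 42, 43], "G": [55, 56], "D": [61]}

# Each (subject, age) pair becomes its own document for clustering.
document_keys = [
    (subject, age)
    for subject, ages in ages_with_data.items()
    for age in ages
]
```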

As indicated by line L1 in FIG. 1, in the first iteration (I = 1) the clustering unit 2 clusters all of the data D(X,n) output by the documentation unit 1 for the set of subjects X and the set of ages n.

By contrast, as indicated by line L2 in FIG. 1, in the second and each subsequent iteration (I ≥ 2) the clustering unit 2 clusters all of the data D(X,n) output by the feature reduction unit 7, described later.

As described in detail later with reference to FIG. 9 and other figures, the data D(X,n) output by the feature reduction unit 7 has the form of word frequency vectors, like the output of the documentation unit 1. However, the elements selected for reduction have been removed from the vectors (that is, the number of dimensions is reduced), and the data items used for model generation by the model generation unit 6 have been deleted from the full data originally output by the documentation unit 1 (that is, the number of data items is reduced).

Accordingly, the clustering unit 2 clusters by the same method in every iteration I and outputs the result to the feature vector generation unit 3, but its input data (and the output data derived from it) changes from iteration to iteration. The same holds for the downstream units 3 to 7 of the repetition unit 20: the processing in each iteration is the same, while the input data handled (and the resulting output data) changes.

In each iteration I, the feature vector generation unit 3 generates the feature vector of each data item D(X,n) by multiplying the topic ratios of D(X,n) in the clustering result output by the clustering unit 2 by the number of words in the document corresponding to D(X,n), and outputs the feature vectors to the threshold determination unit 4.

To explain the meaning of this processing, the content of the clustering result output by the clustering unit 2 is first described as a premise. The clustering result is obtained as a matrix decomposition of the kind shown in FIG. 3.

As shown in FIG. 3, in latent topic analysis such as LDA, the full data D to be classified (the input data D of the clustering unit 2 in iteration I) consists of documents u, each given as a frequency vector of words i. In the present invention, each document corresponds to a data item D(X,n) output by the documentation unit 1 or the feature reduction unit 7. The result of clustering the full data D is obtained as the matrix product D = θ × Φ, where the θ matrix represents the relationship between documents u and topics k and the Φ matrix represents the relationship between topics k and words i. The clustering unit 2 outputs this matrix decomposition.
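The shape of this decomposition can be sketched with a toy product. The sketch assumes the usual LDA reading, in which each row of θ is a document's topic distribution and each row of Φ is a topic's word distribution, so that θ × Φ gives per-document word proportions rather than raw counts. All numbers are invented.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (rows x inner) and b (inner x cols)."""
    return [
        [sum(a[r][t] * b[t][c] for t in range(len(b))) for c in range(len(b[0]))]
        for r in range(len(a))
    ]


# theta: 2 documents x 2 topics (each row sums to 1).
theta = [[0.75, 0.25],
         [0.50, 0.50]]
# phi: 2 topics x 3 words (each row sums to 1).
phi = [[0.5, 0.5, 0.0],
       [0.0, 0.5, 0.5]]

# Reconstructed word proportions: 2 documents x 3 words.
reconstructed = matmul(theta, phi)
```

Because both factors are row-stochastic in this reading, every row of `reconstructed` also sums to 1.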

If each topic k in the matrix decomposition is taken to correspond to one cluster, each row of the θ matrix, which gives the topic ratios of a document u, can be interpreted as the cluster membership probabilities of that document. These membership probabilities are the ratios of the topics k in the document u and form a vector expressing the health factors of the corresponding original data D(X,n). Thus, for example, the clustering result can be interpreted by regarding each document u (= each data item D(X,n)) as belonging to the cluster of the topic with the largest topic ratio.
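The "largest topic ratio" interpretation amounts to an argmax over each row of θ. A minimal sketch, with hypothetical topic ratios and a hypothetical helper name `dominant_topic`:

```python
# Hypothetical rows of the theta matrix, keyed by (subject, age).
theta_rows = {
    ("A", 40): [0.7, 0.2, 0.1],
    ("A", 41): [0.1, 0.1, 0.8],
    ("B", 52): [0.3, 0.6, 0.1],
}


def dominant_topic(topic_ratios):
    """Index of the topic with the largest ratio (first index wins on ties)."""
    return max(range(len(topic_ratios)), key=lambda k: topic_ratios[k])


# Hard cluster assignment: each document goes to its dominant topic.
clusters = {key: dominant_topic(row) for key, row in theta_rows.items()}
```

Note that the same subject can land in different clusters at different ages, as A does here.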

The feature vector generation unit 3 then obtains the feature vector of each data item D(X,n) by multiplying the topic ratios of the topics k in the corresponding document u (the corresponding row vector of the θ matrix in FIG. 3) by the total word count of the original data D(X,n). The resulting feature vector for D(X,n) is written θ(X,n) below. (Note that θ(X,n) denotes not the row vector of the θ matrix in FIG. 3 itself, but that row vector multiplied by the corresponding total word count.)

The total word count used to obtain the feature vector θ(X,n) is that of each data item D(X,n) immediately after output by the documentation unit 1; it does not depend on the iteration number I and is a constant for each D(X,n). Since each D(X,n) is, as described above, a frequency vector of the m predetermined words i₁, i₂, …, i_m representing health status, the total word count is the sum of the occurrence counts of the words i₁, i₂, …, i_m, that is, the sum of the elements of the frequency vector. Multiplying by this total gives the feature vector θ(X,n).

Multiplying by the total word count yields feature vectors that properly distinguish cases such as the following. Suppose that "cold" and "diabetes" each occur twice in subject A's claims at some age n, and each occur 50 times in subject B's claims at some age m. B's condition is presumably more severe, yet the topic ratios (which are not word counts), namely the ratio of the cold-related topic and the ratio of the diabetes-related topic, may differ little between A (age n) and B (age m). Multiplying by the total word count produces a feature vector that properly expresses the greater severity of B's condition.
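The A versus B example can be worked through numerically. The sketch assumes, for illustration only, that both documents receive identical topic ratios of 0.5 for a cold-related topic and 0.5 for a diabetes-related topic; the function name `feature_vector` is hypothetical.

```python
def feature_vector(topic_ratios, word_counts):
    """Scale topic ratios by the document's total word count to get theta(X, n)."""
    total_words = sum(word_counts)
    return [ratio * total_words for ratio in topic_ratios]


# Word counts for ["cold", "diabetes"], following the example in the text.
counts_a = [2, 2]    # subject A at age n
counts_b = [50, 50]  # subject B at age m

# Assumed: both documents get the same topic ratios, so ratios alone cannot
# tell A and B apart.
ratios = [0.5, 0.5]

theta_a = feature_vector(ratios, counts_a)
theta_b = feature_vector(ratios, counts_b)
```

After scaling, B's feature values are 25 times larger than A's, which restores the severity information that the ratios discard.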

In each iteration I, the threshold determination unit 4 determines, for each element of the feature vectors θ(X,n) output by the feature vector generation unit 3 for the data D(X,n), a threshold at which the element becomes "meaningful" as a feature, and outputs the thresholds to the data distinction unit 5.

Here, a threshold at which a feature is "meaningful" is defined per element of the feature vector θ(X,n) (each vector component, corresponding to a topic as explained with FIG. 3): when the value of an element is at or above the threshold determined for that element, the value is considered to contribute to the accuracy of prediction by the model generation unit 6, described later. Conversely, when the value of an element is below the threshold, it is considered not to contribute to that accuracy.

The following notation is used. The topics corresponding to the elements of the feature vector θ(X,n) are denoted k₁, k₂, …, k_M. Here M is the number of topics and hence the number of elements (the size, or dimension) of θ(X,n). M decreases with each iteration I through the processing of the feature reduction unit 7, described later, but its dependence on I is not written explicitly to keep the notation simple. The value of the element of θ(X,n) corresponding to the j-th topic k_j (j = 1, 2, …, M), that is, the j-th element of θ(X,n), is written θ(X,n)_[j]. The threshold that the threshold determination unit 4 determines for the element corresponding to topic k_j is written TH[k_j].

Concretely, the threshold determination unit 4 can compute the threshold TH[k_j] for the element of each topic k_j as follows. For each topic k_j, a candidate threshold TH is increased in small steps from 0 to 1, the "goodness" of each candidate value is computed, and the value judged optimal is taken as the threshold TH[k_j]. The sweep from 0 to 1 is performed with a predetermined step size. For example, a five-step sweep with an increment of 0.2 may be used: TH = 0, 0.2, 0.4, 0.6, 0.8.
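The sweep can be sketched as a grid search over candidate thresholds. The grid below reproduces the five-step example from the text. The scoring function `toy_goodness` is a pure placeholder, since the actual "goodness" is defined by the procedures described later in the patent.

```python
def sweep_values(step=0.2, start=0.0, stop=1.0):
    """Candidate thresholds start, start+step, ... strictly below stop."""
    n_steps = round((stop - start) / step)
    # round() removes floating-point residue such as 0.6000000000000001.
    return [round(start + i * step, 10) for i in range(n_steps)]


def best_threshold(candidates, goodness):
    """Return the candidate with the highest goodness score."""
    return max(candidates, key=goodness)


candidates = sweep_values()


# Placeholder score for illustration only: peaks near 0.45.
def toy_goodness(th):
    return -(th - 0.45) ** 2


chosen = best_threshold(candidates, toy_goodness)
```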

Here, sweeping the variable TH "from 0 to 1" means sweeping from 0 (minimum) to 1 (maximum) after the range that the element values θ(X,n)_[j] of topic k_j can take has been normalized to length 1; the same applies to the description below, including the description of FIGS. 5 and 6. That is, the range actually swept runs from the minimum element value min{θ(X,n)_[j]} to the maximum element value max{θ(X,n)_[j]} (or a predetermined range containing it), but for simplicity the description treats this range as normalized to the interval from 0 to 1. Accordingly, the optimal threshold TH[k_j] is output to the data distinction unit 5 as an actual value with the normalization undone. For example, if the optimal threshold TH[k_j] is obtained as a normalized value x (0 ≤ x ≤ 1) and the normalized interval corresponds to the range from the above minimum to the above maximum, the following actual value is output to the data distinction unit 5:
(1 - x) * min{θ(X,n)_[j]} + x * max{θ(X,n)_[j]}
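As a minimal sketch of the normalized sweep and of undoing the normalization, the following Python fragment may help. The function names `sweep_values` and `denormalize_threshold` are illustrative and do not appear in the specification.

```python
def sweep_values(step=0.2):
    """Normalized sweep values 0, step, 2*step, ... strictly below 1."""
    count = round(1.0 / step)
    # Rounding removes floating-point residue such as 0.6000000000000001.
    return [round(i * step, 10) for i in range(count)]


def denormalize_threshold(x, theta_min, theta_max):
    """Map a normalized threshold x in [0, 1] back onto [theta_min, theta_max]."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("normalized threshold must lie in [0, 1]")
    return (1.0 - x) * theta_min + x * theta_max
```

With a step of 0.2 the sweep yields the five values given in the text, and a normalized optimum of 0.4 on an element whose values range from 10 to 20 maps back to 14.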

The "goodness" of TH is an estimate of how well the prediction unit 8, described later, would perform the actual prediction, that is, of the prediction accuracy; it may be calculated by quantifying how easy the data selected by TH are to predict. Specifically, the "goodness" may be calculated for each swept value of TH by [Procedure 1] to [Procedure 3] below. Here, the value that the prediction unit 8 predicts from the feature vector θ(X,n) as input (that is, outputs as the prediction result) is denoted P(X,n).

According to [Procedure 1] to [Procedure 3] below, the "goodness" of the threshold TH is calculated as the strength of the tendency that the data selected by TH are easy to predict while the remaining data, not selected by the same TH, are hard to predict.

[Procedure 1] Using the threshold TH for the element of topic k_j, all data θ(X,n) (all data θ(X,n) output from the feature quantity vector generation unit 3 at the current I-th iteration) are divided into a first data group, in which the value θ(X,n)_[j] of the element for k_j is greater than or equal to TH, and a second data group, in which it is less than TH.

[Procedure 2] For each of the first and second groups of data θ(X,n) classified in [Procedure 1] as at or above the threshold TH or below it, a simple regression is performed with the value P(X,n) to be predicted by the prediction unit 8 as the objective variable.

In this simple regression, the explanatory variable is the value θ(X,n)_[j] of the element for the topic k_j under consideration. That is, the slope a and the intercept b of the linear expression (1) below are determined separately for the first data group and the second data group. As is well known for simple regression, they can be determined by, for example, the least-squares method.
P(X,n) = a * θ(X,n)_[j] + b … (1)

The value P(X,n) to be predicted is assumed to be known in advance for each data θ(X,n) (as so-called teacher data) and to be supplied linked to the data D(X,n) output by the documenting unit 1.

[Procedure 3] For the simple regression model "Y = aX + b" obtained in [Procedure 2] (X is the explanatory variable, Y the objective variable), the data for which the prediction is correct and the data for which it is not are counted in each of the first and second data groups, producing a 2×2 cross tabulation table as shown in FIG. 4. The AIC value (Akaike information criterion) of this table is calculated as AIC(k_j, TH), which quantifies the "goodness" of the threshold TH for the element of topic k_j.

Whether the simple regression model "Y = aX + b" is correct for a given data item is judged by comparing the predicted value a * θ(X,n)_[j] + b, obtained by applying expression (1) to θ(X,n)_[j], with the actual value P(X,n). If the absolute difference |P(X,n) - (a * θ(X,n)_[j] + b)| is within a predetermined judgment threshold, the prediction is judged correct; if it exceeds that threshold, the prediction is judged incorrect.
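The steps from [Procedure 1] up to the cross tabulation table of [Procedure 3] can be sketched as follows. This is only an illustration under stated assumptions: the names `fit_line` and `hit_miss_table` and the parameter `tolerance` (standing for the predetermined judgment threshold) are hypothetical, the row and column layout is assumed to follow FIG. 4, and a group whose explanatory values are all equal is given a horizontal line, a case the specification does not address.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        # Assumption: identical x values fall back to a horizontal line.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def hit_miss_table(thetas, targets, th, tolerance):
    """Return [[hits_1, hits_2], [misses_1, misses_2]] for threshold th.

    Column 0 is the first data group (theta >= th), column 1 the second.
    """
    groups = ([], [])
    for theta, p in zip(thetas, targets):
        groups[0 if theta >= th else 1].append((theta, p))

    table = [[0, 0], [0, 0]]
    for col, group in enumerate(groups):
        if not group:
            continue  # empty group: both of its cells stay 0
        a, b = fit_line([g[0] for g in group], [g[1] for g in group])
        for x, y in group:
            row = 0 if abs(y - (a * x + b)) <= tolerance else 1
            table[row][col] += 1
    return table
```

The resulting table is what the AIC calculation described later takes as input; the regression is refitted for every swept value of TH because the membership of the two groups changes with it.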

The "goodness" AIC(k_j, TH), being an AIC value, is better the smaller it is. Therefore, as in expression (2) below, the value of TH that minimizes AIC(k_j, TH) over the swept values can be determined as the optimal threshold TH[k_j].
TH[k_j] = argmin_TH AIC(k_j, TH) … (2)

A specific method for calculating AIC(k_j, TH) from the cross tabulation table of FIG. 4 is described later. With that method, AIC(k_j, TH) is obtained as the "AIC value of the dependent model" minus the "AIC value of the independent model", and the TH at which this difference is smallest (or any TH at which the difference is -2 or less) can be determined as the optimal value.
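The two selection rules just described (the minimizing TH, or any TH whose AIC difference is -2 or less) can be sketched as below. The function names and the dictionary layout, mapping each swept TH to its AIC difference, are assumptions made for illustration.

```python
def choose_threshold(aic_by_threshold):
    """Return the swept threshold whose AIC value is smallest."""
    if not aic_by_threshold:
        raise ValueError("no swept thresholds were evaluated")
    return min(aic_by_threshold, key=aic_by_threshold.get)


def acceptable_thresholds(aic_by_threshold, cutoff=-2.0):
    """Thresholds whose AIC difference is at or below the cutoff, ascending."""
    return sorted(th for th, aic in aic_by_threshold.items() if aic <= cutoff)
```

For instance, with made-up AIC differences `{0.0: 1.5, 0.2: -1.0, 0.4: -6.3, 0.6: -2.5, 0.8: 0.7}`, the first rule selects 0.4 and the second admits 0.4 and 0.6.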

FIG. 5 shows, for the first data group, examples in which the data subjected to simple regression are selected by each swept value of the threshold TH according to [Procedure 1] to [Procedure 3] and plotted with the explanatory variable θ(X,n)_[j] on the horizontal axis and the objective variable P(X,n) on the vertical axis. Here the objective variable P(X,n) is taken to be the medical expenses of the following year; that is, when the data θ(X,n) are the data of subject X at age n, P(X,n) is set to the medical expenses that subject X incurs at age n+1.

図５では、TH=0, 0.2, 0.4, 0.6, 0.8のように、増分0.2で5ステップのスイープを行った場合に選別されたデータのプロットが[1]〜[5]としてそれぞれ示されており、閾値を上げるにつれて単回帰のあてはまりが良くなってくる様子を見て取ることができる。 In FIG. 5, plots of data selected when a 5-step sweep is performed at an increment of 0.2, such as TH = 0, 0.2, 0.4, 0.6, and 0.8, are shown as [1] to [5], respectively. It can be seen how the single regression fits better as the threshold is raised.

The calculated AIC value, however, first decreases and then increases as TH is incremented, reaching its minimum partway through the improvement in the fit of the simple regression; in FIG. 5, for example, the minimum is reached at the threshold TH = 0.4 shown in [3]. (Depending on the step size of TH and the particular data distribution, the AIC value may instead increase monotonically or decrease monotonically as TH is raised. These trends describe the behavior after noise-like fluctuations are disregarded.)

Although FIG. 5 shows only the first data group, the second data group, consisting of the remaining data separated by the threshold TH, is likewise subjected to simple regression for each swept value of TH. In case [1] of FIG. 5, for example, the second data group is empty and no simple regression can be performed; in that case the AIC value may be obtained with the two corresponding cells of the cross tabulation table of FIG. 4 set to 0.

図６に、図５の[1]に示されるデータを全データとした際の、閾値THの設定により第一データ群と第二データ群を切り分ける模式的な例（それぞれについて単回帰を行う旨の説明が付与されている）を示す。 FIG. 6 is a schematic example of separating the first data group and the second data group by setting the threshold value TH when the data shown in [1] of FIG. Is described).

FIG. 7 shows a schematic example of the simple regression performed on the data of [2] in FIG. 5 (the first data group), together with a schematic example of the region containing the data counted as correct or incorrect predictions in [Procedure 3]. In FIG. 7, the straight line L0 is the regression line obtained from the data. Data lying between the lower-limit line L0_[lower], shifted down from L0 by a predetermined value in the vertical direction of the graph, and the upper-limit line L0_[upper], shifted up by the same value, are judged to be correctly predicted; data outside this region are judged to be incorrectly predicted.

As described above, for each element of the feature vector corresponding to a topic k_j, the threshold determination unit 4 obtains the threshold TH[k_j] at or above which prediction performance is judged to be ensured, and outputs the result to the data distinction unit 5.

The threshold determination unit 4 may additionally check whether the threshold TH[k_j] obtained by the above procedure is valid before outputting the result to the data distinction unit 5. Specifically, a test of no correlation, well known in statistics, may be applied to the first data group of [Procedure 1] determined by that threshold (the data at or above the threshold, judged to have ensured prediction performance), and the threshold may be output to the data distinction unit 5 after confirming that the test is passed.

If the test of no correlation is not passed, the test is applied in turn to the first data groups obtained under thresholds smaller than the threshold TH[k_j] that failed, and the largest threshold TH among those that pass is output to the data distinction unit 5. The smaller thresholds may be chosen from the same values used in the sweep of the above procedure, at the predetermined step size.

データ区別部5では、繰り返し処理の各I回においてそれぞれ、閾値決定部4より得られた閾値TH[k_j]により、当該I回目の処理対象となっているデータ（特徴量ベクトル生成部3が出力した特徴量ベクトルθ(X,n)の全部）を、予測可能群と予測不可能群とに分類して区別し、当該区別した結果をモデル生成部6及び特徴量削減部7へと出力する。 In the data distinction unit 5, the data (feature quantity vector generation unit 3 is the target of the I-th processing) according to the threshold value TH [k _j ] obtained from the threshold value determination unit 4 at each I time of the iterative processing. All of the output feature quantity vectors θ (X, n)) are classified into a predictable group and an unpredictable group, and the discrimination results are output to the model generation unit 6 and the feature quantity reduction unit 7 To do.

Specifically, for each data θ(X,n), the element value θ(X,n)_[j] corresponding to each topic k_j is compared with the threshold TH[k_j]. Data θ(X,n) having at least one element at or above its threshold, that is, at least one element with θ(X,n)_[j] ≥ TH[k_j], are judged to belong to the predictable group. Otherwise, that is, when θ(X,n)_[j] < TH[k_j] holds for the elements of all topics k_j, the data θ(X,n) are judged to belong to the unpredictable group.

例えば、トピックが3種類(k₁,k₂,k₃)であり（すなわちデータθ(X,n)が３次元であり）、対応する閾値が(10,20,30)である場合に、データ(11,9,5)は３要素のうち少なくとも第一要素における値11が閾値10以上となっているので、予測可能群に属するものと判断され、データ(1, 2, 3)は全要素の値が対応する閾値未満であるので、予測不可能群に属するものと判断される。 For example, if there are three types of topics (k ₁ , k ₂ , k ₃ ) (ie, data θ (X, n) is three-dimensional) and the corresponding threshold is (10, 20, 30), Data (11,9,5) is judged to belong to the predictable group because the value 11 in at least the first element of the three elements is greater than or equal to the threshold 10, and data (1, 2, 3) is all Since the value of the element is less than the corresponding threshold value, it is determined that it belongs to the unpredictable group.
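The rule applied by the data distinction unit 5 reduces to an "any element at or above its threshold" test. A minimal sketch, with illustrative function names that are not part of the specification:

```python
def is_predictable(theta, thresholds):
    """True if at least one element reaches its per-topic threshold."""
    return any(value >= th for value, th in zip(theta, thresholds))


def split_groups(vectors, thresholds):
    """Split feature vectors into (predictable, unpredictable) lists."""
    predictable, unpredictable = [], []
    for theta in vectors:
        target = predictable if is_predictable(theta, thresholds) else unpredictable
        target.append(theta)
    return predictable, unpredictable
```

Applied to the example in the text, thresholds (10, 20, 30) place (11, 9, 5) in the predictable group and (1, 2, 3) in the unpredictable group.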

モデル生成部6では、繰り返し処理の各I回においてそれぞれ、データ区別部5が出力した予測可能群のデータを入力として、健康状態その他の予測をするためのモデルを生成して予測部8へと出力する。 In the model generation unit 6, at each I time of the iterative process, the predictable group data output from the data distinction unit 5 is input, and a model for predicting the health condition and the like is generated, and the prediction unit 8 Output.

In the simplest embodiment, the entire data θ(X,n) of the predictable group output by the data distinction unit 5 can be output as is to the prediction unit 8 (or to a user who performs data analysis). The data θ(X,n) of the predictable group consist of elements that the feature amount reduction unit 7, described later, has reduced at each iteration. (In the first iteration the data have not passed through the feature amount reduction unit 7, so they consist of all elements as output by the documenting unit 1.)

当該予測可能群のデータは、前述の課題で述べたような欠損やノイズ等が自動的に除外されたものとして構成されることとなるので、種々の予測を実施する際の元データとして利用することができる。 Since the data of the predictable group is configured such that defects, noise, and the like as described in the above problem are automatically excluded, it is used as original data when performing various predictions. be able to.

一方、一実施形態では、モデル生成部6では予測部8で行う具体的な予測方式に整合させた形で、予測可能群から具体的な予測モデルを生成したうえで、予測部8へと出力することもできる。すなわち、データθ(X,n)を入力として、当該n歳の対象者Xの翌年（n+1歳の時点）の医療費といったような所定種類の予測値P(X,n)を出力するためのモデル式を求めて、予測部8へと出力するようにしてもよい。 On the other hand, in one embodiment, the model generation unit 6 generates a specific prediction model from the predictable group in a form consistent with the specific prediction method performed by the prediction unit 8, and then outputs it to the prediction unit 8. You can also That is, with data θ (X, n) as an input, a predetermined type of predicted value P (X, n) such as medical expenses for the next year (at the time of n + 1) of the subject X who is n years old is output. A model formula for this may be obtained and output to the prediction unit 8.

ここで、モデル式の構築については、入力データθ(X,n)（すなわち、文書化部1に入力された各データのうち繰り返し処理の当該I回目において予測可能群と判定されたデータ）に対して予め、出力データP(X,n)を教師データとして与えておき、SVR（サポートベクトル回帰）などの回帰計算の式として構築することができる。 Here, for the construction of the model formula, input data θ (X, n) (that is, data determined as a predictable group in the I-th iteration of the data input to the documenting unit 1) On the other hand, the output data P (X, n) can be given as teacher data in advance, and can be constructed as a regression calculation expression such as SVR (support vector regression).

特徴量削減部7では、繰り返し処理の各I回においてそれぞれ、データ区別部5が出力した予測不可能群に属するデータθ(X,n)（に対応する元の単語頻度ベクトルD(X,n)）を、繰り返し処理の次の回であるI+1回目の入力としてクラスタリング部2へと出力する。この際、特徴量削減部7では単語頻度ベクトルD(X,n)のうち、ノイズとなり、予測性能に寄与しないと考えられるような単語iを特定し、当該特定された単語iは削除することでベクトル次元数を減らした形で、I回目の予測不可能群の各ベクトルD(X,n)をI+1回目の処理対象としてクラスタリング部2へと出力する。 In the feature amount reduction unit 7, the original word frequency vector D (X, n corresponding to the data θ (X, n) (belonging to the unpredictable group output from the data discrimination unit 5 at each I time of the iterative processing, respectively. )) Is output to the clustering unit 2 as the I + 1th input, which is the next iteration. At this time, the feature amount reduction unit 7 identifies a word i that is considered to be noise and does not contribute to the prediction performance from the word frequency vector D (X, n), and deletes the identified word i. Then, each vector D (X, n) of the I-th unpredictable group is output to the clustering unit 2 as an I + 1-th processing target in a form in which the number of vector dimensions is reduced.

Words i that do not contribute to prediction performance are identified by creating a cross tabulation table as shown in FIG. 8 and calculating its AIC value (in the same way as the threshold determination unit 4 does with the cross tabulation table of FIG. 4). A word i for which the dependent model, described later, is judged to be the better model is judged to contribute to prediction performance and is retained; a word i for which it is not is deleted.

A word i for which the dependent model is judged better is a word whose distribution in the predictable group differs significantly from its distribution in the unpredictable group. It is therefore retained as a word that may be able to predict the unpredictable group (the group distinguished at the current I-th iteration) analyzed further in the (I+1)-th and subsequent iterations. Conversely, a word i not so judged is a word whose distributions in the two groups are judged not to differ; it is deleted as a word that has no predictive ability in the further analysis and could therefore become noise in clustering and similar analysis. (An example of retaining and deleting words is described later with reference to FIG. 9.) When counting the cross tabulation table of FIG. 8, a word i is regarded as present if its frequency in the original word frequency vector D(X,n) is at or above a threshold and as absent if below it. The threshold may be defined in advance for each word i.
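The word selection performed by the feature amount reduction unit 7 can be sketched as follows. Several points are assumptions rather than statements of the specification: documents are represented as dictionaries from word to frequency, "the dependent model is judged better" is read as a negative AIC difference, and the AIC values use the standard contingency-table forms with 3 and 2 free parameters for a 2×2 table. All function names are hypothetical.

```python
import math


def _term(count, total):
    """count * log(count / total), taking 0 * log 0 as 0."""
    return count * math.log(count / total) if count > 0 else 0.0


def aic_difference(table):
    """AIC(dependent) - AIC(independent) for a 2 x 2 table of counts."""
    total = sum(map(sum, table))
    cells = [n for row in table for n in row]
    margins = [sum(row) for row in table] + [sum(col) for col in zip(*table)]
    aic_dm = -2.0 * sum(_term(n, total) for n in cells) + 2.0 * 3
    aic_im = -2.0 * sum(_term(n, total) for n in margins) + 2.0 * 2
    return aic_dm - aic_im


def presence_table(word, predictable, unpredictable, min_freq):
    """Rows: word present / absent. Columns: predictable / unpredictable."""
    table = [[0, 0], [0, 0]]
    for col, group in enumerate((predictable, unpredictable)):
        for doc in group:
            row = 0 if doc.get(word, 0) >= min_freq[word] else 1
            table[row][col] += 1
    return table


def words_to_keep(vocabulary, predictable, unpredictable, min_freq):
    """Keep words whose presence differs between the two groups."""
    return [
        w for w in vocabulary
        if aic_difference(presence_table(w, predictable, unpredictable, min_freq)) < 0.0
    ]
```

A word that appears in 30 of 40 predictable records but only 10 of 40 unpredictable ones is kept under this reading, while a word present in half of each group is dropped.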

予測部8では、モデル生成部6が繰り返し処理の各I回目において出力した予測モデル式をそれぞれ保持しておき、ユーザ指示を受け取って当該予測モデルを適用した予測結果をユーザへと出力する。ユーザ指示としては、各I回の予測モデルのうちどれを利用するかの指示と、予測対象データと、を与えることで、予測部8は当該予測対象データに対する予測結果を出力することができる。 The prediction unit 8 holds the prediction model formula output by the model generation unit 6 at each I-th iteration process, receives a user instruction, and outputs a prediction result to which the prediction model is applied to the user. By giving an instruction as to which of the I prediction models is to be used and prediction target data as the user instruction, the prediction unit 8 can output a prediction result for the prediction target data.

なお、予測対象データは当該I回目の予測モデルをモデル生成部6で構築した際の特徴量ベクトルθ(X,n)と同様の要素で構成されるものとして、ユーザ側で用意しておいたうえで与える。これにより、予測部8ではモデル生成部6が出力した予測式（関数fとする）を適用して、f(θ(X,n))として予測値P(X,n)を出力することができる。 Note that the prediction target data is prepared on the user side as being composed of the same elements as the feature quantity vector θ (X, n) when the I-th prediction model is constructed by the model generation unit 6. Give above. As a result, the prediction unit 8 can apply the prediction expression (referred to as function f) output from the model generation unit 6 and output the predicted value P (X, n) as f (θ (X, n)). it can.

以上、図１の各部1〜8をそれぞれ説明した。ここで、図９を参照して、図１の繰り返し処理部20による繰り返し処理で得られるデータについて説明する。 In the above, each part 1-8 of FIG. 1 was demonstrated, respectively. Here, with reference to FIG. 9, data obtained by the iterative processing by the iterative processing unit 20 in FIG. 1 will be described.

FIG. 9 shows, in [0], all data D output by the documenting unit 1 and, as a Venn diagram, the data D1 judged to be the predictable group in the first iteration (white region), the data D2 judged to be the predictable group in the second iteration (vertically striped region), and the data D3 judged to be the predictable group in the third iteration (horizontally striped region). As illustrated, in this example three iterations suffice for all data D to be judged predictable, each part in its own iteration, so that D = D1 ∪ D2 ∪ D3.

そして、繰り返し処理の1〜3回目の各回で得られる結果が、欄[1]〜[3]にそれぞれ示されている。[1]に示すように、当初の全データDは6単語要素の単語頻度ベクトルとして構成されているものとし、繰り返し処理の1回目では、当該全6単語(i₁, i₂, i₃, i₄, i₅, i₆)で単語頻度ベクトルを構成し、全データDを入力としてクラスタリング部2以降の処理が行われ、結果としてデータD1が予測可能群と判定され、D1に基づいた予測モデルが生成される。また、特徴量削減部7にて2単語(i₁, i₂)を要素から削除する。 The results obtained in each of the first to third iterations are shown in columns [1] to [3], respectively. As shown in [1], it is assumed that the initial all data D is configured as a word frequency vector of 6 word elements, and in the first iteration, all the 6 words (i ₁ , i ₂ , i ₃ , i ₄ , i ₅ , i ₆ ) constitute a word frequency vector, and all the data D is input to perform processing from the clustering unit 2 onward. As a result, the data D1 is determined to be a predictable group, and prediction based on D1 A model is generated. Further, the feature amount reduction unit 7 deletes two words (i ₁ , i ₂ ) from the element.

[2]に示すように、繰り返し処理の2回目では、削除されずに残った4単語(i₃, i₄, i₅, i₆)で単語頻度ベクトルを構成し、1回目の処理で予測不可能群と判定されたデータD＼D1を入力としてクラスタリング部2以降の処理が行われ、結果としてデータD2が予測可能群と判定され、D2に基づいた予測モデルが生成される。また、特徴量削減部7にて2単語(i₃, i₄)を要素から削除する。 As shown in [2], in the second iteration, the word frequency vector is composed of four words (i ₃ , i ₄ , i ₅ , i ₆ ) that remain without being deleted, and predicted in the first process Data D \ D1 determined as an impossible group is input to perform processing after the clustering unit 2, and as a result, the data D2 is determined as a predictable group, and a prediction model based on D2 is generated. Further, the feature amount reduction unit 7 deletes two words (i ₃ , i ₄ ) from the element.

なお、数学記号として周知のように「＼」は集合演算において「除く」を意味し、上記ではデータD＼D1とは、データDからデータD1を除いた残りのデータを意味する。 As is well known as a mathematical symbol, “\” means “exclude” in the set operation, and in the above, data D \ D1 means the remaining data obtained by removing data D1 from data D.

最後に、[3]に示すように3回目の処理では、削除されずに残った2単語(i₅, i₆)で単語頻度ベクトルを構成し、2回目の処理で予測不可能群と判定されたデータD＼(D1∪D2)＝D3を入力としてクラスタリング部2以降の処理が行われ、結果としてデータD3の全体が予測可能群と判定され、D3に基づいた予測モデルが生成される。 Finally, as shown in [3], in the third process, the word frequency vector is composed of two words (i ₅ , i ₆ ) that remain without being deleted, and the second process determines that the group is unpredictable The data D \ (D1∪D2) = D3 is input, and the processing after the clustering unit 2 is performed. As a result, the entire data D3 is determined as a predictable group, and a prediction model based on D3 is generated.

FIG. 9 shows an example in which three iterations cause all data to be judged predictable, each part with its own prediction model. The number of iterations may instead be specified by the user, with the iterative processing stopped once that number is reached. Alternatively, no number may be given in advance and the iterative processing may end when a predetermined stopping condition is satisfied. For example, processing may stop when all data distinguished by the data distinction unit 5 at the I-th iteration fall into the predictable group, or all fall into the unpredictable group, or when the number of data items judged unpredictable at the I-th iteration, or their number of dimensions, falls to or below a threshold.
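The stopping conditions listed above can be illustrated with a small driver loop. This is a sketch only: `run_round` is a hypothetical callable standing in for one pass through units 2 to 7, and `max_rounds`, `min_items`, and `min_words` are illustrative parameter names for the user-specified count and the two thresholds.

```python
def run_iterations(data, vocabulary, run_round, max_rounds=None,
                   min_items=0, min_words=0):
    """Collect the predictable group found in each round.

    run_round(data, vocabulary) must return a tuple
    (predictable, unpredictable, kept_vocabulary).
    """
    predictable_groups = []
    rounds = 0
    while data:
        rounds += 1
        predictable, unpredictable, vocabulary = run_round(data, vocabulary)
        predictable_groups.append(predictable)
        if not predictable or not unpredictable:
            break  # everything fell into a single group
        if max_rounds is not None and rounds >= max_rounds:
            break  # user-specified number of rounds reached
        if len(unpredictable) <= min_items or len(vocabulary) <= min_words:
            break  # too little data or too few dimensions remain
        data = unpredictable
    return predictable_groups


def toy_round(data, vocabulary):
    """Toy stand-in: the first two items become predictable, two words are dropped."""
    return data[:2], data[2:], vocabulary[2:]
```

With six items, six words, and `toy_round`, the loop finishes after three rounds with three predictable groups, mirroring the shape of the FIG. 9 example; passing `max_rounds=2` stops it one round earlier.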

以上、本発明によれば、図９に例示したように、最終的に特徴量とその特徴量から予測可能な集団とが複数特定されることになるため、当該特徴量を使用してSVR等を利用すれば予測精度が向上する。この際、これらの特徴量以外の特徴量を補助情報として用いても良い。 As described above, according to the present invention, as illustrated in FIG. 9, a plurality of feature amounts and a group that can be predicted from the feature amounts are finally specified. Use of to improve prediction accuracy. At this time, feature quantities other than these feature quantities may be used as auxiliary information.

That is, the present invention uses a topic model, in particular latent Dirichlet allocation (LDA), to reveal the latent topics of the data and predicts medical expenses using those topics as features. Topics generated from medical data such as ordinary medical statements inevitably contain noise: a particular drug may be used for several diseases, and diabetic patients and healthy people alike can catch a cold. A cut-off criterion is therefore applied to the obtained topic values, and smoothing is performed, to identify subjects whose features have at least a certain predictive power. Subjects who have only features unusable for prediction are then separated, features are selected, and topic classification is performed again. Repeating this procedure makes it possible to predict medical expenses for groups that are hard to predict using only the important features, which can improve prediction accuracy.

以下、本発明における補足的事項を説明する。 Hereinafter, supplementary matters in the present invention will be described.

（１）図４や図８のクロス集計表からAIC値を算出することについて (1) About calculating the AIC value from the cross tabulation table of FIG. 4 or FIG.

図１０は、図４や図８のクロス集計表を一般化した表である。すなわち、図１０の集計数n_ijは図８、図９等と共通のものを一般の場合として示しており、何らかの基準に該当するか否かを縦軸（行要素）として、クラスタ等の分類結果を横軸（列要素）として、構成されている。なお、図１０ではサイズが2×mと一般化されているが、m=2とすることにより、図４や図８の場合に該当させることができる。 FIG. 10 is a generalized table of the cross tabulation tables of FIG. 4 and FIG. That is, the total number n _ij in FIG. 10 shows the common case as in FIG. 8, FIG. 9, etc. as a general case. The result is configured with the horizontal axis (column element). In FIG. 10, the size is generalized to 2 × m, but by setting m = 2, the case of FIGS. 4 and 8 can be applied.

すなわち、図４の「予測が的中した人数」及び「予測が的中しなかった人数」がそれぞれ、図１０の「該当」及び「未該当」に対応し、図４の「第一データ群」及び「第二データ群」がそれぞれ、図１０の「クラスタ１」及び「クラスタ２」に対応する。また、図８の「当該単語が存在する人数」及び「当該単語が存在しない人数」がそれぞれ、図１０の「該当」及び「未該当」に対応し、図８の「予測可能群」及び「予測不可能群」がそれぞれ、図１０の「クラスタ１」及び「クラスタ２」に対応する。 That is, “number of people who predicted” and “number of people who did not hit” in FIG. 4 correspond to “applicable” and “not applicable” in FIG. 10, respectively, and “first data group” in FIG. "And" second data group "correspond to" cluster 1 "and" cluster 2 "in FIG. 10, respectively. Further, “the number of people in which the word exists” and “the number of people in which the word does not exist” in FIG. 8 correspond to “applicable” and “not applicable” in FIG. 10, respectively, and “predictable group” and “ “Unpredictable group” corresponds to “cluster 1” and “cluster 2” in FIG.

As shown in FIG. 10, the marginal frequencies k_i (i = 1, 2, …, m), h, N-h, and so on can be calculated directly from the counts n_ij in the cross tabulation table, and the AIC value can be calculated from these values as follows.

当該AIC値は、次のいずれかの手法の値として求める。第一手法では、当該クロス集計表に対して従属モデルを適用することにより、以下の[式1]のような従属モデルのAIC値AIC(DM)[ここでDMはDependent Modelの略である]として求める。第二手法では、さらに、当該クロス集計表に対して独立モデルを適用して、以下の[式2]のような独立モデルのAIC値AIC(IM)[ここでIMはIndependent Modelの略である]を求めたうえで、[式3]のように、従属モデルのAIC値から独立モデルのAIC値を引いた差の値として、求める。 The AIC value is obtained as one of the following methods. In the first method, by applying a dependent model to the cross tabulation table, the AIC value AIC (DM) of the dependent model as shown in [Formula 1] below, where DM is an abbreviation of Dependent Model Asking. In the second method, an independent model is applied to the cross tabulation table, and the AIC value AIC (IM) of the independent model as shown in [Equation 2] below, where IM stands for Independent Model. Then, as [Equation 3], the difference is obtained by subtracting the AIC value of the independent model from the AIC value of the dependent model.

なお、[式1]等においてMLL(DM)は、従属モデルにおける最大対数尤度であって、[式1-2]のような値として求めることができる。また、[式2]等において、MLL(IM)は、独立モデルにおける最大対数尤度であって、[式2-2]のような値として求めることができる。なお、上記の各式における文字は、図１２のクロス集計表において説明した通りであり、以降説明する各式においても同様である。 In [Expression 1] and the like, MLL (DM) is the maximum log likelihood in the dependent model, and can be obtained as a value like [Expression 1-2]. In [Expression 2] and the like, MLL (IM) is the maximum log likelihood in the independent model, and can be obtained as a value as in [Expression 2-2]. The characters in the above equations are as described in the cross tabulation table of FIG. 12, and the same applies to the equations described below.
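The expressions referred to as [Expression 1] to [Expression 3] are not legible in this text, so the following Python sketch assumes the standard contingency-table forms: the dependent model uses the cell proportions n/N with 2m-1 free parameters (the figure stated in the derivation below), and the independent model uses the row and column marginal proportions with m free parameters (an assumption from standard theory, not taken from the specification). Function names are illustrative.

```python
import math


def _term(count, total):
    """count * log(count / total), taking 0 * log 0 as 0."""
    return count * math.log(count / total) if count > 0 else 0.0


def aic_dependent(table):
    """AIC of the dependent model for a 2 x m table of counts."""
    m = len(table[0])
    total = sum(sum(row) for row in table)
    if total == 0:
        raise ValueError("the table contains no observations")
    mll = sum(_term(n, total) for row in table for n in row)
    return -2.0 * mll + 2.0 * (2 * m - 1)


def aic_independent(table):
    """AIC of the independent model for a 2 x m table of counts."""
    m = len(table[0])
    total = sum(sum(row) for row in table)
    if total == 0:
        raise ValueError("the table contains no observations")
    row_sums = [sum(row) for row in table]        # h and N - h
    col_sums = [sum(col) for col in zip(*table)]  # k_1 ... k_m
    mll = sum(_term(c, total) for c in row_sums + col_sums)
    return -2.0 * mll + 2.0 * m


def aic_difference(table):
    """Second method: AIC(dependent) minus AIC(independent)."""
    return aic_dependent(table) - aic_independent(table)
```

Under these assumptions a table whose rows have identical proportions gives a positive difference (the extra parameters of the dependent model are penalized), while a strongly unbalanced table such as `[[30, 10], [10, 30]]` gives a clearly negative one.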

The following explains how the maximum log-likelihood MLL(DM) under the dependent model and the maximum log-likelihood MLL(IM) under the independent model are derived as [Formula 1-2] and [Formula 2-2] above, respectively, and how, using these maximum log-likelihoods, the AIC value of the dependent model is derived as [Formula 1] and the AIC value of the independent model as [Formula 2].

FIG. 11 is a table of the probabilities under the dependent model corresponding to the cross tabulation table of FIG. 10, provided to explain the calculations for the dependent model given as [Formula 1] and [Formula 1-2]. Using the probabilities shown in this table, the calculation proceeds as follows.

First, the random variables of the dependent model are as follows.

However, not all of the 2m values shown in FIG. 13 can vary freely; they are subject to the following constraint.

Accordingly, the dependent model has 2m-1 degrees of freedom, and the term 2*(2m-1) in [Formula 1] follows from the definition of AIC (AIC = -2 × MLL + 2 × degrees of freedom). Furthermore, computing the log-likelihood LL from the above random variables gives the following.

The condition for maximizing the log-likelihood LL is as follows.

From this maximization condition, the following is obtained.

Proceeding in the same way, one further obtains

and so on. Setting

then gives

and so on; summing these gives

so the maximum likelihood estimate is attained in the following case.

Substituting these values into LL therefore gives the aforementioned [Formula 1-2] as its maximum.
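The individual equations of this derivation are not reproduced in this text. One way to fill in the steps, offered as a sketch rather than as the publication's own equations, is shown below. It assumes cell probabilities p_{ij} (i = 1, ..., m; j = 1, 2) that sum to one, omits the multinomial coefficient because it does not depend on the parameters, and introduces the symbol c for the common ratio purely for illustration.

```latex
\begin{align}
&\sum_{i=1}^{m}\sum_{j=1}^{2} p_{ij} = 1
  \quad\Longrightarrow\quad p_{m2} = 1 - \sum_{(i,j)\neq(m,2)} p_{ij}, \\
&LL = \sum_{i=1}^{m}\sum_{j=1}^{2} n_{ij}\log p_{ij}, \\
&\frac{\partial LL}{\partial p_{ij}} = \frac{n_{ij}}{p_{ij}} - \frac{n_{m2}}{p_{m2}} = 0
  \quad\Longrightarrow\quad \frac{n_{ij}}{p_{ij}} = \frac{n_{m2}}{p_{m2}} = c \ \text{ for all } (i,j), \\
&p_{ij} = \frac{n_{ij}}{c}
  \quad\Longrightarrow\quad 1 = \sum_{i,j} p_{ij} = \frac{N}{c}
  \quad\Longrightarrow\quad c = N, \qquad \hat{p}_{ij} = \frac{n_{ij}}{N}, \\
&\mathrm{MLL}(\mathrm{DM}) = \sum_{i=1}^{m}\sum_{j=1}^{2} n_{ij}\log\frac{n_{ij}}{N}.
\end{align}
```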

FIG. 12 is a table of the probabilities under the independent model corresponding to the cross tabulation table of FIG. 10, provided to explain the calculations for the independent model given as [Formula 2] and [Formula 2-2]. Using the probabilities shown in this table, the calculation proceeds as follows.

First, the marginal frequency k_m of FIG. 10 and the corresponding marginal probability q_m of FIG. 12 are subject to the following constraints.

Accordingly, only q_1 through q_{m-1} and p can vary freely, so the parameters have (m-1)+1 = m degrees of freedom, and the term 2×m in [Formula 2] follows from the definition of AIC. The random variables of the independent model are as follows.

Its log-likelihood LL is therefore as follows.

To find the condition that maximizes the log-likelihood, LL is partially differentiated with respect to p, q_1, and so on, and each derivative is set equal to zero, which leads to the following series of calculations.

Therefore,

holds. Also, setting

gives

and so on; summing these gives

and thus

so the maximum likelihood is attained at

and so on. Substituting these values into LL therefore gives [Formula 2-2] as its maximum.
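As with the dependent model, the individual equations are not reproduced in this text. The following is a sketch of one consistent way to carry out the steps, assuming that the probability of a cell is p·q_i in the "applicable" row and (1-p)·q_i in the "not applicable" row, that the multinomial coefficient is omitted, and that c is an auxiliary symbol introduced only for illustration.

```latex
\begin{align}
&k_m = N - \sum_{i=1}^{m-1} k_i, \qquad q_m = 1 - \sum_{i=1}^{m-1} q_i, \\
&LL = \sum_{i=1}^{m}\bigl[\, n_{i1}\log(p\,q_i) + n_{i2}\log\bigl((1-p)\,q_i\bigr) \bigr]
    = h\log p + (N-h)\log(1-p) + \sum_{i=1}^{m} k_i\log q_i, \\
&\frac{\partial LL}{\partial p} = \frac{h}{p} - \frac{N-h}{1-p} = 0
  \quad\Longrightarrow\quad \hat{p} = \frac{h}{N}, \\
&\frac{\partial LL}{\partial q_i} = \frac{k_i}{q_i} - \frac{k_m}{q_m} = 0
  \quad\Longrightarrow\quad \frac{k_i}{q_i} = \frac{k_m}{q_m} = c \quad (i = 1,\dots,m-1), \\
&q_i = \frac{k_i}{c}
  \quad\Longrightarrow\quad 1 = \sum_{i=1}^{m} q_i = \frac{N}{c}
  \quad\Longrightarrow\quad c = N, \qquad \hat{q}_i = \frac{k_i}{N}, \\
&\mathrm{MLL}(\mathrm{IM}) = h\log\frac{h}{N} + (N-h)\log\frac{N-h}{N} + \sum_{i=1}^{m} k_i\log\frac{k_i}{N}.
\end{align}
```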

(2) The documenting unit 1 has been described as converting input health checkup data and other medical data into data D(X, n), a bag-of-words representation of the health state of each subject X at each age n. However, if the input data has already been converted into this bag-of-words format, the documenting unit 1 may be omitted.

(3) The documenting unit 1 has been described as generating documented medical data D(X, n) for each subject X at each age n, with the age n given in one-year steps. However, the step is not limited to one year; the data D(X, n) may be generated with the ages n delimited by a predetermined period of any length (for example, two years or half a year).
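The effect of the period length can be illustrated with a small sketch. Everything here is hypothetical: the record layout (subject id, age in years, word), the function name `build_documents`, the `period_years` parameter, and the example words are assumptions for illustration and are not taken from this publication.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (subject id, age in years, word).
Record = Tuple[str, float, str]


def build_documents(records: Iterable[Record],
                    period_years: float = 1.0) -> Dict[Tuple[str, int], Counter]:
    """Group words into bag-of-words data D(X, n), one bag per subject X and period index n."""
    if period_years <= 0:
        raise ValueError("period_years must be positive")
    docs = defaultdict(Counter)
    for subject, age, word in records:
        n = int(age // period_years)  # index of the period that contains this age
        docs[(subject, n)][word] += 1
    return dict(docs)


# Made-up records for a single subject.
records = [
    ("A", 40.2, "hypertension"),
    ("A", 40.9, "hypertension"),
    ("A", 41.5, "statin"),
]
yearly = build_documents(records, period_years=1.0)
biennial = build_documents(records, period_years=2.0)
```

With one-year periods the records fall into two separate bags, whereas with two-year periods they merge into a single bag for the same subject.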

(4) The present invention can also be provided as a program that causes a computer to function as all of the units 1 to 8 of the selection device 10 or as any subset of them. The computer may have a well-known hardware configuration including a CPU (central processing unit), memory, and various interfaces, with the CPU executing instructions corresponding to the functions of each unit of the selection device 10.

DESCRIPTION OF SYMBOLS: 10…selection device, 1…documenting unit, 2…clustering unit, 3…feature vector generation unit, 4…threshold determination unit, 5…data distinction unit, 6…model generation unit, 7…feature reduction unit, 8…prediction unit

Claims

1. A selection device comprising:
a clustering unit that obtains a clustering result by clustering, through latent topic analysis, health state data of a series of subjects, the data being given in bag-of-words format as a collection of individual data for each subject and each age;
a feature vector generation unit that generates a feature vector for each individual data from the topic ratios of that individual data in the clustering result;
a threshold determination unit that determines, for each element of the feature vector, a threshold as a value judged to yield a correlation with a predetermined prediction target related to a health state; and
a data distinction unit that determines, for each individual data, that individual data having a feature vector element whose value is equal to or greater than the threshold belongs to a predictable group, and that individual data having no such element belongs to an unpredictable group,
wherein the selection device outputs the individual data determined by the data distinction unit to belong to the predictable group.

2. The selection device according to claim 1, further comprising a feature reduction unit that, for the individual data of the unpredictable group determined by the data distinction unit, retains as bag-of-words elements only those words judged to have a significant difference in their distribution of occurrence in the individual data between the predictable group and the unpredictable group, and outputs the resulting individual data of the unpredictable group to the clustering unit,
wherein the clustering unit, the feature vector generation unit, the threshold determination unit, the data distinction unit, and the feature reduction unit perform iterative processing for each iteration I, and the individual data of the unpredictable group output by the feature reduction unit at the I-th iteration is subjected to clustering by the clustering unit at the next, (I+1)-th, iteration.

3. The selection device according to claim 2, further comprising a model generation unit that receives the feature vectors of the individual data determined to belong to the predictable group at each iteration I of the iterative processing and constructs a prediction model for the predetermined prediction target related to the health state.

4. The selection device according to claim 3, further comprising a prediction unit that receives from a user an instruction specifying which of the prediction models constructed by the model generation unit at each iteration I of the iterative processing is to be used for prediction, together with a designation of prediction target data, and outputs a prediction result for the predetermined prediction target related to the health state by applying the designated prediction target data to the designated prediction model.

5. The selection device according to claim 2, wherein the feature reduction unit determines, for each word of the bag of words in the individual data, whether there is a significant difference on the basis of a cross tabulation table created by counting, in each of the predictable group and the unpredictable group determined by the data distinction unit, the number of individual data in which the bag-of-words element corresponding to the word is determined by threshold determination to be present and the number of individual data in which it is determined not to be present.

6. The selection device according to claim 1, wherein the threshold determination unit, for each element of the feature vector, sweeps a temporary threshold as a variable,
performs regression prediction using, as the explanatory variable, the element value in the series of feature vectors whose element value is equal to or greater than the temporary threshold and, as the objective variable, the value of the predetermined prediction target related to the health state in the individual data corresponding to each such feature vector, and counts the numbers of correct and incorrect predictions as a first count,
performs regression prediction using, as the explanatory variable, the element value in the series of feature vectors whose element value is less than the temporary threshold and, as the objective variable, the value of the predetermined prediction target related to the health state in the individual data corresponding to each such feature vector, and counts the numbers of correct and incorrect predictions as a second count, and
determines, from among the swept temporary thresholds and on the basis of the value of the Akaike information criterion calculated from a cross tabulation table created from the first count and the second count, the threshold as a value judged to yield a correlation with the predetermined prediction target related to the health state.

7. The selection device according to claim 1, wherein the predetermined prediction target related to the health state is, for each individual data, the medical cost at a future time a predetermined number of ages ahead.

8. The selection device according to claim 1, wherein the feature vector generation unit generates the feature vector of each individual data by multiplying a vector whose elements are the topic ratios of that individual data in the clustering result by the total number of words of that individual data as given in the bag-of-words format.
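The feature vector scaling and the group assignment recited in the claims can be sketched as follows. This is an illustration only: the function names, the dictionary keyed by an individual-data label, and all numeric values are hypothetical, and the thresholds are supplied directly rather than determined by the AIC-based sweep.

```python
from typing import Dict, List, Tuple


def feature_vector(topic_ratios: List[float], total_words: int) -> List[float]:
    """Scale the topic ratios of one individual data by its total word count."""
    return [ratio * total_words for ratio in topic_ratios]


def split_groups(features: Dict[str, List[float]],
                 thresholds: List[float]) -> Tuple[List[str], List[str]]:
    """Assign each individual data to the predictable or the unpredictable group.

    An individual data is predictable if at least one element of its feature
    vector is equal to or greater than the threshold for that element.
    """
    predictable, unpredictable = [], []
    for key, vector in features.items():
        if any(value >= limit for value, limit in zip(vector, thresholds)):
            predictable.append(key)
        else:
            unpredictable.append(key)
    return predictable, unpredictable


# Made-up topic ratios and word counts for two individual data.
features = {
    "X1": feature_vector([0.5, 0.5], total_words=8),    # [4.0, 4.0]
    "X2": feature_vector([0.25, 0.75], total_words=4),  # [1.0, 3.0]
}
predictable, unpredictable = split_groups(features, thresholds=[5.0, 3.5])
```

In this example "X1" reaches the threshold on its second element and is output as predictable, while "X2" reaches neither threshold and would be passed on as unpredictable.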