JP2007052507A

JP2007052507A - Biological information processor, biological information processing method, and biological information processing program

Info

Publication number: JP2007052507A
Application number: JP2005235562A
Authority: JP
Inventors: Takeshi Kato; 毅加藤; Ko Fujifuchi; 航藤渕
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2005-08-15
Filing date: 2005-08-15
Publication date: 2007-03-01

Abstract

PROBLEM TO BE SOLVED: To provide a technique that yields highly valid predictions and classification criteria when classifying biological information into three or more classes.

SOLUTION: A multi-margin support vector machine (MM-SVM) partitions the input space into three or more regions by generating two or more mutually parallel separating surfaces in the input space. This reduces the risk of "overlearning" or "overfitting" to which conventional support vector machines are prone.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a biological information processing apparatus, a biological information processing method, and a biological information processing program.

The most basic problem in pattern recognition is to develop a classifier that determines, from an input vector obtained by measuring an unknown object, which class the object belongs to. To do so, the probabilistic correspondence between input vectors and classes must be learned as knowledge from training samples whose class membership is known. To classify an unknown object, a method must be specified for estimating (deciding) which class it belongs to using the learned probabilistic knowledge, and it is desirable to make the probability of misclassification as small as possible. In the ideal case where the probabilistic correspondence between input vectors and classes is completely known, a theoretically optimal classification method (the Bayes classification method) exists. In real pattern recognition problems, however, the probabilistic correspondence between feature vectors and classes is rarely known completely, so it must be learned from training data.
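As an illustration of the ideal case mentioned above, the following minimal sketch applies the Bayes decision rule when the class priors and class-conditional probabilities are assumed to be fully known. The function name `bayes_classify`, the class labels, and all probability values are hypothetical and are not taken from this publication.

```python
def bayes_classify(x, priors, likelihoods):
    """Return the class with the largest posterior P(class | x).

    priors:      dict mapping class -> P(class)
    likelihoods: dict mapping class -> dict mapping observation -> P(x | class)
    """
    # The evidence P(x) is identical for every class, so comparing
    # prior * likelihood is enough to find the maximum posterior.
    scores = {c: priors[c] * likelihoods[c].get(x, 0.0) for c in priors}
    return max(scores, key=scores.get)


# Hypothetical, fully known distributions for one discrete measurement.
priors = {"A": 0.5, "B": 0.3, "C": 0.2}
likelihoods = {
    "A": {"low": 0.7, "mid": 0.2, "high": 0.1},
    "B": {"low": 0.2, "mid": 0.6, "high": 0.2},
    "C": {"low": 0.1, "mid": 0.2, "high": 0.7},
}

for observation in ("low", "mid", "high"):
    print(observation, "->", bayes_classify(observation, priors, likelihoods))
```

In practice the tables passed as `priors` and `likelihoods` are not available and have to be estimated from training data, which is the situation the remainder of this description addresses.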

A pattern recognition technique called the support vector machine (SVM) has recently attracted attention as one such technique. The support vector machine is a method for constructing a two-class pattern classifier using a linear threshold element.

The support vector machine, extended by a method called the kernel trick so that it can construct nonlinear discriminant functions, is considered one of the learning models with the best recognition performance among currently known methods. In general, for a classifier trained by a kernel learning method to achieve high classification performance on unseen data not included in the training samples, measures to improve its generalization ability are required.
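The kernel trick referred to above can be illustrated with a small numerical check. The sketch below assumes a homogeneous degree-2 polynomial kernel on two-dimensional inputs, chosen only for illustration (the publication does not name a kernel here), and shows that evaluating the kernel in the input space gives the same value as an inner product after an explicit nonlinear map. The names `phi`, `dot`, and `poly2_kernel` are hypothetical.

```python
import math


def dot(u, v):
    """Ordinary inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x, y):
    """Kernel value computed directly in the input space: (x . y)^2."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
# Both expressions give the inner product in the 3-D feature space,
# but the kernel never constructs the feature vectors.
print(poly2_kernel(x, y), dot(phi(x), phi(y)))
```

Because only kernel values are needed, a linear method can operate in the mapped space without ever forming the mapped vectors explicitly.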

The support vector machine achieves excellent recognition performance because it incorporates a device for obtaining high classification performance (generalization performance) on unseen data: it learns the parameters of the linear threshold element from the training samples under the criterion of "margin maximization". However, the support vector machine is fundamentally a learning method for constructing a classifier that discriminates between two classes, so constructing a multi-class classifier, for example for biological information, requires a workaround such as combining multiple support vector machines.
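To make the "margin maximization" criterion concrete, the following sketch computes the geometric margin of a candidate hyperplane w·x + b = 0 over a labeled two-class sample, i.e. the smallest signed distance from any sample to the hyperplane. It only evaluates given candidates and does not solve the SVM optimization problem; the function name `geometric_margin` and the toy data are assumptions made for illustration.

```python
import math


def geometric_margin(w, b, samples):
    """Smallest signed distance y * (w . x + b) / ||w|| over all samples.

    samples is a list of (x, y) pairs with label y in {+1, -1}.
    A positive result means every sample lies on its correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in samples
    )


# Toy two-class data: class -1 near x1 = 0, class +1 near x1 = 2.
samples = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((2.0, 0.0), +1), ((2.0, 1.0), +1)]

centered = geometric_margin((1.0, 0.0), -1.0, samples)   # plane x1 = 1
shifted = geometric_margin((1.0, 0.0), -0.5, samples)    # plane x1 = 0.5
print(centered, shifted)
```

Both candidate planes separate the toy data without error, but the centered plane has the larger margin, which is the one a margin-maximizing learner would prefer.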

Meanwhile, following the completion of the human genome sequence and the development of DNA microarrays, research and development on applying support vector machines to biological information processing has in recent years been actively pursued in the technical field known as bioinformatics.

A conventional technique for the diagnosis, prognosis, and identification of candidate therapeutic targets of multiple myeloma based on gene expression profiling is described in Patent Document 1. The technique described in that document includes the steps of obtaining gene expression data with a DNA microarray, and performing statistical analysis on the data by a method selected from the group consisting of logistic circuits, decision trees, voting ensembles, naive Bayes, Bayesian networks, and support vector machines.

A conventional technique for detecting the state of an occupant is described, for example, in Patent Document 2. In that technique, a plurality of sensor units are arranged in the seat portion on which pressure acts when an occupant is seated, a pressure pattern is measured, the measured pressure pattern data are input to a support vector machine (SVM), and the SVM performs feature extraction and discrimination by an SVM separating surface to classify the seated occupant as an adult or a child.

A conventional technique for determining the state of an animal by acquiring and analyzing biological signals is described, for example, in Patent Document 3. The technique described in that document includes applying, to a feature vector acquired by a predetermined method, a support vector machine classifier trained on reference vectors from a predetermined database that reflects, for each animal species, behavior, intention, and emotion in specific states, thereby determining the animal's emotions and intentions, including its degree of hunger, its degree of tension, the presence or absence of fear, and the urge to defecate.

A conventional technique for computer-based three-dimensional lesion detection is described, for example, in Patent Document 4. The technique described in that document includes discriminating among a set of lesion candidates using at least one of a linear discriminant classifier, a quadratic discriminant classifier, a neural network, and a support vector machine.

A conventional technique relating to molecular signatures of highly lethal cancers is described, for example, in Patent Document 5. The technique described in that document includes a) obtaining, for each of a plurality of members of each class, values for one or more features; b) determining a Wilcoxon rank score for each feature and excluding non-predictive features; and c) ranking the remaining features by predictive accuracy using a support vector machine.

A conventional technique for identifying patterns in biological systems is described, for example, in Patent Document 6. The technique described in that document includes using a support vector machine (SVM) and RFE (recursive feature elimination) to identify patterns useful for medical diagnosis, prognosis, and treatment. SVM-RFE can be used with varied data sets.

Patent Document 1: JP 2005-512557 A
Patent Document 2: JP 2003-344196 A
Patent Document 3: JP 2004-138 A
Patent Document 4: JP 2005-506140 A
Patent Document 5: JP 2005-503779 A
Patent Document 6: JP 2005-502097 A

However, the conventional techniques described in the above documents leave room for improvement in the following respects when applied to biological information processing.

In general, a common challenge in applying learning machines such as support vector machines to biological information processing is overcoming the risk of "overlearning" or "overfitting".

However, the support vector machine is fundamentally a learning method for constructing a classifier that discriminates between two classes, so constructing a multi-class classifier, for example for classifying biological information, requires a workaround such as combining multiple support vector machines. Consequently, when input vectors or feature vectors are classified into three or more classes, each of the multiple support vector machines may "overlearn" or "overfit" the training patterns, and the overall risk of "overlearning" or "overfitting" tends to increase.

As also described in paragraph [0088] of Patent Document 6, this risk of "overlearning" or "overfitting" tends to arise when the number of dimensions of the input vector or feature vector is large, as with the thousands of genes studied on a microarray, and the number of training patterns is comparatively small, as with a few dozen patients.

In this respect, the techniques described in Patent Documents 1 to 5 take no measures to overcome the risk of "overlearning" or "overfitting". Consequently, when these techniques process biological information, for which the number of feature vectors (input vectors) is often comparatively large and the number of training patterns comparatively small, the risk of "overlearning" or "overfitting" tends to be high.

As a result, in the techniques described in Patent Documents 1 to 5, the support vector machine "overlearns" or "overfits" the training patterns, and the validity of predictions when solving general problems concerning biological information was in some cases insufficient. The techniques described in Patent Documents 1 to 5 therefore leave room for further improvement in the validity of predictions or classification criteria when classifying biological information into three or more classes.

In contrast, the technique described in Patent Document 6 addresses, as described in its paragraph [0088], the problem of reducing the dimension of the feature space in order to overcome the risk of "overlearning" or "overfitting".

That is, the technique described in Patent Document 6 uses a support vector machine in combination with RFE (recursive feature elimination) to reduce the dimension of the feature space, which lowers the risk of "overlearning" or "overfitting" to some extent. However, the risk reduction achieved by RFE was not at a fully satisfactory level.

In addition, because the technique described in Patent Document 6 reduces the dimension of the feature space by RFE (recursive feature elimination), some useful information contained in the eliminated dimensions may become unavailable for classification.

Consequently, in the technique described in Patent Document 6, the support vector machine still "overlearns" or "overfits" the training patterns to some extent, and some useful information contained in the eliminated dimensions cannot be used for classification, so the validity of predictions when solving general problems concerning biological information was in some cases insufficient. The technique described in Patent Document 6 therefore leaves room for further improvement in the validity of predictions or classification criteria when classifying biological information into three or more classes.

The present invention has been made in view of the above circumstances, and its object is to provide a technique that yields highly valid predictions or classification criteria when classifying biological information into three or more classes.

According to the present invention, there is provided a biological information processing apparatus for predicting the classification of biological information, comprising: a known information acquisition unit that acquires a plurality of items of known biological information classified into three or more classes; an input vector generation unit that generates, in an input space, a plurality of input vectors containing the plurality of items of biological information acquired by the known information acquisition unit; a separation unit that, based on the plurality of input vectors and the classifications of the items of biological information respectively corresponding to them, generates two or more mutually parallel separating surfaces in the input space and thereby separates the input space into three or more regions respectively corresponding to the three or more classes; an unknown information acquisition unit that acquires biological information whose classification is unknown; an unknown vector generation unit that generates an unknown vector containing the biological information acquired by the unknown information acquisition unit in the input space separated into the three or more regions by the two or more separating surfaces; a prediction determination unit that predicts the classification of the biological information corresponding to the unknown vector based on the region, among the three or more regions of the input space, in which the unknown vector is located; and a predicted classification output unit that outputs the classification of the biological information corresponding to the unknown vector as predicted by the prediction determination unit.

With this configuration, when biological information is classified into three or more classes, the input space is separated by two or more mutually parallel separating surfaces into three or more regions respectively corresponding to the three or more classes. High discriminability can thus be achieved while avoiding "overlearning" or "overfitting", which improves the validity of predictions when classifying biological information into three or more classes.
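The prediction step of this configuration can be sketched as follows, under the assumption that the mutually parallel separating surfaces are hyperplanes w·x = b_k that share one normal vector w and differ only in their thresholds b_k. This section of the publication does not give the decision rule explicitly, so the function name `predict_region`, the tie-breaking convention, and the numerical values below are hypothetical; the training step that would produce w and the thresholds is not shown.

```python
from bisect import bisect_right


def predict_region(w, thresholds, x):
    """Return the index (0 .. K-1) of the region containing x.

    The K-1 separating planes are w . x = b_k for the sorted thresholds b_k.
    They share the normal vector w, so they are mutually parallel.
    A point lying exactly on a plane is assigned to the region above it
    (an arbitrary convention chosen for this sketch).
    """
    score = sum(wi * xi for wi, xi in zip(w, x))
    return bisect_right(thresholds, score)


# Hypothetical trained parameters: two parallel planes, three regions.
w = (1.0, 1.0)
thresholds = [1.0, 3.0]

for point in [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]:
    print(point, "-> region", predict_region(w, thresholds, point))
```

Under this assumption, a K-class problem is described by a single direction w and K-1 scalar thresholds, rather than by a separate weight vector for each of several combined two-class machines.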

According to the present invention, there is provided a biological information processing apparatus for predicting the classification of biological information, comprising: a known information acquisition unit that acquires a plurality of items of known biological information classified into three or more classes; an input vector generation unit that generates, in an input space, a plurality of input vectors containing the plurality of items of biological information acquired by the known information acquisition unit; a transformation unit that transforms the plurality of input vectors in the input space by a nonlinear mapping and thereby generates a plurality of feature vectors, respectively corresponding to the plurality of input vectors, in a feature space of higher dimension than the input space; a separation unit that, based on the plurality of feature vectors and the classifications of the items of biological information respectively corresponding to them, generates two or more mutually parallel separating surfaces in the feature space and thereby separates the feature space into three or more regions respectively corresponding to the three or more classes; an unknown information acquisition unit that acquires biological information whose classification is unknown; an unknown vector generation unit that transforms an unknown vector containing the biological information acquired by the unknown information acquisition unit by the nonlinear mapping and thereby generates a transformed unknown vector, corresponding to the unknown vector, in the feature space separated into the three or more regions by the two or more separating surfaces; a prediction determination unit that predicts the classification of the biological information corresponding to the transformed unknown vector based on the region, among the three or more regions of the feature space, in which the transformed unknown vector is located; and a predicted classification output unit that outputs the classification of the biological information corresponding to the transformed unknown vector as predicted by the prediction determination unit.

With this configuration, when biological information is classified into three or more classes, a plurality of feature vectors respectively corresponding to the plurality of input vectors are generated in a feature space of higher dimension than the input space, and the feature space is then separated by two or more mutually parallel separating surfaces into three or more regions respectively corresponding to the three or more classes. High discriminability can thus be achieved while avoiding "overlearning" or "overfitting", which improves the validity of predictions when classifying biological information into three or more classes.
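A feature-space variant of the prediction step might look like the sketch below. It assumes that the shared normal vector of the parallel separating surfaces is a weighted sum of mapped training vectors, so that the score of a new vector can be evaluated through a kernel without forming the feature vectors. The Gaussian (RBF) kernel, the coefficients `alphas`, the stored vectors `support`, and the thresholds are all hypothetical values chosen for illustration, not parameters given in this publication.

```python
import math
from bisect import bisect_right


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_score(x, support, alphas, kernel=rbf_kernel):
    """Score f(x) = sum_i alpha_i * k(x_i, x).

    This equals the inner product of the mapped x with the normal vector
    sum_i alpha_i * phi(x_i) shared by all separating surfaces.
    """
    return sum(a * kernel(s, x) for s, a in zip(support, alphas))


def predict_region(x, support, alphas, thresholds, kernel=rbf_kernel):
    """Index of the feature-space region, bounded by f = b_k, containing x."""
    return bisect_right(thresholds, kernel_score(x, support, alphas, kernel))


# Hypothetical parameters: three stored 1-D vectors, two thresholds.
support = [(0.0,), (2.0,), (4.0,)]
alphas = [-1.0, 0.0, 1.0]
thresholds = [-0.5, 0.5]

for point in [(0.0,), (2.0,), (4.0,)]:
    print(point, "-> region", predict_region(point, support, alphas, thresholds))
```

The surfaces f = b_k are parallel hyperplanes in the feature space, while in the input space the same level sets are generally curved.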

According to the present invention, there is provided a biological information processing apparatus for predicting the classification of biological information, comprising: a known information acquisition unit that acquires a plurality of items of known biological information classified into three or more classes; an input vector generation unit that generates, in an input space, a plurality of input vectors containing the plurality of items of biological information acquired by the known information acquisition unit; a transformation unit that transforms the plurality of input vectors in the input space by a nonlinear mapping and thereby generates a plurality of feature vectors, respectively corresponding to the plurality of input vectors, in a feature space of higher dimension than the input space; a separation unit that, based on the plurality of feature vectors and the classifications of the items of biological information respectively corresponding to them, generates two or more mutually parallel separating surfaces in the feature space and thereby separates the feature space into three or more regions respectively corresponding to the three or more classes; an inverse transformation unit that inversely transforms the two or more separating surfaces generated by the separation unit through the nonlinear mapping and thereby generates, in the input space, two or more separating surfaces that separate the input space into three or more regions; an unknown information acquisition unit that acquires biological information whose classification is unknown; an unknown vector generation unit that generates an unknown vector containing the biological information acquired by the unknown information acquisition unit in the input space separated into the three or more regions by the two or more separating surfaces; a prediction determination unit that predicts the classification of the biological information corresponding to the unknown vector based on the region, among the three or more regions of the input space, in which the unknown vector is located; and a predicted classification output unit that outputs the classification of the biological information corresponding to the unknown vector as predicted by the prediction determination unit.

With this configuration, when biological information is classified into three or more classes, a plurality of feature vectors respectively corresponding to the plurality of input vectors are generated in a feature space of higher dimension than the input space, the feature space is separated by two or more mutually parallel separating surfaces into three or more regions respectively corresponding to the three or more classes, and the generated separating surfaces are then inversely transformed through the nonlinear mapping to generate, in the input space, two or more separating surfaces that separate the input space into three or more regions. High discriminability can thus be achieved while avoiding "overlearning" or "overfitting", which improves the validity of predictions when classifying biological information into three or more classes.

According to the present invention, there is provided a biological information processing apparatus that generates a classification criterion for classifying biological information, comprising: a known information acquisition unit that acquires a plurality of items of known biological information classified into three or more classes; an input vector generation unit that generates, in an input space, a plurality of input vectors containing the plurality of items of biological information acquired by the known information acquisition unit; a separation unit that, based on the plurality of input vectors and the classifications of the items of biological information respectively corresponding to them, generates two or more mutually parallel separating surfaces in the input space and thereby separates the input space into three or more regions respectively corresponding to the three or more classes; and a classification criterion output unit that outputs a classification criterion containing information that defines the two or more separating surfaces generated by the separation unit.

With this configuration, when biological information is classified into three or more classes, the input space is separated by two or more mutually parallel separating surfaces into three or more regions respectively corresponding to the three or more classes. High discriminability can thus be achieved while avoiding "overlearning" or "overfitting", which improves the validity of the classification criterion used when classifying biological information into three or more classes.

According to the present invention, there is provided a biological information processing apparatus that generates a classification criterion for classifying biological information, comprising: a known information acquisition unit that acquires a plurality of items of known biological information classified into three or more classes; an input vector generation unit that generates, in an input space, a plurality of input vectors containing the plurality of items of biological information acquired by the known information acquisition unit; a transformation unit that transforms the plurality of input vectors in the input space by a nonlinear mapping and thereby generates a plurality of feature vectors, respectively corresponding to the plurality of input vectors, in a feature space of higher dimension than the input space; a separation unit that, based on the plurality of feature vectors and the classifications of the items of biological information respectively corresponding to them, generates two or more mutually parallel separating surfaces in the feature space and thereby separates the feature space into three or more regions respectively corresponding to the three or more classes; and a classification criterion output unit that outputs a classification criterion containing information that defines the two or more separating surfaces generated by the separation unit.

With this configuration, when biological information is classified into three or more classes, a plurality of feature vectors respectively corresponding to the plurality of input vectors are generated in a feature space of higher dimension than the input space, and the feature space is then separated by two or more mutually parallel separating surfaces into three or more regions respectively corresponding to the three or more classes. High discriminability can thus be achieved while avoiding "overlearning" or "overfitting", which improves the validity of the classification criterion used when classifying biological information into three or more classes.

According to the present invention, there is provided a biological information processing apparatus that generates a classification criterion for classifying biological information, comprising: a known information acquisition unit that acquires a plurality of items of known biological information classified into three or more classes; an input vector generation unit that generates, in an input space, a plurality of input vectors containing the plurality of items of biological information acquired by the known information acquisition unit; a transformation unit that transforms the plurality of input vectors in the input space by a nonlinear mapping and thereby generates a plurality of feature vectors, respectively corresponding to the plurality of input vectors, in a feature space of higher dimension than the input space; a separation unit that, based on the plurality of feature vectors and the classifications of the items of biological information respectively corresponding to them, generates two or more mutually parallel separating surfaces in the feature space and thereby separates the feature space into three or more regions respectively corresponding to the three or more classes; an inverse transformation unit that inversely transforms the two or more separating surfaces generated by the separation unit through the nonlinear mapping and thereby generates, in the input space, two or more separating surfaces that separate the input space into three or more regions; and a classification criterion output unit that outputs a classification criterion containing information that defines the two or more separating surfaces generated by the inverse transformation unit.

この構成によれば、生物学的情報を３以上のクラスに分類する際に、複数の入力ベクトルにそれぞれ対応する複数の特徴ベクトルを、入力空間よりも高次限な特徴空間内に生成した上で、２以上の互いに平行な分離面により特徴空間を３以上のクラスにそれぞれ対応する３以上の領域に分離し、さらに生成された２以上の分離面を非線形写像により逆変換して、入力空間を３以上の領域に分離する２以上の分離面を前記入力空間内に生成することにより、「過学習」または「過剰適合」を回避しつつ、高い識別性を実現することができるため、生物学的情報を３以上のクラスに分類する際における分類基準の妥当性を向上できる。 According to this configuration, when biological information is classified into three or more classes, a plurality of feature vectors respectively corresponding to a plurality of input vectors are generated in a feature space of higher dimension than the input space, and the feature space is separated by two or more mutually parallel separation planes into three or more regions respectively corresponding to the three or more classes. The two or more separation planes thus generated are then inversely transformed through the nonlinear mapping, so that two or more separation surfaces that separate the input space into three or more regions are generated in the input space. This achieves high discriminability while avoiding “over-learning” or “over-fitting”, so the validity of the classification criterion when classifying biological information into three or more classes can be improved.

なお、上記の装置は本発明の一態様であり、本発明の装置は、以上の構成要素の任意の組合せであってもよい。また、本発明の方法、システム、コンピュータプログラム、記録媒体なども、同様の構成を有する。 Note that the above-described device is one embodiment of the present invention, and the device of the present invention may be any combination of the above components. The method, system, computer program, recording medium, etc. of the present invention have the same configuration.

本発明によれば、２以上の互いに平行な分離面により入力空間または特徴空間を３以上のクラスにそれぞれ対応する３以上の領域に分離するため、生物学的情報を３以上のクラスに分類する際における予測または分類基準の妥当性を向上できる。 According to the present invention, biological information is classified into three or more classes in order to separate the input space or feature space into three or more regions respectively corresponding to three or more classes by two or more parallel separation surfaces. Improve the validity of prediction or classification criteria.

本発明において、上述の変換部は、半正定置性を満たすカーネル関数を用いて、入力ベクトルを特徴ベクトルに変換した場合の計算をするように構成することができる。 In the present invention, the above-described conversion unit can be configured to use a kernel function that satisfies positive semidefiniteness to perform the computation that corresponds to converting the input vectors into feature vectors.

この構成によれば、半正定置性を満たすカーネル関数を用いて、入力ベクトルを特徴ベクトルに変換した場合の計算をすることにより、線形分離することが困難な入力ベクトルを、線形分離することが可能な特徴ベクトルに変換することができるため、生物学的情報を３以上のクラスに分類する際における予測または分類基準の汎化性を向上できる。 According to this configuration, performing the computation that corresponds to converting the input vectors into feature vectors with a kernel function that satisfies positive semidefiniteness makes it possible to convert input vectors that are difficult to separate linearly into feature vectors that can be separated linearly. The generalization of the prediction or classification criterion when classifying biological information into three or more classes can therefore be improved.

本発明において、カーネル関数は、線形カーネル関数、多項式カーネル関数およびＲＢＦカーネル関数よりなる群から選ばれる１種以上のカーネル関数であってもよい。 In the present invention, the kernel function may be one or more kernel functions selected from the group consisting of a linear kernel function, a polynomial kernel function, and an RBF kernel function.

この構成によれば、線形カーネル関数、多項式カーネル関数およびＲＢＦカーネル関数よりなる群から選ばれる１種以上のカーネル関数は、いずれも半正定置性を満たすカーネル関数であるため、生物学的情報を３以上のクラスに分類する際における予測または分類基準の汎化性を向上できる。 According to this configuration, the one or more kernel functions selected from the group consisting of a linear kernel function, a polynomial kernel function, and an RBF kernel function all satisfy positive semidefiniteness, so the generalization of the prediction or classification criterion when classifying biological information into three or more classes can be improved.
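The three kernel families named above can be sketched as follows, together with a numerical check that each Gram matrix is positive semidefinite. This is an illustrative sketch only: the function names, the parameter values (`degree`, `coef0`, `gamma`), and the random sample data are assumptions introduced here and do not come from the patent.

```python
import numpy as np

def linear_kernel(X, Y):
    # k(x, y) = x^T y
    return X @ Y.T

def polynomial_kernel(X, Y, degree=2, coef0=1.0):
    # k(x, y) = (x^T y + coef0)^degree, with coef0 >= 0 (assumed values)
    return (X @ Y.T + coef0) ** degree

def rbf_kernel(X, Y, gamma=0.5):
    # k(x, y) = exp(-gamma * ||x - y||^2)
    sq_dist = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def is_positive_semidefinite(K, tol=1e-8):
    # A symmetric matrix is positive semidefinite when all eigenvalues are >= 0
    # (a small tolerance absorbs floating-point error).
    eigenvalues = np.linalg.eigvalsh((K + K.T) / 2.0)
    return bool(eigenvalues.min() >= -tol)

# Hypothetical data: 6 samples with 3 feature quantities each.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
gram_matrices = {
    "linear": linear_kernel(X, X),
    "polynomial": polynomial_kernel(X, X),
    "rbf": rbf_kernel(X, X),
}
psd_flags = {name: is_positive_semidefinite(K) for name, K in gram_matrices.items()}
```

Each Gram matrix plays the role of the inner products between feature vectors, so the feature vectors themselves never have to be computed explicitly.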

本発明によれば、線形分離部は、サポートベクトルマシンを用いて、２以上の互いに平行な分離面を生成するように構成することができる。 According to the present invention, the linear separation unit can be configured to generate two or more parallel separation planes using a support vector machine.

この構成によれば、サポートベクトルマシンは、高い識別性能を有するため、上述の各種の工夫と相俟って、生物学的情報を３以上のクラスに分類する際における予測または分類基準の識別性を向上できる。 According to this configuration, since the support vector machine has high discrimination performance, it can, in combination with the various measures described above, improve the discriminability of the prediction or classification criterion when classifying biological information into three or more classes.

本発明によれば、サポートベクトルマシンは、２以上の互いに平行な分離面のうち、マージン幅が最小である分離面のマージン幅を最大化するように構成することができる。 According to the present invention, the support vector machine can be configured to maximize the margin width of the separation surface having the smallest margin width among the two or more parallel separation surfaces.

この構成によれば、サポートベクトルマシンは、なるべく余裕を持って入力ベクトルまたは特徴ベクトルを分離することになるため、生物学的情報を３以上のクラスに分類する際における予測または分類基準の識別性および汎化性を向上できる。 According to this configuration, the support vector machine separates the input vectors or feature vectors with as much margin as possible, so the discriminability and generalization of the prediction or classification criterion when classifying biological information into three or more classes can be improved.

本発明によれば、分離部は、分離面を、マージン幅の中央に位置するように配置するように構成することができる。 According to the present invention, the separation unit can be configured to arrange the separation surface so as to be positioned at the center of the margin width.

この構成によれば、サポートベクトルマシンは、分離面を一意に決定することができるため、生物学的情報を３以上のクラスに分類する際における予測または分類基準の汎化性を向上できる。 According to this configuration, since the support vector machine can uniquely determine the separation plane, it is possible to improve generalization of prediction or classification criteria when classifying biological information into three or more classes.

本発明によれば、サポートベクトルマシンは、ソフトマージン法により拡張されたサポートベクトルマシンであってもよい。 According to the present invention, the support vector machine may be a support vector machine extended by a soft margin method.

この構成によれば、サポートベクトルマシンは、入力ベクトルまたは特徴ベクトルを線形分離することが困難な場合にも、ソフトマージン法により線形分離することが可能になるため、生物学的情報を３以上のクラスに分類する際における予測または分類基準の汎化性を向上できる。 According to this configuration, the support vector machine can linearly separate the input vector or the feature vector by the soft margin method even when it is difficult to linearly separate the input vector or the feature vector. The generalization of prediction or classification criteria when classifying into classes can be improved.

本発明において、２以上の互いに平行な分離面は、次式ｂ＝ｗ^Ｔｘ（この式で、ｘは入力ベクトルまたは特徴ベクトルであり、ｗはパラメータベクトルであり、ｂはバイアスであり、Ｔは転置を示す算術記号である）で表されるスコア関数により規定され、ｗは、前記２以上の分離面で同一であるパラメータベクトルであり、ｂは、前記２以上の分離面で互いに異なる値をとるバイアスであってもよい。 In the present invention, the two or more mutually parallel separation planes may be defined by a score function expressed by the equation b = w^T x (where x is an input vector or feature vector, w is a parameter vector, b is a bias, and T denotes transposition), in which w is a parameter vector that is identical for the two or more separation planes and b is a bias that takes a different value for each of the two or more separation planes.

この構成によれば、ｂ＝ｗ^Ｔｘで表される２以上の分離面は、互いに平行となり、「過学習」または「過剰適合」を回避しつつ、高い識別性を実現することができるので、生物学的情報を３以上のクラスに分類する際における予測または分類基準の妥当性を向上できる。 According to this configuration, two or more separation planes represented by b = w ^T x are parallel to each other, and high discrimination can be realized while avoiding “overlearning” or “overfitting”. The validity of prediction or classification criteria when classifying biological information into three or more classes can be improved.
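A minimal sketch of how a shared parameter vector w and several biases partition a space into ordered regions is given below. The function name `predict_classes`, the numbering of classes from 1, and the toy values of w and the biases are assumptions made for illustration; the patent does not prescribe this code.

```python
import numpy as np

def predict_classes(X, w, thresholds):
    """Assign each row of X to one of len(thresholds) + 1 ordered classes.

    All separation planes share the same w (so they are parallel) and differ
    only in their bias b_k. A sample with score w^T x below b_1 falls in
    class 1, between b_1 and b_2 in class 2, and so on.
    """
    thresholds = np.sort(np.asarray(thresholds, dtype=float))
    scores = np.asarray(X, dtype=float) @ np.asarray(w, dtype=float)
    return np.searchsorted(thresholds, scores, side="right") + 1

# Hypothetical example: two parallel planes in a 2-dimensional space.
w = np.array([1.0, 1.0])
thresholds = [0.0, 2.0]          # b_1 = 0, b_2 = 2
X = np.array([[-1.0, 0.0],       # score -1
              [0.5, 0.5],        # score  1
              [2.0, 1.0]])       # score  3
classes = predict_classes(X, w, thresholds)
```

Because only the scalar score matters, adding a class adds one bias rather than a whole new parameter vector.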

以下、本発明の実施の形態について、図面を用いて説明する。尚、すべての図面において、同様な構成要素には同様の符号を付し、適宜説明を省略する。 Hereinafter, embodiments of the present invention will be described with reference to the drawings. In all the drawings, the same reference numerals are given to the same components, and the description will be omitted as appropriate.

＜実施形態１＞ <Embodiment 1>
図１は、生物学的情報処理装置１００の構成の概要を示した機能ブロック図である。生物学的情報処理装置１００は、分類が未知である生物学的情報の分類を予測するための装置である。あるいは、生物学的情報処理装置１００は、分類が未知である生物学的情報を分類するための分類基準を生成する装置でもある。 FIG. 1 is a functional block diagram showing an outline of the configuration of the biological information processing apparatus 100. The biological information processing apparatus 100 is an apparatus for predicting the classification of biological information whose classification is unknown. Alternatively, the biological information processing apparatus 100 is also an apparatus that generates a classification standard for classifying biological information whose classification is unknown.

生物学的情報処理装置１００には、既知情報取得部１０２が設けられている。既知情報取得部１０２は、図２に示すように、３以上のクラスにあらかじめ分類されている複数の既知の生物学的情報を外部から取得する。 The biological information processing apparatus 100 is provided with a known information acquisition unit 102. As shown in FIG. 2, the known information acquisition unit 102 acquires a plurality of known biological information classified in advance into three or more classes from the outside.

図２では、既知情報取得部１０２が外部から取得する入力ファイル（分類が既知の生物学的情報）の例が記載されている。この例では、既知情報取得部１０２は、個々の患者から取得されたサンプルごとに、所定の（薬剤耐性、疾病度、生存率、遺伝子名、白血球量、血圧などの）順序付き医療観測データを取得する。 FIG. 2 shows an example of an input file (biological information whose classification is known) that the known information acquisition unit 102 acquires from the outside. In this example, the known information acquisition unit 102 acquires, for each sample obtained from an individual patient, predetermined ordered medical observation data (such as drug resistance, disease level, survival rate, gene name, white blood cell volume, and blood pressure).

これらの医療観測データは、医師、歯科医師、看護師、臨床検査技師、臨床受託会社の技術者、生物学分野の研究者などにより、３以上の複数のクラスにあらかじめ分類されている。図２の例では、第１段階、第２段階、・・・第Ｍ段階と、Ｍ個のクラスにあらかじめ分類されている。例えば、図２の例では、抗癌剤を研究する研究者が、抗癌剤耐性について、低、中、高と分類することができる。 These medical observation data are classified in advance into a plurality of three or more classes by doctors, dentists, nurses, clinical technologists, engineers of clinical contractors, researchers in the field of biology, and the like. In the example of FIG. 2, the first stage, the second stage,... The Mth stage, and M classes are classified in advance. For example, in the example of FIG. 2, a researcher who studies an anticancer drug can be classified into low, medium, and high for anticancer drug resistance.

また、生物学的情報処理装置１００には、特徴量抽出部１１８が設けられている。特徴量抽出部１１８は、入力ファイルに含まれる個々の患者から取得されたサンプルごとの分類が既知の生物学的情報から、所定の抽出基準に基づいて（薬剤耐性、疾病度、生存率、遺伝子名、白血球量、血圧などの）特徴量を抽出する。例えば、所定の抽出基準で上位ｄ個の特徴量１、特徴量２、・・・特徴量Ｄを抽出することができる。この場合、それぞれの患者から取得されたサンプルの生物学的情報について、抽出される特徴量の種類は同じである。一方、それぞれの特徴量の値については、それぞれの患者から取得されたサンプルの生物学的情報ごとに異なる。 In addition, the biological information processing apparatus 100 is provided with a feature amount extraction unit 118. The feature quantity extraction unit 118 is based on predetermined extraction criteria (drug resistance, morbidity, survival rate, gene, etc.) from biological information whose classification for each sample acquired from each patient included in the input file is known. Extract feature quantities (such as name, white blood cell volume, blood pressure). For example, the top d feature quantity 1, feature quantity 2,... Feature quantity D can be extracted based on a predetermined extraction criterion. In this case, the types of extracted feature quantities are the same for the biological information of the samples obtained from each patient. On the other hand, the value of each feature amount differs for each biological information of the sample acquired from each patient.

特徴量抽出部１１８で用いられる特徴量抽出法としては、例えば、一元配置分散分析のＦ検定またはｔ検定、二群比較によるｔ検定（等分散、不等分散）、各段階に値が存在する場合には相関係数などが使用可能である。 The feature quantity extraction method used in the feature quantity extraction unit 118 includes, for example, a one-way analysis of variance F-test or t-test, two-group comparison t-test (equal variance, unequal variance), and values at each stage. In some cases, a correlation coefficient or the like can be used.
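As one way to picture the extraction of the top d feature quantities, the sketch below ranks features by the one-way ANOVA F statistic computed directly with NumPy. The helper names, the toy data, and the choice of ranking by F value alone are assumptions for illustration; the patent lists the F-test only as one of several usable methods and does not specify this procedure.

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F statistic for each column of X, grouped by class label y."""
    classes = np.unique(y)
    n_samples, n_classes = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        group = X[y == c]
        group_mean = group.mean(axis=0)
        ss_between += group.shape[0] * (group_mean - grand_mean) ** 2
        ss_within += ((group - group_mean) ** 2).sum(axis=0)
    return (ss_between / (n_classes - 1)) / (ss_within / (n_samples - n_classes))

def top_d_features(X, y, d):
    """Indices of the d features with the largest F statistic."""
    return np.argsort(anova_f_scores(X, y))[::-1][:d]

# Hypothetical data: feature 0 tracks the class, feature 1 does not.
X = np.array([[0.0, 1.0], [0.1, 0.0],
              [1.0, 1.0], [1.1, 0.0],
              [2.0, 1.0], [2.1, 0.0]])
y = np.array([1, 1, 2, 2, 3, 3])
scores = anova_f_scores(X, y)
selected = top_d_features(X, y, d=1)
```

The same selected indices would then be applied to both the known and the unknown samples, so that every input vector contains the same kinds of feature quantities.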

特徴量抽出部１１８で特徴量を抽出された分類既知の生物情報は、既知情報記憶部１０４に格納され、分類基準生成部１０６に送られる。分類基準生成部１０６は、後述する図３に示すように、取得した分類既知の生物情報に基づいて、例えば、本発明者の開発したマルチマージンサポートベクトルマシンなどの学習機械を用いて、生物学的情報の分類を予測するための分類基準を生成する。 The biological information with known classification from which the feature quantity is extracted by the feature quantity extraction unit 118 is stored in the known information storage unit 104 and sent to the classification reference generation unit 106. As shown in FIG. 3 to be described later, the classification reference generation unit 106 uses, for example, a learning machine such as a multi-margin support vector machine developed by the present inventor based on the acquired biological information with known classification. Generating a classification criterion for predicting the classification of target information.

図３は、既存のサポートベクトルマシンと、分離部２１４にて用いられる本実施形態のマルチマージンサポートベクトルマシンとの機能の違いを説明するための概念図である。既存のサポートベクトルマシン（ＳＶＭ）では、所定の次元のベクトル（ｄ次元ベクトル、すなわちｄ個の特徴量の値を含むベクトル）を２クラスに分類する。一方、本発明者の開発したマルチマージンサポートベクトルマシン（ＭＭ−ＳＶＭ）は、所定の次元のベクトルを３以上のクラス（Ｍ段階）に分類する。なお、Ｍは３以上の整数である。また、マルチマージンサポートベクトルマシンについては、図６〜図１１において後述する。 FIG. 3 is a conceptual diagram for explaining the difference in function between an existing support vector machine and the multi-margin support vector machine of the present embodiment used in the separation unit 214. An existing support vector machine (SVM) classifies vectors of a predetermined dimension (d-dimensional vectors, that is, vectors containing the values of d feature quantities) into two classes. In contrast, the multi-margin support vector machine (MM-SVM) developed by the present inventors classifies vectors of a predetermined dimension into three or more classes (M stages), where M is an integer of 3 or more. The multi-margin support vector machine is described later with reference to FIGS. 6 to 11.

図１に戻って、分類基準生成部１０６で生成された分類基準は、分類基準記憶部１０８に格納され、分類予測判定部１１２に送られるか、あるいは出力部１１６に送られて、そのまま外部に出力される。なお、分類基準とは、後述するマルチマージンサポートベクトルマシンにより生成される分離面そのものであってもよく、分離面を規定するためのパラメータの組合せなどであってもよい。 Returning to FIG. 1, the classification criterion generated by the classification reference generation unit 106 is stored in the classification standard storage unit 108 and is either sent to the classification prediction determination unit 112 or sent to the output unit 116 and output to the outside as is. The classification criterion may be the separation planes themselves generated by the multi-margin support vector machine described later, or a combination of parameters that define the separation planes.

一方、生物学的情報処理装置１００には、既知情報取得部１０２とは別に、未知情報取得部１１０が設けられている。未知情報取得部１１０は、分類が未知である生物学的情報を外部から取得する。 On the other hand, the biological information processing apparatus 100 is provided with an unknown information acquisition unit 110 in addition to the known information acquisition unit 102. The unknown information acquisition unit 110 acquires biological information whose classification is unknown from the outside.

未知情報取得部１１０が外部から取得する入力ファイル（分類が未知の生物学的情報）は、例えば、個々の患者から取得されたサンプルごとに、所定の（薬剤耐性、疾病度、生存率、遺伝子名、白血球量、血圧などの）順序付き医療観測データである。もっとも、図２の例とは異なり、これらの医療観測データは、医師、歯科医師、看護師、臨床検査技師、臨床受託会社の技術者、生物学分野の研究者などにより、３以上の複数のクラスにあらかじめ分類されてはいない。 The input file (biological information whose classification is unknown) acquired from the outside by the unknown information acquisition unit 110 is, for example, a predetermined (drug resistance, disease level, survival rate, gene, for each sample acquired from each patient) Ordered medical observation data (name, white blood cell volume, blood pressure, etc.). However, unlike the example of FIG. 2, these medical observation data are obtained by doctors, dentists, nurses, clinical technologists, clinical contractor engineers, biology researchers, etc. They are not pre-classified into classes.

特徴量抽出部１１８は、入力ファイルに含まれる個々の患者から取得されたサンプルごとの分類が未知の生物学的情報から、所定の抽出基準に基づいて（薬剤耐性、疾病度、生存率、遺伝子名、白血球量、血圧などの）特徴量を抽出する。例えば、所定の抽出基準で上位ｄ個の特徴量１、特徴量２、・・・特徴量Ｄを抽出することができる。この場合、分類が既知の生物学的情報の場合と、抽出される特徴量の種類は同じである。一方、それぞれの特徴量の値については、それぞれの患者から取得されたサンプルの生物学的情報ごとに異なる。特徴量抽出部１１８で特徴量を抽出された分類未知の生物情報は、分類予測判定部１１２に送られる。 The feature quantity extraction unit 118 is based on predetermined extraction criteria (drug resistance, morbidity, survival rate, gene, etc.) from biological information whose classification for each sample acquired from each patient included in the input file is unknown. Extract feature quantities (such as name, white blood cell volume, blood pressure). For example, the top d feature quantity 1, feature quantity 2,... Feature quantity D can be extracted based on a predetermined extraction criterion. In this case, the types of extracted feature quantities are the same as in the case of biological information whose classification is known. On the other hand, the value of each feature amount differs for each biological information of the sample acquired from each patient. The biological information with unknown classification from which the feature amount is extracted by the feature amount extraction unit 118 is sent to the classification prediction determination unit 112.

分類予測判定部１１２は、分類が未知の生物学的情報の分類を予測判定する。分類予測判定部１１２は、分類基準記憶部１０８から取得した分類基準に、特徴量抽出部１１８で特徴量を抽出された分類未知の生物学的情報をあてはめることにより、分類未知の生物学的情報の分類を予測判定する。そして、分類予測判定部１１２により予測判定された分類未知の生物学的情報の分類予測結果は、分類予測記憶部１１４に格納され、出力部１１６に送られる。 The classification prediction determination unit 112 predicts and determines the classification of biological information whose classification is unknown. It does so by applying the classification criterion acquired from the classification standard storage unit 108 to the unclassified biological information whose feature quantities have been extracted by the feature amount extraction unit 118. The classification prediction result determined by the classification prediction determination unit 112 is then stored in the classification prediction storage unit 114 and sent to the output unit 116.

出力部１１６は、分類予測判定部１１２により予測判定された分類未知の生物学的情報の分類予測結果を外部に出力するか、あるいは、分類基準生成部１０６により生成された２以上の分離面を規定する情報を含む分類基準を外部に出力する。 The output unit 116 outputs the classification prediction result of the biological information with unknown classification determined by the classification prediction determination unit 112 to the outside, or outputs two or more separation planes generated by the classification reference generation unit 106. The classification criteria including the information to be defined is output to the outside.

また、生物学的情報処理装置１００には、交差検定部１２０が設けられている。交差検定部１２０は、図２９で後述するように、上述のマルチマージンサポートベクトルマシンの学習データとして用いられた既知情報について、上述の既知情報により生成された分類基準により得られる分類予測の判定結果の交差検定を行う。なお、交差検定の手法の詳細については、図２９において後述するので、ここでは説明を繰り返さない。 In addition, the biological information processing apparatus 100 is provided with a cross validation unit 120. As will be described later with reference to FIG. 29, the cross-validation unit 120 performs classification prediction determination results obtained from the classification criteria generated from the above-described known information for the known information used as the learning data of the above-described multi-margin support vector machine. Perform cross-validation. Note that details of the cross-validation method will be described later with reference to FIG. 29, and therefore description thereof will not be repeated here.

さらに、交差検定部１２０は、交差検定の結果に基づいて、上述の分類基準の推定予測率を算出する。こうして得られた推定予測率は、推定予測率記憶部１２２に格納され、出力部１１６により出力されて、生物学的情報処理装置１００のユーザに提示される。 Further, the cross validation unit 120 calculates the estimated prediction rate of the above-described classification standard based on the result of the cross validation. The estimated prediction rate obtained in this way is stored in the estimated prediction rate storage unit 122, output by the output unit 116, and presented to the user of the biological information processing apparatus 100.

生物学的情報処理装置１００では、交差検定部１２０によって学習データの交差検定が行われることにより、上述の分類基準の推定予測率が算出されて出力部１１６を介してユーザに提示される。このため、生物学的情報処理装置１００のユーザは、学習データにより生成された分類基準の推定予測率を把握した上で、分類基準に未知データをあてはめて、信頼性の程度の判明している分類予測判定結果を得ることができる。このため、ユーザが信頼性の低い分類基準による分類予測判定結果を過信する危険性が低減し、研究開発または医療におけるリスクが低減される。 In the biological information processing apparatus 100, the cross validation unit 120 performs cross-validation on the learning data, whereby the estimated prediction rate of the above-described classification criterion is calculated and presented to the user via the output unit 116. The user of the biological information processing apparatus 100 can therefore grasp the estimated prediction rate of the classification criterion generated from the learning data before applying unknown data to that criterion, and so obtains a classification prediction result whose degree of reliability is known. This reduces the danger that the user places too much confidence in a classification prediction result based on an unreliable classification criterion, and reduces risk in research and development or in medical care.
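The idea of estimating a prediction rate from the learning data alone can be sketched with a generic k-fold cross-validation loop. The patent defers the details of its own procedure to FIG. 29, so everything here is an assumption for illustration: the fold scheme, the function names, and in particular the nearest-class-mean classifier, which is only a stand-in for the multi-margin support vector machine.

```python
import numpy as np

def estimated_prediction_rate(X, y, fit, predict, n_folds, seed=0):
    """Fraction of held-out samples predicted correctly over all folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    correct = 0
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = fit(X[train_idx], y[train_idx])       # learn on the other folds
        correct += int((predict(model, X[test_idx]) == y[test_idx]).sum())
    return correct / len(y)

# Placeholder classifier (NOT the MM-SVM): assign to the nearest class mean.
def fit_nearest_mean(X, y):
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_nearest_mean(model, X):
    classes, means = model
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[sq_dist.argmin(axis=1)]

# Hypothetical, well separated one-dimensional data in three ordered classes.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [10.0], [10.1], [10.2]])
y = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
rate = estimated_prediction_rate(X, y, fit_nearest_mean, predict_nearest_mean,
                                 n_folds=len(y))   # leave-one-out
```

Passing `fit` and `predict` as arguments keeps the validation loop independent of the learner, so the same loop could wrap any classifier that exposes these two steps.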

図４は、分類基準生成部１０６の構成の詳細を示した機能ブロック図である。生物学的情報取得部２０２は、既知情報記憶部１０４から分類が既知の生物学的情報を取得し、生物学的情報記憶部２０４に格納する。一方、分類取得部２１０は、その生物学的情報に対応する既知の分類を取得し、分類記憶部２１２に格納する。 FIG. 4 is a functional block diagram illustrating details of the configuration of the classification reference generation unit 106. The biological information acquisition unit 202 acquires biological information whose classification is known from the known information storage unit 104 and stores the biological information in the biological information storage unit 204. On the other hand, the classification acquisition unit 210 acquires a known classification corresponding to the biological information and stores it in the classification storage unit 212.

入力ベクトル生成部２０６は、既知情報記憶部１０４から取得した分類が既知の生物学的情報から、複数の生物学的情報を含む複数の入力ベクトルを、入力空間内に生成する。具体的には、図２に示すように、特徴量抽出部により抽出された上位ｄ個の特徴量を含む個々の患者由来のサンプルデータを、それぞれｄ次元のベクトル化することにより、入力ベクトルを生成する。入力空間は、例えばｄ次元のユークリッド幾何学に基づく空間などが用いられる。そして、入力ベクトル生成部２０６により生成された入力ベクトルは、入力ベクトル記憶部２０８に格納され、分離部２１４に送られる。 The input vector generation unit 206 generates, in the input space, a plurality of input vectors including a plurality of biological information from biological information whose classification is known acquired from the known information storage unit 104. Specifically, as shown in FIG. 2, the sample data derived from each patient including the top d feature quantities extracted by the feature quantity extraction unit is converted into a d-dimensional vector to obtain the input vector. Generate. As the input space, for example, a space based on d-dimensional Euclidean geometry is used. The input vector generated by the input vector generation unit 206 is stored in the input vector storage unit 208 and sent to the separation unit 214.

ここで、サポートベクトルマシンなどを用いるパターン認識を実現するためには、認識対象から何らかの特徴量を計測（抽出）することが必要である。一般には、特徴量は１種類だけではなく、複数の特徴量を計測し、それらを同時に用いることが多い。そのような特徴量は、通常、まとめて入力ベクトルｘ^Ｔ＝（ｘ_１，・・・，ｘ_ｄ）として表される。ここで、ｘ^Ｔは、ベクトルｘの転置を表す。なお、これは入力ベクトルの一例に過ぎない。 Here, in order to realize pattern recognition using a support vector machine or the like, it is necessary to measure (extract) some feature quantity from the recognition target. In general, not just one type of feature quantity but a plurality of feature quantities are measured and used together. Such feature quantities are usually represented collectively as an input vector x^T = (x_1, ..., x_d). Here, x^T denotes the transpose of the vector x. This is merely one example of an input vector.

分離部２１４は、図３に示すように、複数の入力ベクトルと、複数の既知の分類とに基づいて、マルチマージンサポートベクトルマシンにより入力空間を３以上の分類（クラス）にそれぞれ対応する３以上の領域に分離する。マルチマージンサポートベクトルマシンは、２以上の互いに平行な分離面を入力空間内に生成することにより、入力空間を３以上の分類（クラス）にそれぞれ対応する３以上の領域に分離する。そして、分離部２１４により生成された分離面は、分離面記憶部２１６に格納され、分類基準出力部２１８に送られる。 As shown in FIG. 3, the separation unit 214 uses the multi-margin support vector machine to separate the input space into three or more regions respectively corresponding to three or more classifications (classes), based on the plurality of input vectors and the plurality of known classifications. The multi-margin support vector machine does this by generating two or more mutually parallel separation planes in the input space. The separation planes generated by the separation unit 214 are stored in the separation surface storage unit 216 and sent to the classification reference output unit 218.

図５は、分離部２１４の構成の詳細を示した機能ブロック図である。入力ベクトル取得部６０２は、入力ベクトル記憶部２０８から入力ベクトルを取得し、サポートベクトルマシン６０６に送る。分類取得部６０４は、分類記憶部２１２からその入力ベクトルに対応する分類を取得し、サポートベクトルマシン６０６に送る。サポートベクトルマシン６０６は、２以上の互いに平行な分離面を入力空間内に生成することにより、入力ベクトルが配置されている入力空間を３以上の領域に分離する。生成された分離面は、分離面出力部６０８により分離部２１４から出力されて、分離面記憶部２１６に格納される。 FIG. 5 is a functional block diagram illustrating details of the configuration of the separation unit 214. The input vector acquisition unit 602 acquires an input vector from the input vector storage unit 208 and sends it to the support vector machine 606. The classification acquisition unit 604 acquires the classification corresponding to the input vector from the classification storage unit 212 and sends it to the support vector machine 606. The support vector machine 606 separates the input space in which the input vectors are arranged into three or more regions by generating two or more parallel separation surfaces in the input space. The generated separation surface is output from the separation unit 214 by the separation surface output unit 608 and stored in the separation surface storage unit 216.

繰り返しになるが、図３に示すように、既存のサポートベクトルマシン（ＳＶＭ）では、所定の次元のベクトル（ｄ次元ベクトル、すなわちｄ個の特徴量の値を含むベクトル）を２クラスに分類する。一方、本発明者の開発したマルチマージンサポートベクトルマシン（ＭＭ−ＳＶＭ）は、所定の次元のベクトルを３以上のクラス（Ｍ段階）に分類する。なお、Ｍは３以上の整数である。 Again, as shown in FIG. 3, in the existing support vector machine (SVM), a vector of a predetermined dimension (d-dimensional vector, that is, a vector including d feature value values) is classified into two classes. . On the other hand, the multi-margin support vector machine (MM-SVM) developed by the present inventors classifies vectors of a predetermined dimension into three or more classes (M stages). M is an integer of 3 or more.

本発明者の開発したマルチマージンサポートベクトルマシン（ＭＭ−ＳＶＭ）は、２以上の互いに平行な分離面を入力空間内に生成することにより、入力空間を３以上の領域に分離するため、既存のサポートベクトルマシンに発生しやすかった「過学習」または「過剰適合」のリスクを低減することができる。このような「過学習」または「過剰適合」のリスクを低減することができることは、後述するように実験的に検証されている。 The multi-margin support vector machine (MM-SVM) developed by the inventor generates two or more parallel separation planes in the input space, thereby separating the input space into three or more regions. It is possible to reduce the risk of “over-learning” or “over-fitting” that is likely to occur in the support vector machine. It has been experimentally verified that the risk of such “overlearning” or “overfit” can be reduced, as will be described later.

一般に、生物学的情報は、（薬剤耐性、疾病度、生存率などの）順序付きのデータが多いため、それらの各種の生物学的情報を総合して、医師、歯科医師、看護師、臨床検査技師、臨床受託会社の技術者、生物学分野の研究者などにより、状況に応じて個別具体的に判断される分類も、低、中、高などのように、やはり順序付きのデータとなることが多い。このため、本発明者の開発したマルチマージンサポートベクトルマシンのように、２以上の互いに平行な分離面を生成すれば、順序付きのデータを適切に分類できる。 In general, biological information has a lot of ordered data (drug resistance, morbidity, survival rate, etc.), so these various types of biological information can be integrated into doctors, dentists, nurses, clinical Classifications that are specifically determined by laboratory technicians, clinical contractor engineers, biology researchers, etc. according to the situation are also ordered data, such as low, medium, and high. There are many cases. Therefore, if two or more parallel separation planes are generated as in the multi-margin support vector machine developed by the present inventor, ordered data can be appropriately classified.

図４において、分類基準生成部２１８は、分離部２１４により生成された２以上の分離面を規定する情報を含む分類基準を生成する。そして、生成された分類基準は、分類基準記憶部２１８に格納される。なお、分類基準とは、後述するマルチマージンサポートベクトルマシンにより生成される分離面そのものであってもよく、分離面を規定するためのパラメータの組合せなどであってもよい。 In FIG. 4, the classification reference generation unit 218 generates a classification reference including information defining two or more separation planes generated by the separation unit 214. Then, the generated classification standard is stored in the classification standard storage unit 218. The classification standard may be a separation plane itself generated by a multi-margin support vector machine, which will be described later, or a combination of parameters for defining the separation plane.

図６は、既存のサポートベクトルマシンと、実施形態１に用いるマルチマージンサポートベクトルマシンとの学習方法の違いをより詳細に説明するための概念図である。既存のサポートベクトルマシン（ＳＶＭ）は、ニューロンのモデルとして最も単純な線形しきい素子を用いて、２クラスのパターン識別器を構成する手法である。既存のサポートベクトルマシンは、訓練サンプル集合（入力ベクトルの集合）から、「マージン最大化」という基準で線形しきい素子のパラメータを学習する。 FIG. 6 is a conceptual diagram for explaining in detail the difference in learning method between the existing support vector machine and the multi-margin support vector machine used in the first embodiment. An existing support vector machine (SVM) is a method of constructing a two-class pattern classifier using the simplest linear threshold element as a neuron model. The existing support vector machine learns the parameters of the linear threshold element from the training sample set (input vector set) on the basis of “margin maximization”.

ここで、入力ベクトルは、ｘ^Ｔ＝（ｘ_１，・・・，ｘ_ｄ）として表されるものとする。なお、ｘ^Ｔは、ベクトルｘの転置を表す。また、ｄは特徴量の個数である。認識対象のクラスの総数をＭ（３以上の整数）とし、各クラスをＣ_１，Ｃ_２，・・・，Ｃ_Ｍとする。 Here, the input vector is expressed as x^T = (x_1, ..., x_d), where x^T denotes the transpose of the vector x and d is the number of feature quantities. The total number of classes to be recognized is M (an integer of 3 or more), and the classes are denoted C_1, C_2, ..., C_M.

このとき、既存のサポートベクトルマシンで用いる線形しきい素子は、入力ベクトルに対する線形識別関数（スコア関数）である、次式：ｇ（ｘ）＝ｇ（ｘ｜ｗ）＝ｗ^Ｔｘ＝ｗ_１ｘ_１＋・・・＋ｗ_ｄｘ_ｄで表される。この線形しきい素子において、ｇ（ｘ）＝ｂとした場合の識別平面が、入力空間を分離する分離面に相当する。ここで、ｗはシナプス荷重に対応するパラメータであり、ｂはしきい値である。これは、幾何学的には、分離面（識別平面）により、入力空間を２つに分けることに相当する。 In this case, the linear threshold element used in an existing support vector machine is represented by a linear discriminant function (score function) of the input vector: g(x) = g(x | w) = w^T x = w_1 x_1 + ... + w_d x_d. For this linear threshold element, the discriminant plane given by g(x) = b corresponds to the separation plane that separates the input space. Here, w is a parameter corresponding to the synaptic weights, and b is a threshold. Geometrically, this corresponds to dividing the input space in two by the separation plane (discriminant plane).

なお、図６〜図１１では、この入力ベクトルの集合は、線形分離可能であるとする。すなわち、線形しきい素子のパラメータをうまく調整することで、入力ベクトルの集合を誤りなくわけることができると仮定する。 In FIGS. 6 to 11, this set of input vectors is assumed to be linearly separable. That is, it is assumed that the set of input vectors can be divided without error by suitably adjusting the parameters of the linear threshold element.

入力ベクトルの集合が線形分離可能であるとしても、一般には、入力ベクトルの集合を誤りなくわけるパラメータは一意には決まらない。既存のサポートベクターマシンでは、入力ベクトルをすれすれに通るのではなく、なるべく余裕をもって分けるような分離面が求められる。具体的には、最も近い入力ベクトルとの余裕をマージンと呼ばれる量で測り、マージンが最大となるような分離面を求める。 Even if the set of input vectors is linearly separable, the parameters that divide the set without error are generally not uniquely determined. An existing support vector machine therefore seeks a separation plane that does not pass narrowly by the input vectors but divides them with as much room as possible. Specifically, the room between the plane and the nearest input vector is measured by a quantity called the margin, and the separation plane that maximizes the margin is obtained.

図７および図８は、既存のサポートベクトルマシンと、実施形態１に用いるマルチマージンサポートベクトルマシンとの学習法の数式の違いを説明するための概念図である。図７に示すように、既存のサポートベクトルマシンでは、入力ベクトルの集合が線形分離可能であれば、実線で示される分離面の両側の点線で示される２枚の超平面で入力ベクトルの集合が完全に分離され、２枚の超平面の間には入力ベクトルがひとつも存在しない状態になる。このとき、分離面とこれらの超平面との距離（マージンの大きさ）は、１／||ｗ||となる。また、両側のマージンの合計は、図７において両方向矢印で示すように、２／||ｗ||となる。 FIG. 7 and FIG. 8 are conceptual diagrams for explaining the difference in the mathematical expression of the learning method between the existing support vector machine and the multi-margin support vector machine used in the first embodiment. As shown in FIG. 7, in the existing support vector machine, if the set of input vectors can be linearly separated, the set of input vectors is formed by two hyperplanes indicated by dotted lines on both sides of the separation plane indicated by the solid line. It is completely separated and no input vector exists between the two hyperplanes. At this time, the distance (margin size) between the separation plane and these hyperplanes is 1 / || w ||. Further, the sum of the margins on both sides is 2 / || w || as shown by the double-headed arrow in FIG.
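The margin figures quoted above, 1/||w|| on each side and 2/||w|| in total, can be checked numerically. In the sketch below the two dotted hyperplanes are taken to be w^T x = b + 1 and w^T x = b − 1, which is the usual normalization and is assumed here rather than read from FIG. 7; the values of w and b are arbitrary examples.

```python
import numpy as np

def distance_to_plane(x, w, b):
    """Euclidean distance from point x to the plane w^T x = b."""
    return abs(w @ x - b) / np.linalg.norm(w)

def margin(w):
    """Distance between the separation plane and either bounding hyperplane."""
    return 1.0 / np.linalg.norm(w)

# Hypothetical parameters: ||w|| = 5, so the one-sided margin is 0.2.
w = np.array([3.0, 4.0])
b = 2.0

# One point on each bounding hyperplane w^T x = b + 1 and w^T x = b - 1.
x_upper = w * (b + 1.0) / (w @ w)
x_lower = w * (b - 1.0) / (w @ w)

one_sided = distance_to_plane(x_upper, w, b)
total = distance_to_plane(x_upper, w, b) + distance_to_plane(x_lower, w, b)
```

Since the margin depends on w only through its norm, making ||w|| smaller is what widens the gap, which is why the learning problem is stated as a minimization of (1/2)||w||^2.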

In an existing support vector machine, the problem of finding the parameters $w$ and $b$ that maximize the margin (that is, that define the separation plane with the largest gap) is equivalent to the problem of finding the $w$ and $b$ that minimize $(1/2)\|w\|^{2}$. The constraints require input vectors of class (−) to lie below the separation plane and input vectors of class (+) to lie above it. This optimization problem is known as a quadratic programming problem in the field of mathematical programming and can be solved by various known numerical methods. However, an existing support vector machine can generate only one separation plane per machine.

In contrast, as shown in FIG. 8, the multi-margin support vector machine used in Embodiment 1 defines separation planes by the same approach as above, but differs from an existing support vector machine in that two or more separation planes are generated and in that these planes are parallel to one another. FIG. 8 shows two separation planes as an example. The parameter $w$ is the same for all of the separation planes, whereas the thresholds take the different values $b_1$ and $b_2$. In FIG. 8, the lower separation plane has threshold $b_1$ and the upper separation plane has threshold $b_2$.

The question is then how to determine the value of $w$. In the multi-margin support vector machine developed by the present inventors, $w$ is determined by finding the separation planes that maximize the narrowest gap. That is, $(1/2)\|w\|^{2}$ is minimized subject to the conditions that $w$ is the same for all of the two or more separation planes and that no input vector lies within the margin region on either side of any of the separation planes.
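
The role of the narrowest gap is easiest to see in one dimension. The following is a minimal sketch, assuming scalar inputs and classes that are already ordered and separable; in that special case each pair of adjacent classes needs $w \cdot \text{gap} \ge 2$, so the smallest feasible $w$ is fixed by the narrowest gap alone. The function name and data are illustrative, not from the original.

```python
def narrowest_gap_w(classes):
    """Return (w, gaps) for a 1-D hard-margin multi-margin problem.

    classes: list of lists of scalars, ordered so that every value in
    class m is smaller than every value in class m+1 (an assumption).
    """
    gaps = [min(upper) - max(lower) for lower, upper in zip(classes, classes[1:])]
    if min(gaps) <= 0:
        raise ValueError("classes must be ordered and separable")
    # Each adjacent pair requires w * gap >= 2; minimising w**2 therefore
    # gives w = 2 / (narrowest gap).
    return 2.0 / min(gaps), gaps


w, gaps = narrowest_gap_w([[0.0, 1.0], [3.0, 4.0], [8.0, 9.0]])
print(w, gaps)
```

In higher dimensions the direction of $w$ must also be optimized, which is why a quadratic programming solver is needed in general.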

That is, in the multi-margin support vector machine developed by the present inventors, the problem of finding the parameters $w$ and $b_1, b_2, \dots, b_{M-1}$ that maximize the margin (that is, that define two or more mutually parallel separation planes maximizing the narrowest gap) is equivalent to the problem of finding the $w$ and $b_1, b_2, \dots, b_{M-1}$ that minimize $(1/2)\|w\|^{2}$. The constraints require input vectors of class (1) to lie below separation plane 1, input vectors of class (2) to lie above separation plane 1 and below separation plane 2, and so on, up to input vectors of class (M), which must lie above separation plane (M−1). This optimization problem is known as a quadratic programming problem in the field of mathematical programming and can be solved by various known numerical methods.

As described above, the input vectors are d-dimensional vectors and the separation unit generates two or more mutually parallel separation planes. Accordingly, the classification prediction determination unit 112 shown in FIG. 1 likewise generates an unknown vector by vectorizing the unknown information into d dimensions, and predicts the class of the biological information corresponding to the unknown vector on the basis of the region in which the unknown vector lies, among the three or more regions into which the input space is divided by the two or more separation planes.
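
Because the separation planes are parallel, finding the region that contains an unknown vector reduces to comparing one score with a sorted list of thresholds. The following is a minimal sketch, assuming a linear score $w^{T}x$ and thresholds sorted in ascending order; `predict_class` and the example values are hypothetical and are not the unit 112 implementation.

```python
from bisect import bisect_right


def score(w, x):
    """Linear score w.x of an input vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predict_class(w, biases, x):
    """Return the class index (1..M) of the region containing x.

    biases: thresholds b_1 <= ... <= b_{M-1} shared by the parallel planes.
    A score exactly on a plane is assigned to the upper class; this
    tie-break is an arbitrary choice made for this sketch.
    """
    return bisect_right(biases, score(w, x)) + 1


w = [1.0, 0.0]
biases = [2.0, 6.0]
print([predict_class(w, biases, x) for x in ([1.0, 5.0], [3.0, 0.0], [7.0, 0.0])])
```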

FIG. 9 is a conceptual diagram for explaining the difference in the formulation of the learning method between an existing support vector machine and the multi-margin support vector machine used in Embodiment 1. In an existing support vector machine, the parameters are set so as to minimize $(1/2)\|w\|^{2}$. The constraints are $g(x_i \mid w) - b \le -1$ for input vectors of class (−) and $g(x_i \mid w) - b \ge +1$ for input vectors of class (+), where $g(x \mid w) = w^{T}x$.

In the multi-margin support vector machine developed by the present inventors, the parameters are likewise set so as to minimize $(1/2)\|w\|^{2}$. The constraints are $g(x_i \mid w) - b_1 \le -1$ for input vectors of class (1); $g(x_i \mid w) - b_1 \ge +1$ and $g(x_i \mid w) - b_2 \le -1$ for input vectors of class (2); and so on, up to $g(x_i \mid w) - b_{M-1} \ge +1$ for input vectors of class (M). Here $g(x \mid w) = w^{T}x$.
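
Collected into a single statement, the hard-margin multi-margin problem described above can be written as follows. This is a restatement under one notational assumption: $I_m$ denotes the index set of the training vectors belonging to class $m$, a symbol the description introduces only later for the soft-margin case.

```latex
\begin{aligned}
\min_{w,\; b_1,\dots,b_{M-1}} \quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to} \quad
  & w^{\mathsf T} x_i - b_m \le -1,     && i \in I_m,\quad m = 1,\dots,M-1, \\
  & w^{\mathsf T} x_i - b_{m-1} \ge +1, && i \in I_m,\quad m = 2,\dots,M.
\end{aligned}
```

For $M = 2$ there is a single bias $b_1$, and the problem reduces to the two-class constraints of the existing support vector machine.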

FIG. 10 is a conceptual diagram for explaining the problem that the optimal bias is indeterminate in the multi-margin support vector machine. In the multi-margin support vector machine developed by the present inventors, the optimal values of biases such as $b_1$ and $b_2$ may not be uniquely determined. This is because the bias is uniquely determined for the separation plane whose margin is the smallest, but can take a range of values for a separation plane whose margin is not the smallest. It is therefore considered preferable to establish a criterion that determines each bias uniquely.
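
The indeterminacy can be made concrete in one dimension. The following is a minimal sketch, assuming scalar inputs, ordered separable classes, and $w$ fixed by the narrowest gap; it lists the interval of bias values that satisfy the hard-margin constraints for each separation plane. The function name and data are illustrative only.

```python
def feasible_bias_intervals(classes):
    """Return [(low, high), ...], the feasible range of each bias b_m in 1-D.

    classes: list of lists of scalars, ordered and separable (an assumption).
    The constraints w*x - b_m <= -1 (class m) and w*x - b_m >= +1 (class m+1)
    give w*max(class m) + 1 <= b_m <= w*min(class m+1) - 1.
    """
    pairs = list(zip(classes, classes[1:]))
    gaps = [min(upper) - max(lower) for lower, upper in pairs]
    if min(gaps) <= 0:
        raise ValueError("classes must be ordered and separable")
    w = 2.0 / min(gaps)
    return [(w * max(lower) + 1.0, w * min(upper) - 1.0) for lower, upper in pairs]


print(feasible_bias_intervals([[0.0, 1.0], [3.0, 4.0], [8.0, 9.0]]))
```

In this example the plane at the narrowest gap admits exactly one bias value, while the other plane admits a whole interval, which is the situation the paragraph above describes.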

FIG. 11 is a conceptual diagram for explaining how the bias is centered in the multi-margin support vector machine in order to solve the problem described with reference to FIG. 10. As the criterion for determining the bias uniquely, in the multi-margin support vector machine developed by the present inventors, the bias of a separation plane whose margin is not the smallest is set so that the plane lies midway between the two hyperplanes on either side of it. That is, in the case of $b_1$ in FIG. 11, every input vector is projected orthogonally onto the axis of the score function $g(x)$, the resulting value is taken as its score, and $b_1$ is set to the midpoint between the maximum score of class (1) and the minimum score of class (2).
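
The centering rule can be sketched directly from the scores. The following is a minimal illustration, assuming the scores $g(x_i)$ have already been computed and grouped by class in ascending class order; `centred_biases` is a hypothetical name, not a component of the apparatus.

```python
def centred_biases(scores_by_class):
    """Return biases b_1..b_{M-1}, each midway between adjacent classes.

    scores_by_class: list of lists of scores g(x_i), one list per class,
    ordered from class 1 to class M (an assumption of this sketch).
    b_m is the midpoint of the maximum score of class m and the minimum
    score of class m+1.
    """
    return [
        (max(lower) + min(upper)) / 2.0
        for lower, upper in zip(scores_by_class, scores_by_class[1:])
    ]


print(centred_biases([[0.0, 1.0], [3.0, 4.0], [8.0, 9.0]]))
```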

The reason for centering the bias in this way is as follows. The original purpose of a learning machine, including the multi-margin support vector machine developed by the present inventors, is to minimize the rate at which test samples (input vectors whose class is unknown) are misclassified (the misclassification rate). A detailed account of why placing the bias (threshold) that defines a separation plane at the center reduces the misclassification rate is omitted here. In terms of the well-known VC theory, the goal is to minimize the expected loss, but only the empirical loss can be known from input vectors whose class is known, so the empirical loss has to be minimized instead. This is a known method called empirical risk minimization (ERM). In other words, centering the bias is the most reasonable way to minimize the empirical loss in the situation described above, and the multi-margin support vector machine developed by the present inventors therefore places the bias at the center.

FIG. 12 is a functional block diagram showing the configuration of the support vector machine in detail. The support vector machine 606 is provided with a parameter vector setting unit 702 that sets the parameter vector $w$ described above. The calculation by which the parameter vector setting unit 702 sets $w$ has already been described and is not repeated here.

The support vector machine 606 is also provided with a bias setting unit 704 that sets the biases $b_1, b_2, \dots, b_{M-1}$ described above. The calculation by which the bias setting unit 704 sets the biases has already been described and is not repeated here.

The support vector machine 606 is further provided with a soft margin unit 706 that extends the support vector machine by the soft margin method. The configuration and operation of the soft margin unit 706, which sets the values of the slack function when the support vector machine is extended by the soft margin method, are described in detail later with reference to FIGS. 13 to 16.

FIG. 13 is a conceptual diagram for explaining a case in which linear separation by a support vector machine is impossible. The discussion above of the existing support vector machine and of the multi-margin support vector machine developed by the present inventors assumed that the set of input vectors is linearly separable. In real pattern recognition problems, however, as shown in FIG. 13, a set of training input vectors may fail to be linearly separable, whether it has been classified into two classes or into M classes (M being an integer of 3 or more).

In such a linearly non-separable case, no parameters satisfy the constraints in the formulas of FIG. 9 above, and the problem has no solution. Further refinement is therefore needed before a support vector machine can be applied to practical problems. Apart from the kernel trick described later, the first approach that comes to mind is to relax the constraints so that some classification errors are permitted. This is called the "soft margin".

FIG. 14 is a conceptual diagram for explaining the relaxation of a support vector machine by the soft margin method. The soft margin method maximizes the margin $1/\|w\|$ ($2/\|w\|$ in total on both sides) while allowing some input vectors to cross hyperplane H1 or hyperplane H2 into the opposite side, as shown in FIG. 14. If the distance by which an input vector intrudes into the opposite side is written as $\xi_i/\|w\|$ using a parameter $\xi_i \ge 0$, the sum of these distances should be as small as possible.

FIG. 15 is a conceptual diagram for explaining the learning method used when a support vector machine is relaxed by the soft margin method. In the soft margin method for an existing support vector machine, the problem of finding the optimal separation plane under the above conditions reduces to finding the parameters that minimize $(1/2)\|w\|^{2} + C\sum_{i=1}^{N}\xi_i$, where $\sum_{i=1}^{N}\xi_i$ denotes the sum of $\xi_1$ through $\xi_N$. The newly introduced parameter $C$ is a constant that sets the balance between the size of the margin in the first term and the extent of the intrusion in the second term.

The constraints are $g(x_i \mid w) - b \le -1 + \xi_i$ for input vectors of class (−) and $g(x_i \mid w) - b \ge +1 - \xi_i$ for input vectors of class (+), where $g(x \mid w) = w^{T}x$.

In the multi-margin support vector machine developed by the present inventors, the problem of finding the optimal separation planes under the above conditions reduces to finding the parameters that minimize $(1/2)\|w\|^{2} + C\left\{\sum_{m=1}^{M-1}\sum_{i \in I_m}\xi_i^{-} + \sum_{m=2}^{M}\sum_{i \in I_m}\xi_i^{+}\right\}$. Here $\sum_{i \in I_m}\xi_i$ denotes the sum of $\xi_i$ over the indices $i$ in $I_m$, the index set of the observations belonging to class $m$. The newly introduced parameter $C$ is a constant that sets the balance between the size of the margin in the first term and the extent of the intrusion in the second term.

The constraints are $g(x_i \mid w) - b_1 \le -1 + \xi_i^{-}$ for input vectors of class (1); $g(x_i \mid w) - b_1 \ge +1 - \xi_i^{+}$ and $g(x_i \mid w) - b_2 \le -1 + \xi_i^{-}$ for input vectors of class (2); and so on, up to $g(x_i \mid w) - b_{M-1} \ge +1 - \xi_i^{+}$ for input vectors of class (M). Here $g(x \mid w) = w^{T}x$.
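
For fixed $w$ and biases, the smallest slack values that satisfy these constraints are simple hinge terms, so the objective can be evaluated without a solver. The following is a minimal sketch under that observation; it evaluates the soft-margin multi-margin objective for given parameters and does not perform the minimization itself. All names and data are illustrative.

```python
def mm_svm_soft_objective(w, biases, samples, C):
    """Evaluate (1/2)||w||^2 + C * (sum of xi^- and xi^+) for fixed w, biases.

    samples: list of (x, m) pairs with class label m in 1..M.
    biases:  [b_1, ..., b_{M-1}], so M = len(biases) + 1.
    The slacks are taken at their smallest feasible values (an assumption
    that holds at the optimum, since C > 0 penalises any excess).
    """
    M = len(biases) + 1
    penalty = 0.0
    for x, m in samples:
        g = sum(wi * xi for wi, xi in zip(w, x))
        if m <= M - 1:
            # Plane above class m: g - b_m <= -1 + xi^-
            penalty += max(0.0, g - biases[m - 1] + 1.0)
        if m >= 2:
            # Plane below class m: g - b_{m-1} >= +1 - xi^+
            penalty += max(0.0, 1.0 - (g - biases[m - 2]))
    return 0.5 * sum(wi * wi for wi in w) + C * penalty


samples = [([1.0], 1), ([2.5], 1), ([3.0], 2), ([8.0], 3)]
print(mm_svm_soft_objective([1.0], [2.0, 6.0], samples, C=2.0))
```

In this example only the second sample crosses its margin, by 1.5 in score units, so it alone contributes to the penalty term.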

FIG. 16 is a conceptual diagram for explaining how the learning problem is solved when a support vector machine is relaxed by the soft margin method. Whether the existing support vector machine or the multi-margin support vector machine developed by the present inventors is relaxed, the resulting problem is simply a kind of optimization problem.

The solution of these optimization problems is not described in detail, but, as in the linearly separable case, they can basically be solved by known calculation methods using the formulas shown in FIG. 16.

FIG. 17 is a flowchart for explaining the operation of the biological information processing apparatus 100 according to the present embodiment. When the operation of the biological information processing apparatus 100 starts, the known information acquisition unit 102 first acquires, from outside, known biological information that has already been classified into three or more classes (S102). Next, the feature amount extraction unit 118 adjusts the number of feature dimensions of the acquired biological information of known class by extracting features according to a predetermined extraction criterion (S104). The input vector generation unit 206 then generates input vectors in the input space from the biological information from which the features have been extracted (S106).

Thereafter, on the basis of the set of generated input vectors and the class corresponding to each input vector, the separation unit 214 generates two or more mutually parallel separation planes in the input space (S108). On the basis of the generated separation planes, the classification criterion generation unit 106 sets a classification criterion for predicting the class of an input vector whose class is unknown (an unknown vector) (S110). The output unit 116 outputs the classification criterion thus set (S112).

Next, the unknown information acquisition unit 110 acquires, from outside, biological information whose class is unknown (S114). The feature amount extraction unit 118 then adjusts the number of feature dimensions of the acquired biological information of unknown class by extracting features according to the predetermined extraction criterion (S116). An unknown vector whose class is unknown is then generated on the basis of the extracted features (S118).

The predicted class of the unknown vector is then determined by applying the unknown vector to the classification criterion that has been set (S120). The output unit 116 outputs the determined class prediction (S122), and the series of operations ends.

The operational effects of the biological information processing apparatus 100 are described below.
According to the biological information processing apparatus 100, when biological information is classified into three or more classes, the multi-margin support vector machine developed by the present inventors is used to divide the input space, by two or more mutually parallel separation planes, into three or more regions corresponding respectively to the three or more classes. High discriminability can thereby be achieved while "over-learning" or "overfitting" is avoided.

That is, the multi-margin support vector machine (MM-SVM) developed by the present inventors divides the input space into three or more regions by generating two or more mutually parallel separation planes in the input space, and can therefore reduce the risk of "over-learning" or "overfitting" that tended to arise with existing support vector machines. That this risk can be reduced has been verified experimentally.

In general, much biological information is ordered data (such as drug resistance, disease severity, and survival rate). Consequently, the classes that physicians, dentists, nurses, clinical laboratory technicians, engineers at contract clinical laboratories, researchers in the field of biology, and others determine case by case by integrating these various kinds of biological information are also often ordered, for example low, medium, and high. Generating two or more mutually parallel separation planes, as the multi-margin support vector machine developed by the present inventors does, therefore allows ordered data to be classified appropriately.

In addition, because the multi-margin support vector machine developed by the present inventors achieves high discrimination performance through various refinements, it can improve the discriminability of the prediction or classification criterion when biological information is classified into three or more classes. The biological information processing apparatus 100 can therefore improve the validity of the classification criterion when biological information is classified into three or more classes and, in turn, the validity of the prediction.

As shown in FIG. 18, in the multi-margin support vector machine developed by the present inventors, two or more separation planes are defined by two or more biases applied to a single score function. The score of this single score function can therefore serve as a kind of evaluation index for ordered medical observation data. That is, even vectors belonging to the same class can be given a kind of ordering within the ordered medical observation data according to the score function value obtained when each vector is, conceptually, projected orthogonally onto the axis of the score function.

Furthermore, according to the biological information processing apparatus 100, the multi-margin support vector machine developed by the present inventors maximizes the margin width of the separation plane whose margin width is the smallest among the two or more mutually parallel separation planes, so that the input vectors or feature vectors are separated with as much clearance as possible. This improves the discriminability and generalization of the prediction or classification criterion when biological information is classified into three or more classes.

Also, according to the biological information processing apparatus 100, the multi-margin support vector machine developed by the present inventors places each separation plane at the center of its margin width, so that the separation planes are determined uniquely. This improves the generalization of the prediction or classification criterion when biological information is classified into three or more classes.

More specifically, according to the biological information processing apparatus 100, the two or more mutually parallel separation planes may be defined by a score function expressed as $b = w^{T}x$ (where $x$ is an input vector or feature vector, $w$ is a parameter vector, $b$ is a bias, and $T$ denotes transposition), in which $w$ is a parameter vector common to the two or more separation planes and $b$ is a bias that takes a different value for each of them. The two or more separation planes expressed by $b = w^{T}x$ are therefore parallel to one another, and high discriminability can be achieved while "over-learning" or "overfitting" is avoided. This improves the validity of the prediction or classification criterion when biological information is classified into three or more classes.

Moreover, according to the biological information processing apparatus 100, the support vector machine may be one extended by the soft margin method. Even when it is difficult to separate the input vectors or feature vectors linearly, the soft margin method makes linear separation possible, which improves the generalization of the prediction or classification criterion when biological information is classified into three or more classes.

<Embodiment 2>
The present embodiment is basically a modification in which Embodiment 1 is extended by the kernel trick method, and, unless otherwise noted, has the same configuration as Embodiment 1. To clarify the significance of the present embodiment, an outline of the kernel trick method is given first.

When training samples (a set of input vectors) are learned with a learning machine such as the multi-margin support vector machine developed by the present inventors, using the soft margin method as described above makes it possible to obtain the parameters of a linear threshold element (separation plane) even when the samples are not linearly separable.

However, even with the soft margin method, a multi-margin support vector machine with good performance cannot always be constructed for a set of input vectors that is intrinsically nonlinear and complex. One way of handling an intrinsically nonlinear set of input vectors, as shown in FIG. 20 described later, is to transform the input vectors nonlinearly into corresponding feature vectors and perform linear discrimination in the resulting space, a method called the "kernel trick". Using this method can dramatically improve the performance of the multi-margin support vector machine.

In general, linear separation becomes harder as the number of samples increases; conversely, if the dimension of the feature vectors is larger than the number of input vector samples, any assignment of classes is linearly separable. However, to make a nonlinear and complex set of input vectors linearly separable, the vectors must be mapped into a space whose dimension is about as large as the number of samples, which ends up requiring an enormous amount of computation.

If the above calculation is carried out using a function called a positive semidefinite kernel function, the vectors are mapped into a high-dimensional space, yet the explicit computation of feature vectors in the mapped space is avoided and the optimal discriminant function can be constructed from kernel evaluations alone. This technique is referred to here as the "kernel trick".

FIG. 19 is a functional block diagram showing an outline of the configuration of the classification criterion generation unit 106 provided in the biological information processing apparatus. The overall configuration of the biological information processing apparatus of the present embodiment is the same as in Embodiment 1, and its description is omitted. In FIG. 19 and the subsequent drawings, components that are the same as in Embodiment 1 are given the same reference numerals, and their description is omitted as appropriate.

The classification criterion generation unit 106 is provided with a conversion unit 302 that converts input vectors to generate feature vectors. The conversion unit 302 acquires the input vectors from the input vector storage unit 208 and applies a nonlinear transformation to generate the feature vectors. The generated feature vectors are stored in the feature vector storage unit 304 and sent to the separation unit 214. The known class corresponding to each feature vector is sent from the classification storage unit 212 to the separation unit 214.

That is, the conversion unit 302 transforms the plurality of input vectors in the input space by a nonlinear mapping, thereby generating a plurality of feature vectors, each corresponding to one of the input vectors, in a feature space of higher dimension than the input space. This nonlinear transformation is described in detail later with reference to FIGS. 22 to 25.

As shown in FIG. 20 described later, the separation unit 214 uses a learning machine such as the multi-margin support vector machine developed by the present inventors to generate, in the feature space and on the basis of the acquired feature vectors of known class, two or more mutually parallel separation planes for predicting the class of a feature vector.

The generated separation planes are stored in the separation surface storage unit 216, passed unchanged to the classification reference output unit 218, and output to and stored in the classification reference storage unit 108 outside the classification reference generation unit 106.

FIG. 20 is a conceptual diagram for explaining how the kernel trick operates in the classification reference generation unit. It shows the kernel trick for a classification problem in a two-dimensional space. When the set of input vectors in the two-dimensional input space on the left is mapped nonlinearly using (x₁², √2·x₁x₂, x₂²) as features, the two classes can be separated by a linear plane in the resulting set of feature vectors in the three-dimensional feature space on the right. This corresponds to constructing an elliptical decision boundary in the input space.
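The mapping of FIG. 20 can be checked numerically. The following is a minimal sketch, not the patent's implementation: the sample points, the plane coefficients `w` and `b`, and all function names are assumptions chosen for illustration. It shows that the inner product of two mapped vectors equals the squared inner product of the original vectors, and that a plane in the three-dimensional feature space acts as a circular (a special case of elliptical) boundary in the input space.

```python
import math

def feature_map(x):
    """Map a 2-D input vector to the 3-D feature space (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def quadratic_kernel(x, y):
    """Kernel equal to the feature-space inner product, computed in the input space."""
    return dot(x, y) ** 2

# Illustrative points (assumed, not from the patent): two inside and two outside
# the unit circle, which is not a linearly separable arrangement in 2-D.
inner = [(0.2, 0.1), (-0.3, 0.4)]
outer = [(1.5, 0.0), (-1.0, 1.2)]

# Hypothetical plane w . phi(x) = b in the feature space. With w = (1, 0, 1)
# it reads x1^2 + x2^2 = b, i.e. a circle in the input space.
w, b = (1.0, 0.0, 1.0), 1.0

def side(x):
    """Return 1 if the mapped point lies above the plane, otherwise 0."""
    return 1 if dot(w, feature_map(x)) > b else 0

for p in inner + outer:
    print(p, side(p))
```

The design point is that `quadratic_kernel` never builds the three-dimensional vector, yet yields the same inner product as `feature_map` does.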

FIG. 21 is a functional block diagram showing the detailed configuration of the conversion unit and the separation unit. In the conversion unit 302, the input vector acquisition unit 502 acquires an input vector from the input vector storage unit 208 and sends it to the nonlinear conversion unit 504. The nonlinear conversion unit 504 transforms the input vector in the input space by a nonlinear mapping into a feature vector in a higher-dimensional feature space. The feature vector output unit 506 outputs the generated feature vector to the feature vector storage unit 304 outside the conversion unit 302, where it is stored.

In the separation unit 214, the feature vector acquisition unit 702 acquires a feature vector from the feature vector storage unit 304 and sends it to the support vector machine 606. The classification acquisition unit 604 acquires the known classification corresponding to that feature vector from the classification storage unit 212 and also sends it to the support vector machine 606. As in Embodiment 1, the support vector machine 606 uses the multi-margin support vector machine developed by the present inventors to generate two or more mutually parallel separation planes that linearly separate the set of feature vectors in the feature space into three or more classes. The separation surface output unit 608 outputs the generated separation planes to the separation surface storage unit 216 outside the separation unit 214, where they are stored.

FIG. 22 is a conceptual diagram showing examples of kernel functions used to extend the multi-margin support vector machine developed by the present inventors. In the multi-margin support vector machine extended nonlinearly with the kernel trick, the criterion of margin maximization at the separation plane with the smallest margin automatically selects only the kernels (kernel features) that correspond to the small number of feature vectors near the separation planes. The optimal discriminant function (separation plane) is constructed from these kernels.

A kernel function suitable for extending the multi-margin support vector machine is one that satisfies positive semidefiniteness. Using such a kernel function to perform the computation that corresponds to converting input vectors into feature vectors (that is, applying the kernel trick) not only simplifies the computation but also ensures that the result computed with the kernel corresponds uniquely to the result that would be obtained without the kernel.

Suitable kernel functions that satisfy positive semidefiniteness include, as shown in FIG. 22, the linear kernel function, the polynomial kernel function, and the RBF kernel function. The definition of positive semidefiniteness is shown in FIG. 21. In principle, any kernel function that satisfies this definition may be used, not only those listed here.
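As a rough illustration of the three kernel families named above, the sketch below builds a kernel (Gram) matrix and spot-checks the standard definition of positive semidefiniteness, namely that the quadratic form cᵀKc is non-negative for every coefficient vector c. The parameter values (`degree`, `coef0`, `gamma`), the sample points, and the function names are assumptions. Random sampling can only refute positive semidefiniteness, not prove it, so this is a sanity check rather than a proof.

```python
import math
import random

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (linear_kernel(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def gram_matrix(kernel, points):
    """Kernel matrix K with K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(K, c):
    n = len(K)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

def passes_psd_spot_check(K, trials=200, seed=0, tol=1e-9):
    """Return False as soon as some random c gives c^T K c < 0 (beyond tolerance)."""
    rng = random.Random(seed)
    n = len(K)
    for _ in range(trials):
        c = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if quadratic_form(K, c) < -tol:
            return False
    return True

# Assumed sample points for illustration only.
points = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, -1.0)]

def negated_linear(x, y):
    """A deliberately invalid 'kernel' used as a counterexample."""
    return -linear_kernel(x, y)

for name, k in [("linear", linear_kernel), ("polynomial", polynomial_kernel),
                ("rbf", rbf_kernel), ("negated linear", negated_linear)]:
    print(name, passes_psd_spot_check(gram_matrix(k, points)))
```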

FIG. 23 is a conceptual diagram showing an example of the mathematical formulation of the kernel-based learning method used to extend the multi-margin support vector machine developed by the present inventors. A detailed explanation of the computation is omitted. The multi-margin support vector machine can be extended by the kernel trick with known computational methods, using the formulas shown in FIG. 22 for the kernel-based learning algorithm, the kernel-based score function, the matrix and vector notation, the empirical map, the index sets, the kernel matrix, the class label matrix, and the submatrices of the class label matrix.

FIG. 24 is a conceptual diagram showing an example of the mathematical formulation used to compute the bias when a soft margin is introduced into the kernel-based learning method for extending the multi-margin support vector machine developed by the present inventors. In this way, the multi-margin support vector machine can be extended by the kernel trick and then further relaxed by introducing a soft margin.

The soft margin was explained in Embodiment 1, so the explanation is not repeated here. A person skilled in the art can recover the slack variables ξ+ and ξ− and compute the bias by known computational methods, using the formulas shown in FIG. 23.

FIG. 25 is a flowchart for explaining the operation of the biological information processing apparatus when the multi-margin support vector machine developed by the present inventors is extended by the kernel trick. When the biological information processing apparatus 100 starts operating, the known information acquisition unit 102 first acquires, from outside, known biological information that has already been classified into three or more classes (S202).

Next, the feature quantity extraction unit 118 adjusts the number of feature dimensions of the acquired biological information of known classification by extracting feature quantities according to a predetermined extraction criterion (S204). The input vector generation unit 206 then generates input vectors in the input space from the biological information of known classification whose feature quantities have been extracted (S206).

The conversion unit 302 then applies the nonlinear transformation to the generated input vectors in the input space to generate feature vectors in the higher-dimensional feature space (S208). After that, the separation unit 214 generates two or more mutually parallel separation planes in the feature space, based on the set of generated feature vectors and the classification corresponding to each feature vector (S210).

Next, based on the generated separation planes, the classification reference generation unit 106 sets a classification criterion for predicting the classification of an unknown vector, described later (S212). The classification reference output unit 218 outputs the classification criterion thus set (S214).

After that, the unknown information acquisition unit 110 acquires, from outside, biological information whose classification is unknown (S216). The feature quantity extraction unit 118 then extracts feature quantities from this biological information according to the predetermined extraction criterion, which adjusts the number of feature dimensions (S218). An unknown vector, whose classification is unknown, is generated from the extracted feature quantities (S220).

The classification prediction determination unit 112 then transforms this unknown vector by the nonlinear mapping to generate a transformed unknown vector (S222). The classification prediction determination unit 112 determines the predicted classification of the transformed unknown vector by applying it to the classification criterion that has been set (S224). The output unit 116 then outputs the determined classification prediction (S226), and the series of operations ends.

The operational effects specific to the biological information processing apparatus 100 when the multi-margin support vector machine developed by the present inventors is extended by the kernel trick are described below. Effects not specifically mentioned are the same as in Embodiment 1.

When the multi-margin support vector machine developed by the present inventors is extended by the kernel trick, biological information is classified into three or more classes as follows: the feature vectors corresponding to the input vectors are generated in a feature space of higher dimension than the input space, and the feature space is then separated by two or more mutually parallel separation planes into three or more regions corresponding respectively to the three or more classes. This achieves high discriminative power while avoiding "overtraining" or "overfitting".

In this case, the unknown vector is transformed by the nonlinear mapping to generate a transformed unknown vector in the feature space. The predicted classification of the transformed unknown vector can then be determined within the feature space using separation planes that have high discriminative power while avoiding "overtraining" or "overfitting". This improves the validity of predictions when biological information is classified into three or more classes.

Furthermore, when the multi-margin support vector machine developed by the present inventors is extended by the kernel trick, the method known as the "kernel trick" can be used even if the input vectors form an inherently nonlinear set: the input vectors are transformed nonlinearly into their corresponding feature vectors, and linear discrimination is performed in that space. With this method, if the dimension of the feature vectors is made larger than the number of input vector samples, any classification pattern becomes linearly separable. The method can therefore dramatically improve the performance of the multi-margin support vector machine.

In addition, when the multi-margin support vector machine developed by the present inventors is extended by the kernel trick, the conversion unit uses a kernel function that satisfies positive semidefiniteness to perform the computation corresponding to converting input vectors into feature vectors. The optimal discriminant function can therefore be constructed from kernel evaluations alone: the data are mapped into a high-dimensional space, yet the feature vectors in the mapped space never actually have to be computed. This improves the generalization of the prediction or the classification criterion when biological information is classified into three or more classes.
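The following sketch illustrates how a prediction could be made from kernel evaluations alone when the decision rule consists of parallel separation planes. It is an assumption-laden illustration, not the learning algorithm of FIG. 23: the support vectors, coefficients, and thresholds are hypothetical hand-picked values rather than the output of training, and the function names are invented here. A single score is computed as a weighted sum of kernel values, and the class index is the number of thresholds (one per parallel plane) that the score has reached.

```python
import math
from bisect import bisect_right

def rbf_kernel(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def score(x, support_vectors, coefficients, kernel):
    """Kernel expansion f(x) = sum_i alpha_i * K(x_i, x); no feature vector is built."""
    return sum(a * kernel(sv, x) for sv, a in zip(support_vectors, coefficients))

def predict_class(x, support_vectors, coefficients, thresholds, kernel):
    """Parallel planes f(x) = b_k share one direction, so sorted thresholds
    b_1 < b_2 < ... split the score axis into ordered classes 0, 1, 2, ..."""
    return bisect_right(thresholds, score(x, support_vectors, coefficients, kernel))

# Hypothetical model values for illustration (not learned, not from the patent).
support_vectors = [(0.0,), (2.0,)]
coefficients = [-1.0, 1.0]
thresholds = [-0.5, 0.5]  # two parallel planes -> three ordered classes

for x in [(0.0,), (1.0,), (2.0,)]:
    print(x, predict_class(x, support_vectors, coefficients, thresholds, rbf_kernel))
```

Because every class boundary uses the same score function, the predicted classes are ordered along one axis, which matches the low, medium, high style of labels discussed in this document.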

Moreover, when the multi-margin support vector machine developed by the present inventors is extended by the kernel trick, one or more kernel functions selected from the group consisting of the linear kernel function, the polynomial kernel function, and the RBF kernel function can be used. Because each of these satisfies positive semidefiniteness, the generalization of the prediction or the classification criterion when biological information is classified into three or more classes can be improved.

<Modification of Embodiment 2>
FIG. 26 is a functional block diagram showing the detailed configuration of the classification reference generation unit used in a modification of Embodiment 2. This modification is basically the same as Embodiment 2 and has the same configuration except where specifically noted.

Unlike Embodiment 2, in this modification the separation planes generated in the feature space by the multi-margin support vector machine in the separation unit 214 are stored in the separation surface storage unit 216 and then sent to the inverse conversion unit 402.

The inverse conversion unit 402 applies the inverse of the nonlinear mapping described above, converting the two or more mutually parallel linear separation planes in the feature space into two or more nonlinear separation surfaces in the input space. In other words, by inverting the nonlinear mapping for the two or more separation planes generated by the separation unit 214, it generates in the input space two or more separation surfaces that divide the input space into three or more regions.
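As a worked illustration of this inverse conversion, assume the quadratic feature map of FIG. 20 and two hypothetical parallel planes with a shared normal vector w = (w₁, w₂, w₃) and offsets b₁ < b₂. These symbols are assumptions introduced here; kernels such as the RBF kernel have no finite explicit map, so this is only the simplest case.

```latex
\phi(x) = \bigl(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\bigr), \qquad
w \cdot \phi(x) = b_k \quad (k = 1, 2)

% Substituting \phi(x) gives the pre-image of each plane in the input space:
w_1 x_1^2 + \sqrt{2}\, w_2\, x_1 x_2 + w_3 x_2^2 = b_k
\;\Longleftrightarrow\;
x^{\top} A\, x = b_k, \qquad
A = \begin{pmatrix} w_1 & w_2/\sqrt{2} \\ w_2/\sqrt{2} & w_3 \end{pmatrix}

% If A is positive definite (w_1 > 0 and w_1 w_3 - w_2^2/2 > 0) and 0 < b_1 < b_2,
% the two pre-images are nested concentric ellipses, and the input space splits into
\{x^{\top} A x < b_1\}, \qquad
\{b_1 < x^{\top} A x < b_2\}, \qquad
\{x^{\top} A x > b_2\}
```

Under these assumptions, two parallel planes in the feature space become two nonlinear surfaces in the input space that divide it into three regions, one per class.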

The inversely converted separation surfaces are stored in the inverse-converted separation surface storage unit 408, sent to the classification reference output unit 218, and output to and stored in the classification reference storage unit 108 outside the classification reference generation unit 106.

FIG. 27 is a flowchart for explaining the operation of the biological information processing apparatus according to the modification of Embodiment 2. In this modification, when the biological information processing apparatus 100 starts operating, the known information acquisition unit 102 first acquires, from outside, known biological information that has already been classified into three or more classes (S302).

Next, the feature quantity extraction unit 118 adjusts the number of feature dimensions of the acquired biological information of known classification by extracting feature quantities according to a predetermined extraction criterion (S304). The input vector generation unit 206 then generates input vectors in the input space from the biological information of known classification whose feature quantities have been extracted (S306).

The conversion unit 302 then applies the nonlinear transformation to the generated input vectors in the input space to generate feature vectors in the higher-dimensional feature space (S308). After that, the separation unit 214 generates two or more mutually parallel separation planes in the feature space, based on the set of generated feature vectors and the classification corresponding to each feature vector (S310).

The inverse conversion unit 402 then inversely converts the generated separation planes in the feature space using the nonlinear mapping described above, generating inversely converted separation surfaces in the input space (S312). Based on these inversely converted separation surfaces, the classification reference generation unit 106 sets a classification criterion for predicting the classification of an unknown vector, described later (S314). The output unit 116 then outputs the generated classification criterion (S316).

Next, the unknown information acquisition unit 110 acquires, from outside, biological information whose classification is unknown (S318). The feature quantity extraction unit 118 then extracts feature quantities from this biological information according to the predetermined extraction criterion, which adjusts the number of feature dimensions (S320). An unknown vector, whose classification is unknown, is generated from the extracted feature quantities (S322).

The classification prediction determination unit 112 then determines the predicted classification of the unknown vector by applying the unknown vector to the classification criterion that has been set (S324). The output unit 116 outputs the determined classification prediction (S326), and the series of operations ends.

The operational effects specific to the biological information processing apparatus 100 in this modification are described below. Effects not specifically mentioned are the same as in Embodiment 2.

Inversely converting the separation planes obtained in the feature space by the nonlinear mapping described above makes it possible to generate nonlinear separation surfaces in the input space as well. Classification predictions with high discriminative power can therefore be determined for unknown vectors within the input space while avoiding "overtraining" or "overfitting", which improves the validity of predictions when biological information is classified into three or more classes.

<Summary>
FIG. 28 is a conceptual diagram for explaining the outline of a classification prediction system for ordered medical observation data that uses the biological information processing apparatus according to the embodiments described so far. As shown in FIG. 28, the biological information processing apparatus according to Embodiments 1 and 2 can be suitably used as a classification prediction system for ordered medical observation data.

For example, input data (known) are prepared for each of patients A, B, and so on by having doctors, nurses, clinical laboratory technicians, and others measure and compile gene expression data, blood test data, medical interview data, X-ray data, and other data. A doctor examines these input data and classifies the rank of the ordered medical observation data (such as drug resistance, disease severity, or survival rate) into one of three levels: low, medium, or high.

The input data of known classification obtained in this way are used for machine learning with the multi-margin support vector machine, the new technique developed by the present inventors, to generate a classification criterion. The prediction rate of the resulting classification criterion is estimated in advance by cross-validation.

To predict the disease of a new patient X, doctors, nurses, clinical laboratory technicians, and others likewise measure and compile gene expression data, blood test data, medical interview data, X-ray data, and other data for the new patient X to prepare input data of unknown classification. Applying these input data to the classification criterion above assigns the rank of the new patient X's ordered medical observation data (such as drug resistance, disease severity, or survival rate) to one of three levels: low, medium, or high.

Because this system uses the multi-margin support vector machine developed by the present inventors to classify biological information into three or more classes, the input space is separated by two or more mutually parallel separation planes into three or more regions corresponding respectively to the three or more classes. This achieves high discriminative power while avoiding "overtraining" or "overfitting".

In general, much biological information consists of ordered data (such as drug resistance, disease severity, and survival rate). Consequently, the classifications that doctors, dentists, nurses, clinical laboratory technicians, engineers at clinical contract companies, researchers in biology, and others reach case by case by integrating these various kinds of biological information are also often ordered data, such as low, medium, and high. Generating two or more mutually parallel separation planes, as the multi-margin support vector machine developed by the present inventors does, therefore allows ordered data to be classified appropriately.

This system therefore makes it possible to suppress "overtraining" or "overfitting" and to reach a highly reliable level determination when the rank of the new patient X's ordered medical observation data (such as drug resistance, disease severity, or survival rate) is assigned to one of three levels: low, medium, or high.

The embodiments of the present invention have been described above with reference to the drawings. They are examples of the present invention, and various configurations other than those described above can also be adopted.

FIG. 29 shows an example of prediction rates obtained by cross-validation to confirm the reliability of the biological information processing apparatus according to this example. In this example, the biological information processing apparatus 100 according to Embodiment 2 was used to predict anticancer drug resistance in 44 human cancer-derived cultured cells. As shown in the table of FIG. 25, the input data were data in which researchers' evaluations of the experimentally measured anticancer drug resistance of each cell had already been classified into three classes: low, medium, and high. The feature quantities originally contained in the input data were expression data for several thousand genes on the genome.

For 43 of these 44 input data items, feature quantities were selected according to predetermined criteria (feature parameters based on the t-test and feature parameters based on the correlation coefficient), and the top several feature quantities were extracted to generate input vectors from the input data. The input vectors were converted into feature vectors by a nonlinear mapping, and the computation described in Embodiment 2 was carried out with the multi-margin support vector machine developed by the present inventors, extended by the RBF kernel function and further relaxed by the soft margin method. As a result, two mutually parallel separation planes that divide the feature space into three regions were obtained. A separation criterion was set from the obtained separation planes. The remaining one of the 44 input data items was then applied to the separation criterion, and the resulting classification prediction was examined for validity.
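The correlation-based selection step might look like the sketch below, which ranks features by the absolute Pearson correlation between each feature and the measured resistance values and keeps the top few. The exact selection rule is not given in the text, so the ranking by absolute value, the function names, and the toy data (`gene_a` to `gene_d`) are all assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)

def top_features(expression, resistance, k):
    """Names of the k features most strongly correlated (in absolute value)
    with the resistance values. `expression` maps a feature name to its
    per-sample values, in the same sample order as `resistance`."""
    ranked = sorted(expression,
                    key=lambda name: abs(pearson(expression[name], resistance)),
                    reverse=True)
    return ranked[:k]

# Toy data (assumed): four samples, four candidate features.
resistance = [1.0, 2.0, 3.0, 4.0]
expression = {
    "gene_a": [2.0, 4.0, 6.0, 8.0],   # rises with resistance
    "gene_b": [5.0, 5.0, 5.0, 5.0],   # constant, carries no information
    "gene_c": [4.0, 3.0, 2.0, 1.5],   # falls with resistance
    "gene_d": [1.0, 3.0, 2.0, 2.0],   # weakly related
}
print(top_features(expression, resistance, 2))
```

In a leave-one-out setting, running this selection only on the training samples of each fold, as the text describes for the 43 training items, keeps the held-out sample from influencing which features are chosen.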

To measure the prediction rate of the separation criterion, the procedure of training on 43 of the 44 cells and predicting the remaining 1 cell was repeated 44 times. FIG. 29 tabulates the classification prediction results obtained using, as input vectors, the feature parameters based on the t-test and those based on the correlation coefficient. As FIG. 29 shows, the prediction rate with features selected by the t-test method was 93.2% (41/44). The prediction rate with features selected by the Pearson correlation coefficient method, which used the actual resistance values, was also 93.2% (41/44).
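The leave-one-out procedure described here can be sketched generically as below. The nearest-centroid classifier is only a stand-in so that the example runs; it is not the multi-margin support vector machine, and the toy samples and function names are assumptions. The last line simply recomputes the percentage implied by 41 correct predictions out of 44.

```python
def leave_one_out(samples, labels, train, predict):
    """Hold out each sample once, train on the rest, and count correct predictions."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct, len(samples)

def train_centroids(xs, ys):
    """Stand-in learner: the mean value of each class (1-D features)."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_centroid(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))

# Toy ordered data (assumed): three samples per level.
samples = [0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 2.0, 2.1, 2.2]
labels = ["low"] * 3 + ["medium"] * 3 + ["high"] * 3

correct, total = leave_one_out(samples, labels, train_centroids, predict_centroid)
print(correct, total)

# Percentage implied by the reported 41 correct predictions out of 44.
reported_rate = round(100 * 41 / 44, 1)
print(reported_rate)
```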

The cross-validation described above, in which the biological information processing apparatus 100 according to Embodiment 2 was used to predict anticancer drug resistance in 44 human cancer-derived cultured cells, showed experimentally that the classification criterion of this apparatus is highly valid and that classification predictions made with it are also highly valid.

The present invention has been described above on the basis of an example. This example is merely illustrative. Those skilled in the art will understand that various modifications are possible and that such modifications also fall within the scope of the present invention.

As described above, the biological information processing apparatus according to the present invention separates the input space or the feature space, by two or more mutually parallel separation planes, into three or more regions corresponding respectively to three or more classes. It therefore has the effect of improving the validity of the prediction or the classification criterion when biological information is classified into three or more classes, and is useful as a biological information processing apparatus, a biological information processing method, a biological information processing program, and the like.

A functional block diagram showing an outline of the configuration of the biological information processing apparatus according to Embodiment 1.
A conceptual diagram for explaining the operation of the biological information processing apparatus according to Embodiment 1.
A conceptual diagram for explaining the functional difference between an existing support vector machine and the multi-margin support vector machine used in Embodiment 1.
A functional block diagram showing details of the configuration of the classification criterion generation unit used in Embodiment 1.
A functional block diagram showing details of the configuration of the separation unit used in Embodiment 1.
A conceptual diagram for explaining the difference in learning method between an existing support vector machine and the multi-margin support vector machine used in Embodiment 1.
A conceptual diagram for explaining the difference in the formulas of the learning method between an existing support vector machine and the multi-margin support vector machine used in Embodiment 1.
A conceptual diagram for explaining the difference in the formulas of the learning method between an existing support vector machine and the multi-margin support vector machine used in Embodiment 1.
A conceptual diagram for explaining the difference in the formulas of the learning method between an existing support vector machine and the multi-margin support vector machine used in Embodiment 1.
A conceptual diagram for explaining the problem that the optimum bias becomes indeterminate in the multi-margin support vector machine used in Embodiment 1.
A conceptual diagram for explaining the positioning of the bias at the center in the multi-margin support vector machine used in Embodiment 1.
A functional block diagram showing details of the configuration of the support vector machine used in Embodiment 1.
A conceptual diagram for explaining a case in which linear separation is impossible with the support vector machine used in Embodiment 1.
A conceptual diagram for explaining a case in which the support vector machine used in Embodiment 1 is relaxed by soft margining.
A conceptual diagram for explaining the learning method when the support vector machine used in Embodiment 1 is relaxed by soft margining.
A conceptual diagram for explaining how the learning problem is solved when the support vector machine used in Embodiment 1 is relaxed by soft margining.
A flowchart for explaining the operation of the biological information processing apparatus according to Embodiment 1.
A conceptual diagram for explaining the operational effect of the biological information processing apparatus according to Embodiment 1.
A functional block diagram showing details of the configuration of the classification criterion generation unit used in Embodiment 2.
A conceptual diagram for explaining the operation of the kernel trick in the classification criterion generation unit used in Embodiment 2.
A functional block diagram showing details of the configurations of the conversion unit and the separation unit used in Embodiment 2.
A conceptual diagram showing an example of the kernel function used in Embodiment 2.
A conceptual diagram showing an example of the mathematical expression of the learning method using the kernel function used in Embodiment 2.
A conceptual diagram showing an example of the mathematical expression used to calculate the bias when soft margining is applied in the learning method using the kernel function used in Embodiment 2.
A flowchart for explaining the operation of the biological information processing apparatus according to Embodiment 2.
A functional block diagram showing details of the configuration of the classification criterion generation unit used in a modification of Embodiment 2.
A flowchart for explaining the operation of the biological information processing apparatus according to the modification of Embodiment 2.
A conceptual diagram for explaining an outline of a classification prediction system for ordered medical observation data using the biological information processing apparatus according to Embodiments 1 and 2.
A diagram showing an example of prediction rates obtained by cross-validation to confirm the reliability of the biological information processing apparatus according to the Example.

Explanation of symbols

100 Biological information processing apparatus
102 Known information acquisition unit
104 Known information storage unit
106 Classification criterion generation unit
108 Classification criterion storage unit
110 Unknown information acquisition unit
112 Classification prediction determination unit
114 Classification prediction storage unit
116 Output unit
118 Feature extraction unit
120 Cross-validation unit
122 Estimated prediction rate storage unit
202 Biological information acquisition unit
204 Biological information storage unit
206 Input vector generation unit
208 Input vector storage unit
210 Classification acquisition unit
212 Classification storage unit
214 Separation unit
216 Separation plane storage unit
218 Classification criterion output unit
302 Conversion unit
304 Feature vector storage unit
402 Inverse transformation unit
408 Inversely transformed separation plane storage unit
502 Input vector acquisition unit
504 Nonlinear conversion unit
602 Input vector acquisition unit
604 Classification acquisition unit
606 Support vector machine
608 Separation plane output unit
702 Parameter vector setting unit
704 Bias setting unit
706 Soft margin unit

Claims

A biological information processing apparatus for predicting the classification of biological information,
A known information acquisition unit for acquiring a plurality of known biological information classified into three or more classes;
An input vector generation unit that generates a plurality of input vectors including the plurality of biological information acquired by the known information acquisition unit in an input space;
A separation unit that, based on the plurality of input vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of input vectors, generates two or more parallel separation planes in the input space and separates the input space into three or more regions respectively corresponding to the three or more classes;
An unknown information acquisition unit for acquiring biological information whose classification is unknown;
An unknown vector generation unit that generates an unknown vector including the biological information acquired by the unknown information acquisition unit in the input space separated into the three or more regions by the two or more separation surfaces;
A prediction determination unit that predicts and determines the classification of the biological information corresponding to the unknown vector, based on the region in which the unknown vector is located in the input space separated into the three or more regions by the two or more separation planes;
A prediction classification output unit that outputs a classification of the biological information corresponding to the unknown vector predicted and determined by the prediction determination unit;
A biological information processing apparatus comprising:

A biological information processing apparatus for predicting the classification of biological information,
A known information acquisition unit for acquiring a plurality of known biological information classified into three or more classes;
An input vector generation unit for generating a plurality of input vectors including the plurality of biological information acquired by the known information acquisition unit in an input space;
A conversion unit that generates a plurality of feature vectors respectively corresponding to the plurality of input vectors in a feature space of higher dimension than the input space, by converting the plurality of input vectors in the input space by nonlinear mapping;
A separation unit that, based on the plurality of feature vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of feature vectors, generates two or more parallel separation planes in the feature space and separates the feature space into three or more regions respectively corresponding to the three or more classes;
An unknown information acquisition unit for acquiring biological information whose classification is unknown;
An unknown vector generation unit that, by converting an unknown vector including the biological information acquired by the unknown information acquisition unit by the nonlinear mapping, generates a converted unknown vector corresponding to the unknown vector in the feature space separated into the three or more regions by the two or more separation planes;
A prediction determination unit that predicts and determines the classification of the biological information corresponding to the converted unknown vector, based on the region in which the converted unknown vector is located in the feature space separated into the three or more regions by the two or more separation planes;
A predicted classification output unit that outputs a classification of the biological information corresponding to the transformed unknown vector predicted and determined by the prediction determination unit;
A biological information processing apparatus comprising:

A biological information processing apparatus for predicting the classification of biological information,
A known information acquisition unit for acquiring a plurality of known biological information classified into three or more classes;
An input vector generation unit for generating a plurality of input vectors including the plurality of biological information acquired by the known information acquisition unit in an input space;
A conversion unit that generates a plurality of feature vectors respectively corresponding to the plurality of input vectors in a feature space of higher dimension than the input space, by converting the plurality of input vectors in the input space by nonlinear mapping;
A separation unit that, based on the plurality of feature vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of feature vectors, generates two or more parallel separation planes in the feature space and separates the feature space into three or more regions respectively corresponding to the three or more classes;
An inverse transformation unit that generates, in the input space, two or more separation planes that separate the input space into three or more regions, by applying the inverse transformation of the nonlinear mapping to the two or more separation planes generated by the separation unit;
An unknown information acquisition unit for acquiring biological information whose classification is unknown;
An unknown vector generation unit that generates an unknown vector including the biological information acquired by the unknown information acquisition unit in the input space separated into the three or more regions by the two or more separation surfaces;
A prediction determination unit that predicts and determines the classification of the biological information corresponding to the unknown vector, based on the region in which the unknown vector is located in the input space separated into the three or more regions by the two or more separation planes;
A prediction classification output unit that outputs a classification of the biological information corresponding to the unknown vector predicted and determined by the prediction determination unit;
A biological information processing apparatus comprising:

The biological information processing apparatus according to claim 2 or 3,
wherein the conversion unit is configured to perform the calculation for converting the input vector into the feature vector using a kernel function that satisfies positive semidefiniteness.

The biological information processing apparatus according to claim 4,
wherein the kernel function is one or more kernel functions selected from the group consisting of a linear kernel function, a polynomial kernel function, and an RBF kernel function.
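For illustration only, the three kernel families named above and a numerical check of positive semidefiniteness might look as follows. The function names, the default hyperparameters (`degree`, `coef0`, `gamma`) and the eigenvalue tolerance are assumptions chosen for this sketch and are not taken from the specification.

```python
import numpy as np

def linear_kernel(x, y):
    return float(np.dot(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    # (x . y + c)^d with illustrative defaults
    return float((np.dot(x, y) + coef0) ** degree)

def rbf_kernel(x, y, gamma=0.5):
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def gram_matrix(X, kernel):
    """Matrix of kernel values K[i, j] = kernel(X[i], X[j])."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_positive_semidefinite(K, tol=1e-9):
    # A symmetric matrix is positive semidefinite when no eigenvalue is
    # negative; tol absorbs floating-point round-off.
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
for k in (linear_kernel, polynomial_kernel, rbf_kernel):
    print(k.__name__, is_positive_semidefinite(gram_matrix(X, k)))
```

A Gram matrix built from a valid kernel passes this check on any finite sample, which is what allows inner products in the feature space to be replaced by kernel evaluations.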

The biological information processing apparatus according to any one of claims 1 to 5,
wherein the linear separation unit is configured to generate the two or more parallel separation planes using a support vector machine.

The biological information processing apparatus according to claim 6,
wherein the support vector machine is configured to maximize the margin width of the separation plane having the smallest margin width among the two or more parallel separation planes.

The biological information processing apparatus according to claim 6 or 7,
wherein the separation unit is configured to position the separation plane at the center of the margin width.
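One way to picture the centering of a separation plane within its margin is shown below. This is a sketch under stated assumptions: the parameter vector `w` is taken as already learned, classes are ordered integer labels, the data are separable along `w`, and the helper name `center_biases` is hypothetical.

```python
import numpy as np

def center_biases(X, y, w):
    """Return one bias per pair of adjacent classes, placed midway
    between the closest scores of the two classes.

    Assumes ordered integer labels and data separable along w.
    """
    scores = np.asarray(X, dtype=float) @ np.asarray(w, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)  # sorted class labels
    biases = []
    for lower, upper in zip(classes[:-1], classes[1:]):
        hi = scores[y == lower].max()   # highest score in the lower class
        lo = scores[y == upper].min()   # lowest score in the upper class
        if hi >= lo:
            raise ValueError("classes are not separable along w")
        biases.append(0.5 * (hi + lo))  # midpoint of the gap
    return np.array(biases)

X = [[0.0], [1.0], [3.0], [4.0], [8.0], [10.0]]
y = [0, 0, 1, 1, 2, 2]
print(center_biases(X, y, w=[1.0]).tolist())
```

Placing each bias at the midpoint removes the indeterminacy that arises when any value inside the gap would separate the training data equally well.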

The biological information processing apparatus according to any one of claims 6 to 8,
wherein the support vector machine is a support vector machine extended by a soft margin method.
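As a rough illustration of what a soft-margin extension can look like for parallel planes, the sketch below evaluates a hinge-style objective. It is not the formulation given in the embodiments (which is presented in the drawings and not reproduced here); the unit functional margin, the penalty weight `C`, the 0-based ordered labels and the function name `soft_margin_objective` are all assumptions.

```python
import numpy as np

def soft_margin_objective(w, biases, X, y, C=1.0):
    """0.5 * ||w||^2 + C * (sum of slacks) for parallel planes.

    Assumed setup: class k lies between plane k-1 and plane k, and a
    sample incurs slack when it is closer than 1 (in score units) to
    either neighbouring plane or on the wrong side of it.
    """
    w = np.asarray(w, dtype=float)
    b = np.asarray(biases, dtype=float)   # assumed sorted ascending
    scores = np.asarray(X, dtype=float) @ w
    n_classes = len(b) + 1
    slack = 0.0
    for s, k in zip(scores, np.asarray(y)):
        if k > 0:                          # plane just below class k
            slack += max(0.0, 1.0 - (s - b[k - 1]))
        if k < n_classes - 1:              # plane just above class k
            slack += max(0.0, 1.0 - (b[k] - s))
    return 0.5 * float(w @ w) + C * slack

X = [[0.0], [1.5], [3.0], [4.0], [8.0], [10.0]]
y = [0, 0, 1, 1, 2, 2]
print(soft_margin_objective([1.0], [2.0, 6.0], X, y, C=1.0))
```

Larger values of `C` penalize margin violations more heavily, while smaller values tolerate them in exchange for a smaller norm of `w`.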

The biological information processing apparatus according to any one of claims 1 to 9,
wherein the two or more parallel separation planes are defined by a score function represented by the following formula:
b = w^T x
(where x is an input vector or a feature vector, w is a parameter vector, b is a bias, and T denotes transposition),
w is a parameter vector that is the same for the two or more separation planes, and
b is a bias that takes a different value for each of the two or more separation planes.
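The geometry implied by a shared parameter vector and distinct biases can be written out as a short worked note. The indexing of the biases, the conventions b_0 = -∞ and b_K = +∞, and the assignment of boundary points to the lower region are notational assumptions introduced here, not part of the claim.

```latex
% Assumed notation: K classes, biases b_1 < b_2 < ... < b_{K-1}, shared w.
% Region (predicted class) of a vector x:
\[
  \hat{y}(x) = k
  \quad\text{if}\quad
  b_{k-1} < w^{T}x \le b_{k},
  \qquad b_{0} = -\infty,\; b_{K} = +\infty .
\]
% All planes w^T x = b_k have the same normal vector w, so they are parallel.
% The distance between two adjacent planes is
\[
  \operatorname{dist}\bigl(\{x : w^{T}x = b_{k}\},\, \{x : w^{T}x = b_{k+1}\}\bigr)
  = \frac{\lvert b_{k+1} - b_{k} \rvert}{\lVert w \rVert} .
\]
```

Since the normal vector is common to all planes, only K - 1 additional scalars are needed beyond w to describe K regions.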

A biological information processing device for generating classification criteria for classifying biological information,
A known information acquisition unit for acquiring a plurality of known biological information classified into three or more classes;
An input vector generation unit that generates a plurality of input vectors including the plurality of biological information acquired by the known information acquisition unit in an input space;
A separation unit that, based on the plurality of input vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of input vectors, generates two or more parallel separation planes in the input space and separates the input space into three or more regions respectively corresponding to the three or more classes;
A classification criterion output unit that outputs the classification criterion including information defining the two or more separation planes generated by the separation unit;
A biological information processing apparatus comprising:

A biological information processing device for generating classification criteria for classifying biological information,
A known information acquisition unit for acquiring a plurality of known biological information classified into three or more classes;
An input vector generation unit that generates a plurality of input vectors including the plurality of biological information acquired by the known information acquisition unit in an input space;
A conversion unit that generates a plurality of feature vectors respectively corresponding to the plurality of input vectors in a feature space of higher dimension than the input space, by converting the plurality of input vectors in the input space by nonlinear mapping;
A separation unit that, based on the plurality of feature vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of feature vectors, generates two or more parallel separation planes in the feature space and separates the feature space into three or more regions respectively corresponding to the three or more classes;
A classification criterion output unit that outputs the classification criterion including information defining the two or more separation planes generated by the separation unit;
A biological information processing apparatus comprising:

A biological information processing device for generating classification criteria for classifying biological information,
A known information acquisition unit for acquiring a plurality of known biological information classified into three or more classes;
An input vector generation unit that generates a plurality of input vectors including the plurality of biological information acquired by the known information acquisition unit in an input space;
A conversion unit that generates a plurality of feature vectors respectively corresponding to the plurality of input vectors in a feature space of higher dimension than the input space, by converting the plurality of input vectors in the input space by nonlinear mapping;
A separation unit that, based on the plurality of feature vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of feature vectors, generates two or more parallel separation planes in the feature space and separates the feature space into three or more regions respectively corresponding to the three or more classes;
An inverse transformation unit that generates, in the input space, two or more separation planes that separate the input space into three or more regions, by applying the inverse transformation of the nonlinear mapping to the two or more separation planes generated by the separation unit;
A classification criterion output unit that outputs the classification criterion including information defining the two or more separation planes generated by the inverse transform unit;
A biological information processing apparatus comprising:

The biological information processing apparatus according to claim 12 or 13,
wherein the conversion unit is configured to perform the calculation for converting the input vector into the feature vector using a kernel function that satisfies positive semidefiniteness.

The biological information processing apparatus according to claim 14,
wherein the kernel function is one or more kernel functions selected from the group consisting of a linear kernel function, a polynomial kernel function, and an RBF kernel function.

The biological information processing apparatus according to any one of claims 11 to 15,
wherein the linear separation unit is configured to generate the two or more parallel separation planes using a support vector machine.

The biological information processing apparatus according to claim 16,
wherein the support vector machine is configured to maximize the margin width of the separation plane having the smallest margin width among the two or more parallel separation planes.

The biological information processing apparatus according to claim 16 or 17,
wherein the separation unit positions the separation plane at the center of the margin width.

The biological information processing apparatus according to any one of claims 16 to 18,
wherein the support vector machine is a support vector machine extended by a soft margin method.

The biological information processing apparatus according to any one of claims 11 to 19,
wherein the two or more parallel separation planes are defined by a score function represented by the following formula:
b = w^T x
(where x is an input vector or a feature vector, w is a parameter vector, b is a bias, and T denotes transposition),
w is a parameter vector that is the same for the two or more separation planes, and
b is a bias that takes a different value for each of the two or more separation planes.

A biological information processing method for predicting a classification of biological information,
Obtaining a plurality of known biological information classified into three or more classes;
Generating, in an input space, a plurality of input vectors including the plurality of biological information obtained by obtaining the known biological information;
Generating two or more parallel separation planes in the input space, based on the plurality of input vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of input vectors, and separating the input space into three or more regions respectively corresponding to the three or more classes;
Obtaining biological information with unknown classification;
Generating an unknown vector including the biological information obtained in the step of obtaining biological information whose classification is unknown, in the input space separated into the three or more regions by the two or more separation planes;
Predicting and determining the classification of the biological information corresponding to the unknown vector, based on the region in which the unknown vector is located in the input space separated into the three or more regions by the two or more separation planes;
Outputting the classification of the biological information corresponding to the unknown vector, as predicted and determined in the predicting and determining step;
A biological information processing method comprising:

A biological information processing method for predicting a classification of biological information,
Obtaining a plurality of known biological information classified into three or more classes;
Generating, in an input space, a plurality of input vectors including the plurality of biological information obtained by obtaining the known biological information;
Generating a plurality of feature vectors respectively corresponding to the plurality of input vectors in a feature space of higher dimension than the input space, by transforming the plurality of input vectors in the input space by nonlinear mapping;
Generating two or more parallel separation planes in the feature space, based on the plurality of feature vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of feature vectors, and separating the feature space into three or more regions respectively corresponding to the three or more classes;
Obtaining biological information with unknown classification;
Generating, by converting by the nonlinear mapping an unknown vector including the biological information obtained in the step of obtaining biological information whose classification is unknown, a converted unknown vector corresponding to the unknown vector in the feature space separated into the three or more regions by the two or more separation planes;
Predicting and determining the classification of the biological information corresponding to the converted unknown vector, based on the region in which the converted unknown vector is located in the feature space separated into the three or more regions by the two or more separation planes;
Outputting the classification of the biological information corresponding to the converted unknown vector, as predicted and determined in the predicting and determining step;
A biological information processing method comprising:

A biological information processing method for predicting a classification of biological information,
Obtaining a plurality of known biological information classified into three or more classes;
Generating, in an input space, a plurality of input vectors including the plurality of biological information obtained by obtaining the known biological information;
Generating a plurality of feature vectors respectively corresponding to the plurality of input vectors in a feature space of higher dimension than the input space, by transforming the plurality of input vectors in the input space by nonlinear mapping;
Generating two or more parallel separation planes in the feature space, based on the plurality of feature vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of feature vectors, and separating the feature space into three or more regions respectively corresponding to the three or more classes;
Generating, in the input space, two or more separation planes that separate the input space into three or more regions, by applying the inverse transformation of the nonlinear mapping to the two or more separation planes generated in the separating step;
Obtaining biological information with unknown classification;
Generating an unknown vector including the biological information obtained in the step of obtaining biological information whose classification is unknown, in the input space separated into the three or more regions by the two or more separation planes;
Predicting and determining the classification of the biological information corresponding to the unknown vector, based on the region in which the unknown vector is located in the input space separated into the three or more regions by the two or more separation planes;
Outputting the classification of the biological information corresponding to the unknown vector, as predicted and determined in the predicting and determining step;
A biological information processing method comprising:

A biological information processing method for generating a classification criterion for classifying biological information,
Obtaining a plurality of known biological information classified into three or more classes;
Generating, in an input space, a plurality of input vectors including the plurality of biological information obtained by obtaining the known biological information;
Generating two or more parallel separation planes in the input space, based on the plurality of input vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of input vectors, and separating the input space into three or more regions respectively corresponding to the three or more classes;
Outputting the classification criterion including information defining the two or more separation planes generated in the separating step;
A biological information processing method including:

A biological information processing method for generating a classification criterion for classifying biological information,
Obtaining a plurality of known biological information classified into three or more classes;
Generating, in an input space, a plurality of input vectors including the plurality of biological information obtained by obtaining the known biological information;
Generating a plurality of feature vectors respectively corresponding to the plurality of input vectors in a feature space of higher dimension than the input space, by transforming the plurality of input vectors in the input space by nonlinear mapping;
Generating two or more parallel separation planes in the feature space, based on the plurality of feature vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of feature vectors, and separating the feature space into three or more regions respectively corresponding to the three or more classes;
Outputting the classification criterion including information defining the two or more separation planes generated in the separating step;
A biological information processing method including:

A biological information processing method for generating a classification criterion for classifying biological information,
Obtaining a plurality of known biological information classified into three or more classes;
Generating, in an input space, a plurality of input vectors including the plurality of biological information obtained by obtaining the known biological information;
Generating a plurality of feature vectors respectively corresponding to the plurality of input vectors in a feature space of higher dimension than the input space, by transforming the plurality of input vectors in the input space by nonlinear mapping;
Generating two or more separation planes in the feature space, based on the plurality of feature vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of feature vectors, and separating the feature space into three or more regions respectively corresponding to the three or more classes;
Generating, in the input space, two or more parallel separation planes that separate the input space into three or more regions, by applying the inverse transformation of the nonlinear mapping to the two or more separation planes generated in the separating step;
Outputting the classification criterion including information defining the two or more separation planes generated in the step of generating separation planes in the input space by the inverse transformation;
A biological information processing method including:

A biological information processing program for causing a computer to execute a process of predicting a classification of biological information,
Obtaining a plurality of known biological information classified into three or more classes;
Generating, in an input space, a plurality of input vectors including the plurality of biological information obtained by obtaining the known biological information;
Generating two or more parallel separation planes in the input space, based on the plurality of input vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of input vectors, and separating the input space into three or more regions respectively corresponding to the three or more classes;
Obtaining biological information with unknown classification;
Generating an unknown vector including the biological information obtained in the step of obtaining biological information whose classification is unknown, in the input space separated into the three or more regions by the two or more separation planes;
Predicting and determining the classification of the biological information corresponding to the unknown vector, based on the region in which the unknown vector is located in the input space separated into the three or more regions by the two or more separation planes;
Outputting a classification of the biological information corresponding to the unknown vector predicted and determined by the predicting and determining step;
A biological information processing program characterized by causing a computer to execute.

A biological information processing program for causing a computer to execute a process of predicting a classification of biological information, the program causing the computer to execute:
a step of acquiring a plurality of pieces of known biological information classified into three or more classes;
a step of generating, in an input space, a plurality of input vectors including the plurality of pieces of biological information acquired in the step of acquiring the known biological information;
a step of generating, in a feature space of higher dimension than the input space, a plurality of feature vectors respectively corresponding to the plurality of input vectors, by transforming the plurality of input vectors in the input space by a nonlinear mapping;
a step of generating two or more parallel separation planes in the feature space, based on the plurality of feature vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of feature vectors, thereby separating the feature space into three or more regions respectively corresponding to the three or more classes;
a step of acquiring biological information whose classification is unknown;
a step of generating, by transforming by the nonlinear mapping an unknown vector including the biological information acquired in the step of acquiring the biological information whose classification is unknown, a transformed unknown vector corresponding to the unknown vector in the feature space separated into the three or more regions by the two or more separation planes;
a step of predicting and determining the classification of the biological information corresponding to the transformed unknown vector, based on the region in which the transformed unknown vector is located in the feature space separated into the three or more regions by the two or more separation planes; and
a step of outputting the classification of the biological information corresponding to the transformed unknown vector as predicted and determined in the predicting and determining step.
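
The feature-space variant can be sketched in the same way. The explicit mapping `phi` below, which lifts a 2-D input vector into a 3-D feature space, is an assumption chosen only for illustration; the specification does not prescribe this mapping, and the normal vector and offsets are hypothetical values rather than trained ones.

```python
from bisect import bisect_right

def phi(x):
    """Hypothetical nonlinear mapping from 2-D input space to 3-D feature space."""
    x1, x2 = x
    return (x1, x2, x1 * x1 + x2 * x2)

def predict_in_feature_space(w, thresholds, x):
    """Map x by phi, then locate it among parallel planes w.z = thresholds[j]."""
    z = phi(x)
    score = sum(wi * zi for wi, zi in zip(w, z))
    return bisect_right(thresholds, score)

# With w = (0, 0, 1) the planes z3 = 1 and z3 = 4 are parallel in feature
# space; in input space they correspond to circles of radius 1 and 2.
w = (0.0, 0.0, 1.0)
thresholds = [1.0, 4.0]
for unknown in ((0.5, 0.0), (1.5, 0.0), (0.0, 3.0)):
    print(unknown, "-> region", predict_in_feature_space(w, thresholds, unknown))
```

The planes remain flat and parallel in the feature space, while the regions they induce in the input space are nonlinear (here, a disc and two surrounding rings).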

A biological information processing program for causing a computer to execute a process of predicting a classification of biological information, the program causing the computer to execute:
a step of acquiring a plurality of pieces of known biological information classified into three or more classes;
a step of generating, in an input space, a plurality of input vectors including the plurality of pieces of biological information acquired in the step of acquiring the known biological information;
a step of generating, in a feature space of higher dimension than the input space, a plurality of feature vectors respectively corresponding to the plurality of input vectors, by transforming the plurality of input vectors in the input space by a nonlinear mapping;
a step of generating two or more parallel separation planes in the feature space, based on the plurality of feature vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of feature vectors, thereby separating the feature space into three or more regions respectively corresponding to the three or more classes;
a step of generating, in the input space, two or more separation surfaces that separate the input space into three or more regions, by inversely transforming, under the nonlinear mapping, the two or more separation planes generated in the separating step;
a step of acquiring biological information whose classification is unknown;
a step of generating an unknown vector, including the biological information acquired in the step of acquiring the biological information whose classification is unknown, in the input space separated into the three or more regions by the two or more separation surfaces;
a step of predicting and determining the classification of the biological information corresponding to the unknown vector, based on the region in which the unknown vector is located in the input space separated into the three or more regions by the two or more separation surfaces; and
a step of outputting the classification of the biological information corresponding to the unknown vector as predicted and determined in the predicting and determining step.
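
The inverse transformation of a separation plane can be made concrete for one simple case. The sketch assumes a hypothetical mapping phi(x) = (x1, x2, x1² + x2²), which the specification does not prescribe; under that assumption, a feature-space plane a1·z1 + a2·z2 + c·z3 = b with c > 0 pulls back to a circle in the input space, obtained by completing the square. The function name `plane_to_circle` is likewise an assumption.

```python
import math

def plane_to_circle(w, b):
    """Pull a feature-space plane w.z = b back to a circle in input space.

    Assumes the hypothetical mapping phi(x) = (x1, x2, x1**2 + x2**2) and
    w = (a1, a2, c) with c > 0. Returns ((center_x1, center_x2), radius).
    """
    a1, a2, c = w
    if c <= 0:
        raise ValueError("this sketch requires c > 0")
    center = (-a1 / (2 * c), -a2 / (2 * c))
    radius_sq = b / c + (a1 * a1 + a2 * a2) / (4 * c * c)
    if radius_sq < 0:
        raise ValueError("the plane does not intersect the image of phi")
    return center, math.sqrt(radius_sq)

# Two parallel planes (same w, different offsets b) become concentric circles.
w = (2.0, -4.0, 1.0)
for b in (4.0, 11.0):
    print(b, plane_to_circle(w, b))
```

Since the planes share the normal vector w, the resulting circles share a center and differ only in radius, so the input space is divided into nested regions.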

A biological information processing program for generating a classification criterion used to cause a computer to execute a process of classifying biological information, the program causing the computer to execute:
a step of acquiring a plurality of pieces of known biological information classified into three or more classes;
a step of generating, in an input space, a plurality of input vectors including the plurality of pieces of biological information acquired in the step of acquiring the known biological information;
a step of generating two or more parallel separation planes in the input space, based on the plurality of input vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of input vectors, thereby separating the input space into three or more regions respectively corresponding to the three or more classes; and
a step of outputting the classification criterion, which includes information defining the two or more separation planes generated in the separating step.
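
One possible form of the output classification criterion is sketched below. It assumes that the information defining the parallel separation planes consists of a shared normal vector and one offset per plane, and that JSON is used as the output format; the field names, the function names, and the example class labels are all hypothetical and do not come from the specification.

```python
import json

def export_criterion(normal, thresholds, class_labels):
    """Serialize parallel separation planes normal.x = thresholds[j] as JSON."""
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    if len(class_labels) != len(thresholds) + 1:
        raise ValueError("K-1 planes are needed to define K class regions")
    return json.dumps({
        "normal": list(normal),
        "thresholds": list(thresholds),
        "class_labels": list(class_labels),
    })

def load_criterion(text):
    """Read a criterion produced by export_criterion back into a dict."""
    return json.loads(text)

# Hypothetical three-class criterion defined by two parallel planes.
text = export_criterion([1.0, 1.0], [1.0, 3.0], ["low", "medium", "high"])
print(text)
print(load_criterion(text)["thresholds"])
```

Storing a single normal vector together with a list of offsets keeps the criterion compact and makes the parallelism of the planes explicit.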

A biological information processing program for generating a classification criterion used to cause a computer to execute a process of classifying biological information, the program causing the computer to execute:
a step of acquiring a plurality of pieces of known biological information classified into three or more classes;
a step of generating, in an input space, a plurality of input vectors including the plurality of pieces of biological information acquired in the step of acquiring the known biological information;
a step of generating, in a feature space of higher dimension than the input space, a plurality of feature vectors respectively corresponding to the plurality of input vectors, by transforming the plurality of input vectors in the input space by a nonlinear mapping;
a step of generating two or more parallel separation planes in the feature space, based on the plurality of feature vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of feature vectors, thereby separating the feature space into three or more regions respectively corresponding to the three or more classes; and
a step of outputting the classification criterion, which includes information defining the two or more separation planes generated in the separating step.

A biological information processing program for generating a classification criterion used to cause a computer to execute a process of classifying biological information, the program causing the computer to execute:
a step of acquiring a plurality of pieces of known biological information classified into three or more classes;
a step of generating, in an input space, a plurality of input vectors including the plurality of pieces of biological information acquired in the step of acquiring the known biological information;
a step of generating, in a feature space of higher dimension than the input space, a plurality of feature vectors respectively corresponding to the plurality of input vectors, by transforming the plurality of input vectors in the input space by a nonlinear mapping;
a step of generating two or more parallel separation planes in the feature space, based on the plurality of feature vectors and the classifications of the plurality of pieces of biological information respectively corresponding to the plurality of feature vectors, thereby separating the feature space into three or more regions respectively corresponding to the three or more classes;
a step of generating, in the input space, two or more separation surfaces that separate the input space into three or more regions, by inversely transforming, under the nonlinear mapping, the two or more separation planes generated in the separating step; and
a step of outputting the classification criterion, which includes information defining the two or more separation surfaces generated in the input space by the inverse transformation.