JP2020025471A

JP2020025471A - Toxicity learning device, toxicity learning method, trained model, toxicity prediction device and program

Info

Publication number: JP2020025471A
Application number: JP2018150286A
Authority: JP
Inventors: Katsuhisa Horimoto (堀本 勝久); Kazuhiko Fukui (福井 一彦)
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2018-08-09
Filing date: 2018-08-09
Publication date: 2020-02-20

Abstract

PROBLEM TO BE SOLVED: To provide a new method for predicting the toxicity of a compound. SOLUTION: A toxicity learning device 10 includes an input unit 11 that receives expression data of a sample exposed to a compound and expression data of a control; a comparison unit 13 that compares the expression data of the sample and the control for each predetermined gene; an encoding unit 14 that encodes the gene expression data based on the difference in the expression data; a labeling unit 15 that attaches a label of the compound's toxicity to the encoded expression data; and a model learning unit 16 that uses the labeled teacher data to train a model for predicting the toxicity of a compound from gene expression data. SELECTED DRAWING: FIG. 2

Description

The present invention relates to a toxicity learning device, a toxicity learning method, a trained model, a toxicity prediction device, and a program.

Drug development requires enormous cost and a long research and development period, yet the probability that a drug successfully reaches the market is by no means high. One cause is the inability to sufficiently ensure safety in humans. If toxicity could be identified at an early stage of drug development, the success rate of drug development would be expected to improve. For this reason, research on predicting the safety of pharmaceuticals has long been conducted.

Non-Patent Document 1 focuses on a spheroid culture system using primary hepatocytes and discloses an evaluation of the hepatotoxicity of acetaminophen using a three-dimensional hepatocyte culture system. Software that predicts toxicity based on quantitative structure-activity relationships (QSAR) is also commercially available (Non-Patent Document 2). In addition, research has been published that focuses on the structure of a compound and attempts to determine its toxicity by deep learning (Non-Patent Document 3).

Non-Patent Document 1: Seigo Sanoh (佐能正剛), "Study on in vitro and in vivo prediction evaluation systems for hepatotoxicity and metabolites of pharmaceuticals in humans," YAKUGAKU ZASSHI 135(11), 1273-1279 (2015)
Non-Patent Document 2: QSAR toxicity prediction software Leadscope(R) Model Applier Genetic Toxicity Suite, ITOCHU Techno-Solutions Corporation, 2017, Internet <URL: http://ls.ctc-g.co.jp/products/leadscope/files/LeadscopeModelApplier_2017.pdf>
Non-Patent Document 3: Andreas Mayr et al., "DeepTox: Toxicity Prediction using Deep Learning," Frontiers in Environmental Science, February 2016

An object of the present invention is to provide a new method for predicting the toxicity of a compound.

To predict the toxicity of a compound, the present invention uses expression data from a sample exposed to the compound. Toxicity is predicted using a trained model obtained in advance by machine learning. The question that arises is how to prepare the teacher data used to train the model. Non-Patent Document 3 focuses on the structure of the compound, whereas the present invention adopts a different approach.

The toxicity learning device of the present invention includes: an input unit that receives expression data of a sample exposed to a compound and expression data of a control; a comparison unit that compares the expression data of the sample and the control for each predetermined gene; an encoding unit that encodes the expression data of the genes based on the difference in the expression data; a labeling unit that attaches a label of the toxicity of the compound to the encoded expression data; and a model learning unit that uses the labeled teacher data to train a model for predicting the toxicity of a compound from gene expression data.

In this way, teacher data can be generated by encoding the per-gene difference in expression data obtained from comparing the compound-exposed sample with the control, and by attaching a label based on existing knowledge of the compound's toxicity. Using this teacher data, a model for inferring the toxicity of a compound from expression data can be generated.

In the toxicity learning device of the present invention, the encoding unit may rank the genes based on the difference in the expression data and assign "1" to a predetermined number of top-ranked genes, "-1" to the same predetermined number of bottom-ranked genes, and "0" to the remaining genes. The predetermined number is arbitrary. For example, it may be defined as a percentage of the total number of genes, preferably 5%, and more preferably 1 to 2%.

Encoding the expression data as 1, 0, and -1 in this way makes the data easier to learn from by machine learning. Furthermore, because the same predetermined number is used for the top and the bottom of the ranking, the ranges that take non-zero values are symmetric, which allows the model to be trained appropriately.

In the toxicity learning device of the present invention, the comparison unit may obtain a plurality of comparison results by comparing the expression data for each gene using the following methods (i) to (iii).
(i) Take the difference between the expression data of the sample and the control.
(ii) Take the ratio of the expression data of the sample to that of the control.
(iii) Normalize the expression data of the sample and the control, and then take the difference.

In machine learning research based on data from experiments (measured values) on living organisms, it is difficult to obtain a large amount of teacher data. The present invention exploits the fluctuation inherent in the technique used to acquire expression data (for example, microarray analysis) and compares the data using the different calculation methods (i) to (iii), thereby increasing the amount of comparison-result data and enabling appropriate learning.

In the toxicity learning device of the present invention, the labeling unit may use, as the label, toxicity data read from a database that stores the toxicity of existing compounds. Alternatively, the toxicity learning device of the present invention may include a label generation unit that reads side-effect data of a compound from a database storing the side effects of existing compounds and derives toxicity data for the compound from the side-effect data thus read, and the labeling unit may use the derived toxicity data as the label.

The toxicity learning method of the present invention includes the steps of: inputting expression data of a sample exposed to a compound and expression data of a control; comparing the expression data of the sample and the control for each gene; encoding the expression data of the genes based on the difference in the expression data; generating teacher data by attaching a label of the toxicity of the compound to the encoded expression data; and using the teacher data to train a model for predicting the toxicity of a compound from gene expression data. The toxicity learning method of the present invention may have the various features of the toxicity learning device described above.

In the toxicity learning method of the present invention, the inputting step may receive expression data of a plurality of samples exposed to one known compound, and the comparing step may compare the expression data of each of the plurality of samples with that of the control for each gene, thereby obtaining a plurality of comparison results for the one compound. In experiments on living organisms, it is usual to average a plurality of measurements in order to reduce experimental fluctuation. With the configuration of the present invention, however, the plurality of measurements are treated as data from independent experiments, so that the amount of teacher data can be increased.
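The contrast between averaging replicate measurements and keeping them as independent comparison results might be sketched as follows. This is an illustration only: the function names and the toy values are hypothetical, and the simple difference is used as the comparison.

```python
def per_replicate_differences(replicates, control):
    """Keep each replicate separate: one comparison vector per replicate."""
    return [[s - c for s, c in zip(rep, control)] for rep in replicates]


def averaged_difference(replicates, control):
    """Conventional alternative: average the replicates first, giving one vector."""
    n = len(replicates)
    mean_sample = [sum(column) / n for column in zip(*replicates)]
    return [m - c for m, c in zip(mean_sample, control)]


# Two replicate measurements of two genes for the same compound (toy values).
replicates = [[3.0, 5.0], [5.0, 9.0]]
control = [1.0, 1.0]

rows = per_replicate_differences(replicates, control)   # two candidate teacher rows
single = averaged_difference(replicates, control)       # one candidate teacher row
```

Keeping the replicates separate yields as many comparison vectors as there are replicates, whereas averaging collapses them into one.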

The toxicity prediction device of the present invention is a device that infers the toxicity of a compound using a trained model trained by the toxicity learning method described above, and includes: an input unit that receives expression data of a sample exposed to an unknown compound; a comparison unit that compares the expression data of the sample with expression data of a control for each predetermined gene; an encoding unit that encodes the expression data of the genes based on the difference in the expression data; an inference unit that applies the encoded expression data to the trained model to infer the toxicity of the compound; and an output unit that outputs the inference result of the inference unit. With this configuration, the toxicity of a compound can be predicted from expression data obtained after exposure to the compound.

The trained model of the present invention is a trained model for causing a computer to output a value quantifying the toxicity of a compound based on expression data obtained upon exposure to the compound. Expression data encoded based on its difference from the expression data of a control is input to the input layer of a neural network, a computation based on the trained weighting coefficients of the neural network is performed on the input encoded data, and the computer is caused to output, from the output layer, a value quantifying the toxicity of the compound.

According to the present invention, teacher data can be generated for training a model that infers the toxicity of a compound from the expression data of a sample exposed to that compound.

FIG. 1(a) is a diagram showing the learning stage within the overall framework of hepatotoxicity discrimination, and FIG. 1(b) is a diagram showing the inference stage within that framework.
FIG. 2 is a diagram showing the configuration of the toxicity learning device of the first embodiment.
FIG. 3 is a diagram showing an example of the result of comparing the expression levels of a sample and a control.
FIG. 4 is a diagram showing an example in which expression data is compared for each gene.
FIG. 5 is a diagram showing an example in which expression data is encoded according to the ranks given to the genes.
FIG. 6(a) is a diagram showing an example in which the labeling unit has attached a label to each compound, and FIG. 6(b) is a diagram showing data obtained by encoding expression data.
FIG. 7 is a diagram showing the operation of the toxicity learning device of the first embodiment.
FIG. 8 is a diagram showing the configuration of the toxicity prediction device of the first embodiment.
FIG. 9 is a diagram showing the configuration of the toxicity learning device of the second embodiment.
FIG. 10(a) is a diagram showing an example of data stored in the side-effect DB, and FIG. 10(b) is a diagram showing an example in which the data stored in the side-effect DB has been processed.

Hereinafter, a toxicity learning device and a toxicity prediction device according to embodiments of the present invention will be described with reference to the drawings. The embodiments described below take as examples a toxicity prediction device that determines the hepatotoxicity of an unknown compound and a toxicity learning device that trains the model used by that toxicity prediction device. Note that the toxicity learning device and the toxicity prediction device of the present invention can also be applied to the determination of toxicity other than hepatotoxicity.

(First Embodiment)
FIG. 1 is a diagram showing the overall framework of hepatotoxicity discrimination by the toxicity learning device 10 and the toxicity prediction device 20 of the present embodiment. As shown in FIG. 1(a), in the learning stage, the toxicity learning device 10 trains a model using a large amount of teacher data and generates a trained model. The difficulty here has been deciding what to use as teacher data. In the inference stage, as shown in FIG. 1(b), the toxicity prediction device 20 uses the trained model to determine the hepatotoxicity of a new compound. In the present embodiment, hepatotoxicity is classified into three levels: "Most", "Less", and "Non". "Most" indicates the highest hepatotoxicity, and "Non" indicates no hepatotoxicity.

A neural network model is used as the trained model. The number of layers in the neural network and whether to include convolutional and pooling layers are arbitrary; however, experiments by the inventors showed that using a convolutional neural network yields a model capable of accurate inference.

FIG. 2 is a diagram showing the configuration of the toxicity learning device 10 of the first embodiment. The toxicity learning device 10 includes an input unit 11 that receives expression data, a teacher data generation unit 12 that generates teacher data from the input data, and a model learning unit 16 that trains a model using the teacher data.

The input unit 11 receives expression data of a sample exposed to a compound and expression data of a control. Here, the compound to which the sample is exposed is a known compound. The expression data may be gene expression data or protein expression data. In the present embodiment, the expression data is data acquired with a microarray.

The teacher data generation unit 12 includes a comparison unit 13, an encoding unit 14, and a labeling unit 15. The comparison unit 13 has a function of comparing the expression data of the sample with the expression data of the control. The comparison unit 13 obtains a plurality of comparison results by comparing the expression data for each gene using the following methods (i) to (iii).
(i) Take the difference between the expression data of the sample and the expression data of the control.
(ii) Take the ratio of the expression data of the sample to the expression data of the control.
(iii) Normalize the expression data of the sample and the expression data of the control separately, and then take the difference.

Performing the comparison in these three ways triples the amount of sample-versus-control comparison data relative to the original sample data. The toxicity learning device 10 can therefore use three times the amount of input data for learning.
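As an illustration only, the three comparisons (i) to (iii) might be sketched as follows. The description does not specify the normalization used in (iii); z-score standardization is assumed here, and the names `compare`, `zscore`, and `eps` (a small constant guarding against division by zero) are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize to zero mean and unit variance (an assumed normalization)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def compare(sample, control, eps=1e-9):
    """Return the three comparison vectors for one sample against the control."""
    diff = [s - c for s, c in zip(sample, control)]             # method (i)
    ratio = [s / (c + eps) for s, c in zip(sample, control)]    # method (ii)
    zs, zc = zscore(sample), zscore(control)
    cr = [a - b for a, b in zip(zs, zc)]                        # method (iii)
    return {"Diff": diff, "Ratio": ratio, "CR": cr}


# One input sample yields three comparison vectors.
result = compare([110.0, 200.0, 50.0], [100.0, 100.0, 100.0])
```

Each of the three vectors can then be processed as a separate training example, which is how one input sample yields three times the data.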

FIG. 3 is a diagram showing an example of the result of comparing the expression levels of a sample and a control. "Diff" is the result of taking the difference of the expression data (method (i) above), "Ratio" is the result of taking the ratio of the expression data (method (ii) above), and "CR" (short for Change Ratio) is the result of taking the difference after normalization (method (iii) above).

"Probe" on the vertical axis indicates the probe number of the microarray, and "Drug" on the horizontal axis indicates the compound to which the sample was exposed. For example, for the sample exposed to "Drug1", the difference for Probe1 is 10, the difference for Probe2 is 1253, the difference for Probe3 is 324, and so on.

In the example shown in FIG. 3, "Probe" on the vertical axis denotes a probe of the microarray from which the expression data was acquired, but the comparison unit 13 of the present embodiment compares the sample and the control gene by gene. That is, the per-probe comparison results are converted into per-gene comparison results. In the present embodiment, when a plurality of probes correspond to one gene, the average of the expression data of those probes is used as the representative value of the expression data for that gene. For human genes, this narrows the comparison results down to about 20,000 genes. The comparison unit 13 may further narrow these down to about 1000 landmark genes that are generally regarded as important. Narrowing down the comparison results in this way makes the machine learning computation tractable.
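The probe-to-gene reduction described above might be sketched as follows. This is a minimal sketch, assuming a probe-to-gene mapping is available as a dictionary; the function name, the `landmarks` parameter, and the toy identifiers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def probes_to_genes(probe_values, probe_to_gene, landmarks=None):
    """Average the values of all probes that map to the same gene.

    If `landmarks` is given, only genes in that set are kept.
    """
    grouped = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:          # probes with no gene annotation are dropped
            grouped[gene].append(value)
    genes = {gene: mean(values) for gene, values in grouped.items()}
    if landmarks is not None:
        genes = {g: v for g, v in genes.items() if g in landmarks}
    return genes


probe_values = {"Probe1": 10, "Probe2": 20, "Probe3": 5}
probe_to_gene = {"Probe1": "GeneA", "Probe2": "GeneA", "Probe3": "GeneB"}

per_gene = probes_to_genes(probe_values, probe_to_gene)
only_landmarks = probes_to_genes(probe_values, probe_to_gene, landmarks={"GeneB"})
```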

FIG. 4 is a diagram showing an example in which expression data is compared for each gene. For example, for the sample exposed to "Drug1", the difference for the gene GeneID1 is 101, the difference for the gene GeneID2 is 5387, and the difference for the gene GeneID3 is 324.

The encoding unit 14 encodes the gene expression data based on the difference (or ratio) of the expression data obtained as shown in FIG. 4. Specifically, the encoding unit 14 ranks the gene IDs based on the difference (or ratio) of the expression data. The encoding unit 14 then encodes the expression data by assigning "1" to a predetermined number of top-ranked genes, "-1" to the same predetermined number of bottom-ranked genes, and "0" to the remaining genes.

FIG. 5 is a diagram showing an example in which "1", "0", and "-1" are assigned according to the ranks given to the genes. The gene GeneID1 has a large difference in expression data and is ranked high, so its expression data is encoded as "1"; the gene GeneID3 has a small difference (or a large negative value) and is ranked low, so its expression data is encoded as "-1". Converting the expression data in this way in the encoding unit 14 allows machine learning to be performed appropriately.

In the present embodiment, the expression data of the genes in the top 2% of the ranking is set to "1", that of the genes in the bottom 2% is set to "-1", and that of all other genes is set to "0". Although the top and bottom 2% are used here, the accuracy of inference with the trained model depends on how large a fraction is assigned "1" and "-1", so this fraction is preferably adjusted based on an evaluation of the trained model.
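The rank-based ternary encoding might be sketched as follows. This is a sketch under stated assumptions: the function name and the `fraction` parameter are hypothetical, ties are broken by input order, and the count is clamped so that the top and bottom groups never overlap.

```python
def encode_ternary(values, fraction=0.02):
    """Encode per-gene comparison values as 1 (top), -1 (bottom), or 0 (rest).

    `fraction` is the share of genes placed in each of the top and bottom groups.
    """
    n = len(values)
    codes = [0] * n
    # Same count for both ends, at least one gene, never more than half the genes.
    k = min(max(1, int(n * fraction)), n // 2)
    if k == 0:
        return codes
    # Indices of the genes sorted from the largest to the smallest value.
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    for i in order[:k]:
        codes[i] = 1
    for i in order[-k:]:
        codes[i] = -1
    return codes


# 100 genes with increasing values: the 2 largest get 1, the 2 smallest get -1.
codes = encode_ternary(list(range(100)), fraction=0.02)
```

Because the same count is used at both ends, the numbers of "1" and "-1" entries are always equal, which reflects the symmetry described in the text.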

The labeling unit 15 attaches a label indicating the toxicity corresponding to the compound. The toxicity learning device 10 is connected to a hepatotoxicity database (hereinafter, "hepatotoxicity DB") 30 and, based on the compound toxicity data stored in the hepatotoxicity DB 30, attaches a label of "Most", "Less", or "Non" to the compound. One example of the hepatotoxicity DB 30 is the Liver Toxicity Knowledge Base (LTKB) provided by the U.S. Food and Drug Administration (FDA). The labeling unit 15 attaches labels to compounds with reference to the LTKB.

FIG. 6(a) is a diagram showing an example in which the labeling unit 15 has attached a label to each compound. Note that, compared with FIGS. 4 and 5, the vertical and horizontal axes are interchanged in FIG. 6(a). For example, the compound Drug1 is labeled with toxicity "Most", the compound Drug2 with toxicity "Most", and the compound Drug3 with toxicity "Less". This yields teacher data in which the encoded gene expression data and the label form a set. The identity of the existing compound (the names such as "Drug1" in FIG. 6(a)) is not important for the teacher data. What is needed is the encoded expression data as shown in FIG. 6(b). In other words, the encoded gene expression data and the corresponding toxicity label constitute the teacher data.
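Assembling teacher data from encoded vectors and toxicity labels might look like the following sketch. The dictionary `toxicity_db` is a hypothetical stand-in for a lookup against a database such as the LTKB, not its actual interface, and the integer class indices are an assumption made for illustration.

```python
LABELS = ("Most", "Less", "Non")


def build_teacher_data(encoded_by_compound, toxicity_db):
    """Pair each encoded expression vector with the class index of its compound.

    The compound name is used only for the lookup and is not kept in the output.
    """
    rows = []
    for compound, vectors in encoded_by_compound.items():
        label = toxicity_db.get(compound)
        if label not in LABELS:
            continue  # no usable label: the compound cannot serve as teacher data
        for vector in vectors:
            rows.append((list(vector), LABELS.index(label)))
    return rows


encoded_by_compound = {
    "Drug1": [[1, 0, -1], [1, -1, 0]],   # e.g. vectors from different comparisons
    "Drug3": [[0, 1, -1]],
    "DrugX": [[0, 0, 0]],                # not present in the label source
}
toxicity_db = {"Drug1": "Most", "Drug3": "Less"}

teacher_data = build_teacher_data(encoded_by_compound, toxicity_db)
```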

The model learning unit 16 trains the model using the teacher data generated by the teacher data generation unit 12. The encoded expression data of the teacher data is input to the input layer of the neural network, and the weight coefficients of the neural network are learned so that the corresponding label is obtained from the output layer. By training the model on a large amount of teacher data, the model learning unit 16 generates a model for inferring toxicity from expression data. The model learning unit 16 stores the model obtained by training in the trained model storage unit 17.

FIG. 7 is a flowchart showing the operation of the toxicity learning device 10 of the first embodiment. The toxicity learning device 10 receives the expression data of a sample exposed to a compound and the expression data of a control (S10). The toxicity learning device 10 then compares the input expression data of the sample and the control (S11). Here, as described above, the difference, the ratio, and the difference after normalization of the expression data are computed.

Next, the toxicity learning device 10 compresses the data (S12). That is, it converts the microarray probe data into gene data, reducing the number of data points. At this stage, all human genes (about 20,000) may be used, or about 1000 genes likely to be related to hepatotoxicity may be used.

The toxicity learning device 10 ranks the genes based on the comparison results of the expression data (S13) and converts the expression data into encoded data based on the assigned ranks (S14). Specifically, the expression data of the genes ranked in the top 2% is set to "1", that of the genes ranked in the bottom 2% is set to "-1", and that of all other genes is set to "0". Subsequently, the toxicity learning device 10 refers to the data in the hepatotoxicity DB 30, attaches a label to the compound, and generates teacher data (S15).

The toxicity learning device 10 determines whether any unprocessed sample data remains (S16). If other sample data remains (YES in S16), the above processing is repeated (S11 to S15). If no other sample data remains (NO in S16), the toxicity learning device 10 trains the model using the large amount of teacher data generated and stores the resulting model in the trained model storage unit 17 (S17).

FIG. 8 is a diagram showing the configuration of the toxicity prediction device 20. The toxicity prediction device 20 has a trained model storage unit 28 that stores the trained model generated by training in the toxicity learning device 10. The toxicity prediction device 20 also has an input unit 21 that receives expression data for a new compound whose hepatotoxicity is to be examined, a preprocessing unit 22 that preprocesses the input expression data, an inference unit 25 that infers the presence or absence of hepatotoxicity using the preprocessed expression data, and an output unit 26 that outputs the inference result.

The preprocessing unit 22 has a comparison unit 23 and an encoding unit 24. It applies to the expression data of the new compound received by the input unit 21 the same processing as that performed by the comparison unit 13 and the encoding unit 14 of the toxicity learning device 10, converting the expression data into encoded data. A control data storage unit 27 is connected to the preprocessing unit 22, which compares the control expression data read from the control data storage unit 27 with the expression data of the new compound. Consequently, control expression data need not be input to the toxicity prediction device 20. The preprocessing unit 22 passes the encoded expression data to the inference unit 25.

The inference unit 25 reads the learned model from the learned model storage unit 28 and applies the encoded data received from the preprocessing unit 22 to the input layer of that model. The inference unit 25 thereby obtains the hepatotoxicity output from the output layer of the learned model.
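The prediction path described above (compare against stored control data, encode, apply to the input layer, read the output layer) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `preprocess` and `infer`, the use of a simple sample-minus-control difference, the single hidden layer with ReLU, and the sigmoid output are all assumptions made for the example.

```python
import numpy as np

def preprocess(sample, control, fraction=0.02):
    """Compare a sample with stored control data and encode the result.

    Assumption: the comparison is a plain per-gene difference, and genes in
    the top/bottom `fraction` of the ranking become +1/-1, all others 0.
    """
    diff = np.asarray(sample, dtype=float) - np.asarray(control, dtype=float)
    k = max(1, int(round(diff.size * fraction)))
    order = np.argsort(diff, kind="stable")  # ascending by difference
    code = np.zeros(diff.size, dtype=float)
    code[order[:k]] = -1.0   # lowest-ranked genes
    code[order[-k:]] = 1.0   # highest-ranked genes
    return code

def infer(code, w1, b1, w2, b2):
    """Forward pass of a hypothetical one-hidden-layer network.

    Returns a value in (0, 1) that is read here as a hepatotoxicity score.
    """
    hidden = np.maximum(0.0, code @ w1 + b1)  # ReLU hidden layer (assumed)
    logit = hidden @ w2 + b2
    return float(1.0 / (1.0 + np.exp(-logit)))  # sigmoid output (assumed)
```

In this sketch the control vector plays the role of the control data storage unit 27, and the weight arrays stand in for whatever is held in the learned model storage unit 28.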

The configuration of the toxicity learning device 10 and the toxicity prediction device 20 of the present embodiment has been described above. An example of the hardware of the toxicity learning device 10 and the toxicity prediction device 20 is a computer equipped with a CPU, RAM, ROM, a hard disk, a display, a keyboard, a mouse, a communication interface, and the like. The toxicity learning device 10 and the toxicity prediction device 20 are realized by storing a program having modules that implement the functions described above in the RAM or ROM and executing the program on the CPU. Such a program is also included in the scope of the present invention.

The toxicity learning device 10 of the present embodiment can generate teacher data by encoding the per-gene differences in expression data, based on the result of comparing the expression data of a sample exposed to a compound with the expression data of a control, and by attaching a label for the toxicity of that compound. Using this teacher data, it can generate a model for inferring the toxicity of a compound from expression data.

(Second embodiment)
FIG. 9 is a diagram showing the configuration of the toxicity learning device 10 of the second embodiment. In addition to the hepatotoxicity DB 30, the toxicity learning device 10 of the second embodiment uses side effect data stored in a side effect database (hereinafter, "side effect DB") 31. Its basic configuration is the same as that of the toxicity learning device 10 of the first embodiment, but it further includes a label generation unit 18 that generates labels relating to the hepatotoxicity of compounds from data read from the side effect DB 31. The labeling unit 15 uses the labels generated by the label generation unit 18 in addition to the labels based on the hepatotoxicity data stored in the hepatotoxicity DB 30.

One example of the side effect DB 31 is the FDA Adverse Event Reporting System (FAERS) provided by the U.S. Food and Drug Administration (FDA). FAERS is a spontaneous reporting system for side effect reports and contains a very large volume of report data from various reporters such as medical professionals, patients, and pharmaceutical companies. However, it does not systematically compile hepatotoxicity data for compounds. To make use of the side effect DB 31, the toxicity learning device 10 of the present embodiment therefore generates labels relating to hepatotoxicity from the diverse data stored in the side effect DB 31.

FIG. 10(a) is a diagram showing an example of the data stored in the side effect DB 31. In the example shown in FIG. 10(a), side effects 1 to m are stored for drugs 1 to k. Each entry of the matrix, such as $x_{11}$, lies at the intersection of a drug and a side effect and indicates whether that drug has that side effect.

The label generation unit 18 classifies the drugs stored in the side effect DB 31 into the target drug (the drug used for the sample input to the toxicity learning device 10) and other drugs, and classifies the side effects stored in the side effect DB 31 into the side effect of interest (that is, the hepatotoxic side effect) and other side effects. It then counts the number of cases n in which side effects occurred for each group of drugs.

FIG. 10(b) is a diagram showing an example of the processed data from the side effect DB 31. In the example shown in FIG. 10(b), the number of cases in which the side effect of interest occurred with the target drug is $n_{11}$, the number of cases in which it occurred with other drugs is $n_{21}$, and the total for the side effect of interest is $n_{+1}$. Likewise, the number of cases in which other side effects occurred with the target drug is $n_{12}$, the number of cases in which other side effects occurred with other drugs is $n_{22}$, and the total for other side effects is $n_{+2}$.
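As an illustration of how the matrix of FIG. 10(a) could be collapsed into the four counts of FIG. 10(b), the sketch below sums a drug-by-side-effect array over the two groupings. It assumes that each entry holds a report count (or a 0/1 indicator) and that the target drug and the side effects of interest are given as row and column indices; the function name `contingency_counts` and its parameters are hypothetical.

```python
import numpy as np

def contingency_counts(x, target_drugs, target_effects):
    """Collapse a (drugs x side effects) array into n11, n12, n21, n22.

    n11: target drug,  side effect of interest
    n12: target drug,  other side effects
    n21: other drugs,  side effect of interest
    n22: other drugs,  other side effects
    """
    x = np.asarray(x)
    drug_mask = np.zeros(x.shape[0], dtype=bool)
    drug_mask[list(target_drugs)] = True
    effect_mask = np.zeros(x.shape[1], dtype=bool)
    effect_mask[list(target_effects)] = True

    n11 = int(x[drug_mask][:, effect_mask].sum())
    n12 = int(x[drug_mask][:, ~effect_mask].sum())
    n21 = int(x[~drug_mask][:, effect_mask].sum())
    n22 = int(x[~drug_mask][:, ~effect_mask].sum())
    return n11, n12, n21, n22
```

The column totals follow directly as $n_{+1} = n_{11} + n_{21}$ and $n_{+2} = n_{12} + n_{22}$.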

To determine whether the target drug is hepatotoxic, the label generation unit 18 calculates the reporting odds ratio (ROR) for the specific event. Specifically, it uses the following equation (1).
$$\mathrm{ROR} = \frac{n_{11} / n_{21}}{n_{12} / n_{22}} \quad \cdots (1)$$

The numerator $(n_{11} / n_{21})$ expresses how often the side effect of interest occurred with the target drug, as a ratio to the other drugs. The denominator $(n_{12} / n_{22})$ expresses how often other side effects occurred with the target drug, as a ratio to the other drugs. If the ratio of the numerator to the denominator is close to 1, the reports of the side effect of interest can be interpreted as having arisen by chance; if the ratio is considerably greater than 1, the reports of the side effect of interest for the target drug can be interpreted as not being due to chance. The label generation unit 18 determines that the target drug has the side effect of interest when the lower limit of the 95% confidence interval of the ROR is greater than 1.
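The decision rule above can be sketched in a few lines. The text does not state how the 95% confidence interval is obtained; the sketch assumes the commonly used log-scale approximation, in which the standard error of ln(ROR) is the square root of the sum of the reciprocals of the four counts. The function names are hypothetical.

```python
import math

def ror_with_ci(n11, n12, n21, n22, z=1.96):
    """Return (ROR, lower, upper) following equation (1).

    Assumption: the interval is exp(ln(ROR) +/- z * SE) with
    SE = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22); z = 1.96 gives ~95%.
    """
    ror = (n11 / n21) / (n12 / n22)
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    lower = math.exp(math.log(ror) - z * se)
    upper = math.exp(math.log(ror) + z * se)
    return ror, lower, upper

def has_side_effect_signal(n11, n12, n21, n22):
    """Label rule from the text: lower 95% limit of the ROR exceeds 1."""
    _, lower, _ = ror_with_ci(n11, n12, n21, n22)
    return lower > 1.0
```

This sketch raises an error when any of the four counts is zero; how such cells should be handled (for example, by a continuity correction) is not addressed in the text.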

When counting the number of reports of the side effect of interest from the data stored in the side effect DB 31 (see FIG. 10(a)), the label generation unit 18 may also use an ontology or a medical terminology dictionary (MedDRA) to aggregate the reported side effect terms.

Because the toxicity learning device 10 of the second embodiment generates hepatotoxicity labels from the data stored in the side effect DB 31, it can generate teacher data from data on many compounds. In the present embodiment, the ROR is calculated to determine whether there is a hepatotoxicity signal, but the proportional reporting ratio (PRR) for the specific event, calculated by the following equation (2), may be used instead. Here $n_{1+}$ and $n_{2+}$ are understood to be the row totals, $n_{1+} = n_{11} + n_{12}$ and $n_{2+} = n_{21} + n_{22}$.
$$\mathrm{PRR} = \frac{n_{11} / n_{1+}}{n_{21} / n_{2+}} \quad \cdots (2)$$
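For comparison with the ROR, equation (2) can be evaluated directly from the same four counts. The sketch assumes the usual reading of the notation, in which $n_{1+}$ and $n_{2+}$ are the row totals for the target drug and for the other drugs; the function name `prr` is hypothetical.

```python
def prr(n11, n12, n21, n22):
    """Proportional reporting ratio following equation (2).

    Assumption: n1+ = n11 + n12 and n2+ = n21 + n22 (row totals).
    """
    n1_plus = n11 + n12  # all reports for the target drug
    n2_plus = n21 + n22  # all reports for the other drugs
    return (n11 / n1_plus) / (n21 / n2_plus)
```

With counts of 20, 80, 100, and 1800, the target drug has 20% of its reports on the side effect of interest against 100/1900 for the other drugs, giving a PRR of 3.8.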

Methods other than those described above, such as principal component analysis, factor analysis, or an SVM, may also be used to determine from the data stored in the side effect DB 31 whether the target drug has the side effect of interest.

The toxicity learning device and the toxicity prediction device of the present invention have been described above in detail with reference to embodiments, but the toxicity learning device of the present invention is not limited to those embodiments. For example, in the embodiments described above, the expression data of the sample is ranked and the top 2% and bottom 2% are set to "1" and "-1" respectively; alternatively, the top 2% may be set to "1" and all others to "0".
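The two encoding variants mentioned here can be sketched with one function. The sketch assumes that genes are ranked by their per-gene comparison value in ascending order, so that "top" means the largest values; the function name `encode_ranked` and the `mode` parameter are hypothetical.

```python
import numpy as np

def encode_ranked(diff, fraction=0.02, mode="ternary"):
    """Encode per-gene comparison values by rank.

    mode="ternary": top fraction -> 1, bottom fraction -> -1, others -> 0
    mode="binary":  top fraction -> 1, all others -> 0
    """
    if mode not in ("ternary", "binary"):
        raise ValueError("mode must be 'ternary' or 'binary'")
    diff = np.asarray(diff, dtype=float)
    k = max(1, int(round(diff.size * fraction)))
    if 2 * k > diff.size:
        raise ValueError("fraction too large for the number of genes")
    order = np.argsort(diff, kind="stable")  # ascending rank
    code = np.zeros(diff.size, dtype=np.int8)
    code[order[-k:]] = 1
    if mode == "ternary":
        code[order[:k]] = -1
    return code
```

Exposing `fraction` as a parameter also makes it straightforward to try cut-offs other than 2% when the accuracy of the trained model is unsatisfactory.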

In the embodiments described above, the toxicity learning device 10 and the toxicity prediction device 20 are configured as separate devices, but they may be configured as a single device. Configuring them as a single device makes it easy to revise the learned model based on the inference results of the toxicity prediction device 20. If the inference accuracy of the learned model is poor, for example, the rank cut-offs used by the encoding unit 14 (the top and bottom percentages of expression data converted to "1" and "-1") may be changed.

In the embodiments described above, to increase the amount of teacher data, the comparison unit 13 compares the expression data of the sample with the expression data of the control in three ways and generates three times as much teacher data. The toxicity learning method of the present invention may also increase the amount of teacher data by inputting the raw measured data as the sample expression data. That is, experiments on living organisms normally average multiple measurements to reduce experimental fluctuation; by skipping this averaging, treating each measurement as data from an independent experiment, and using each one as teacher data, the amount of data can be increased.
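The replicate-based augmentation described above can be illustrated as follows. This is a sketch under stated assumptions: the replicates are held in an array of shape (replicates, genes), the comparison is a plain difference from the control, and the function name `rows_from_replicates` is hypothetical.

```python
import numpy as np

def rows_from_replicates(replicates, control, label, average=False):
    """Build per-gene comparison rows and labels for one compound.

    average=True  -> conventional approach: one row from the replicate mean
    average=False -> each replicate is kept as an independent row
    """
    replicates = np.asarray(replicates, dtype=float)  # (r, genes)
    control = np.asarray(control, dtype=float)        # (genes,)
    if average:
        replicates = replicates.mean(axis=0, keepdims=True)
    rows = replicates - control  # broadcast the control over all rows
    labels = np.full(rows.shape[0], label)
    return rows, labels
```

With three replicates, the unaveraged path yields three labeled rows for the compound where averaging would yield one; each row would then be encoded before training.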

The present invention is useful as, for example, a toxicity learning device for a model used in a toxicity prediction device that determines the toxicity of an unknown compound.

Reference Signs List
10 Toxicity learning device
11 Input unit
12 Teacher data generation unit
13 Comparison unit
14 Encoding unit
15 Labeling unit
16 Model learning unit
17 Learned model storage unit
18 Label generation unit
20 Toxicity prediction device
21 Input unit
22 Preprocessing unit
23 Comparison unit
24 Encoding unit
25 Inference unit
26 Output unit
27 Control data storage unit
28 Learned model storage unit
30 Hepatotoxicity database

Claims

A toxicity learning device comprising:
an input unit for inputting expression data of a sample exposed to a compound and expression data of a control;
a comparison unit that compares the expression data of the sample and the control for each predetermined gene;
an encoding unit that encodes the expression data of each gene based on the difference in the expression data;
a labeling unit that attaches a label for the toxicity of the compound to the encoded expression data; and
a model learning unit that uses the labeled teacher data to train a model for predicting the toxicity of a compound from gene expression data.

The toxicity learning device according to claim 1, wherein the encoding unit ranks the genes based on the difference in the expression data and assigns "1" to a predetermined number of genes ranked highest, "-1" to a predetermined number of genes ranked lowest, and "0" to the other genes.

The toxicity learning device according to claim 1, wherein the comparison unit obtains a plurality of comparison results by comparing the expression data for each gene by the following methods (i) to (iii).
(i) taking the difference between the expression data of the sample and the control;
(ii) taking the expression data of the sample and the control;
(iii) normalizing the expression data of the sample and the control and taking the difference.

The toxicity learning device according to any one of claims 1 to 3, wherein the labeling unit uses, as the label, toxicity data read from a database storing the toxicity of existing compounds.

The toxicity learning device according to claim 1, further comprising a label generation unit that reads side effect data of the compound from a database storing the side effects of existing compounds and obtains toxicity data of the compound based on the read side effect data,
wherein the labeling unit uses the obtained toxicity data as the label.

A toxicity learning method comprising the steps of:
inputting expression data of a sample exposed to a known compound and expression data of a control;
comparing the expression data of the sample and the control for each gene;
encoding the expression data of each gene based on the difference in the expression data;
generating teacher data by labeling the encoded expression data with the toxicity of the compound; and
using the teacher data to train a model for predicting the toxicity of a compound from gene expression data.

The toxicity learning method according to claim 6, wherein the encoding step comprises:
ranking the genes based on the difference in the expression data; and
assigning "1" to a predetermined number of genes ranked highest, "-1" to a predetermined number of genes ranked lowest, and "0" to the other genes.

The toxicity learning method according to claim 6, wherein the comparing step obtains a plurality of comparison results by comparing the expression data for each gene by the following methods (i) to (iii).
(i) taking the difference between the expression data of the sample and the control;
(ii) taking the expression data of the sample and the control;
(iii) normalizing the expression data of the sample and the control and taking the difference.

The toxicity learning method according to claim 6, wherein:
in the inputting step, expression data of a plurality of samples exposed to one known compound is input; and
in the comparing step, the expression data of the plurality of samples and of the control is compared for each gene, and a plurality of comparison results is obtained for the one compound.

The toxicity learning method according to claim 6, wherein, in the step of generating the teacher data, toxicity data read from a database storing the toxicity of existing compounds is used as the label.

The toxicity learning method according to any one of claims 6 to 9, wherein, in the step of generating the teacher data, toxicity data of the compound is obtained based on side effect data read from a database storing the side effects of existing compounds, and the obtained toxicity data is used as the label.

A program for generating a model used to infer the toxicity of a compound based on expression data of a sample exposed to the compound, the program causing a computer to execute the steps of:
inputting expression data of a sample exposed to a compound and expression data of a control;
comparing the expression data of the sample and the control for each gene;
encoding the expression data of each gene based on the difference in the expression data;
generating teacher data by labeling the encoded expression data with the toxicity of the compound; and
using the teacher data to train a model for predicting the toxicity of a compound from gene expression data.

A toxicity prediction device that infers the toxicity of a compound using a learned model trained by the toxicity learning method according to claim 6, the device comprising:
an input unit for inputting expression data of a sample exposed to an unknown compound;
a comparison unit that compares the expression data of the sample and the expression data of a control for each predetermined gene;
an encoding unit that encodes the expression data of each gene based on the difference in the expression data;
an inference unit that infers the toxicity of the compound by applying the encoded expression data to the learned model; and
an output unit that outputs the inference result of the inference unit.

A trained model for causing a computer to function so as to output a value quantifying the toxicity of a compound based on expression data obtained upon exposure to the compound, wherein expression data encoded based on the difference in the expression data is input, an operation based on the learned weighting coefficients of a neural network is performed on the input encoded data, and a value quantifying the toxicity of the compound is output from the output layer.