JP6788961B2

JP6788961B2 - Method for selecting genes capable of classifying cells

Info

Publication number: JP6788961B2
Application number: JP2015221656A
Authority: JP
Inventors: 法親緒方
Original assignee: CHITOSE BIO EVOLUTION PTE. LTD.
Current assignee: CHITOSE BIO EVOLUTION PTE. LTD.
Priority date: 2015-11-11
Filing date: 2015-11-11
Publication date: 2020-11-25
Anticipated expiration: 2035-11-11
Also published as: JP2017091277A

Description

The present invention relates to a method for selecting genes capable of classifying cells.

Japanese Unexamined Patent Publication No. 2012-39994 discloses a principal component calculation method and a transcriptome analysis method.

Japanese Unexamined Patent Publication No. 2014-075995 discloses a method for selecting experimental groups to be subjected to differentially expressed gene extraction or pathway analysis using transcriptomes.

As described above, single-cell transcriptome data can now be acquired and is used for a variety of purposes, which has made it easier to analyze the characteristics of individual cells. The characteristics of an individual cell are identified by comparing its transcriptome data with that of other cells. Identifying individual cells therefore requires either reference cells for comparison or a classification based on comparison between individual cells. Until now, to classify single-cell transcriptome data by such comparison, individual cells have first been classified according to known categories defined by the cell cycle or by marker genes, and the single-cell transcriptome data has then been classified according to those categories.

However, the Chinese hamster ovary-derived established cell line (CHO-K1), for example, is an established cell line for producing therapeutic proteins, but because it derives from a non-model organism, few marker genes are known for it. Classifying individual cells in advance was therefore difficult, and classifying the single-cell transcriptome data was impossible.

Japanese Unexamined Patent Publication No. 2012-39994
Japanese Unexamined Patent Publication No. 2014-075995

An object of the present invention is to provide a method that enables classification of single-cell transcriptome data.

The present invention is basically based on the finding that single-cell transcriptome data can be classified by obtaining expression level distributions for a plurality of genes, analyzing those distributions, performing principal component analysis, and thereby selecting candidate genes for classifying the target cells.

The first aspect of the present invention relates to a method for selecting genes capable of classifying cells.
This method includes an expression level distribution calculation step (S101), a distribution analysis step (S102), and a classifiable-gene initial candidate selection step (S103).
The expression level distribution calculation step (S101) obtains, from the transcriptome data of a plurality of cells, the distribution of expression levels across individual cells for each of a plurality of genes.
The distribution analysis step (S102) determines whether the expression level distribution across individual cells has a plurality of peaks.
The classifiable-gene initial candidate selection step (S103) selects each gene determined in the distribution analysis step to have a plurality of peaks as an initial candidate for a gene capable of classifying cells.

A preferred embodiment of the first aspect further includes a classifiable-gene candidate selection step (S104). This step performs a quantitative classification for each initial candidate gene, determines whether a first peak and a second peak among the plurality of peaks are separable in that quantitative classification, and selects the genes whose peaks are separable as candidates for genes capable of classifying cells. The quantitative classification preferably includes principal component analysis.

A preferred embodiment of the first aspect includes a step (S105) in which the cell groups belonging to the first peak and the second peak among the plurality of peaks are designated principal component 1 and principal component 2, respectively; it is determined whether principal component 1 and principal component 2 can be separated in an MA plot of the two at positions where A is at or above a predetermined value; and the genes for which they can be separated are selected as genes capable of classifying cells.

The second aspect of the present invention relates to a computer and a program that implement the above method.
The selection device is a device, including a computer, for selecting genes capable of classifying cells. The computer includes expression level distribution calculation means 11, distribution analysis means 13, and classifiable-gene initial candidate selection means 15.
The expression level distribution calculation means 11 obtains, from the transcriptome data of a plurality of cells, the distribution of expression levels across individual cells for each of a plurality of genes.
The distribution analysis means 13 determines whether the expression level distribution across individual cells has a plurality of peaks.
The classifiable-gene initial candidate selection means 15 selects each gene that the distribution analysis means has determined to have a plurality of peaks as an initial candidate for a gene capable of classifying cells.

A preferred embodiment of the second aspect further includes means 17 for performing a quantitative classification for each initial candidate gene and determining whether a first peak and a second peak among the plurality of peaks are separable in that quantitative classification, and classifiable-gene candidate selection means 19 for selecting the gene as a candidate for a gene capable of classifying cells when the first peak and the second peak are separable.

A preferred embodiment of the second aspect further includes means 21 that, with the cell groups belonging to the first peak and the second peak designated principal component 1 and principal component 2, respectively, determines whether principal component 1 and principal component 2 can be separated in an MA plot of the two, and selects the genes for which they can be separated as genes capable of classifying cells.

Another embodiment of the second aspect relates to a program. The program causes a computer to function as a device including the expression level distribution calculation means 11, the distribution analysis means 13, and the classifiable-gene initial candidate selection means 15. The program may further cause the computer to function as the computer described above. A program is typically stored on a recording medium such as a CD-ROM or made available for download over the Internet, and the various means and functions are implemented by installing it on a computer.

According to the present invention, a method that enables classification of single-cell transcriptome data, and a device for that purpose, can be provided.

FIG. 1 is a block diagram of a processing device for carrying out the method for selecting genes capable of classifying cells.
FIG. 2 is a process diagram of the method for selecting genes capable of classifying cells.
FIG. 3 is a histogram showing the distribution of the expression level of one mRNA (enolase) across individual cells.
FIG. 4 is a graph, in place of a drawing, showing the results of principal component analysis.
FIG. 5 is a graph, in place of a drawing, showing an MA plot.

Embodiments for carrying out the present invention are described below with reference to the drawings. The present invention is not limited to the embodiments described below and also includes modifications of them that are obvious to those skilled in the art. FIG. 1 is a block diagram of a processing device for carrying out the method for selecting genes capable of classifying cells, and FIG. 2 is a process diagram of that method.

The first aspect of the present invention relates to a method for selecting genes capable of classifying cells.
This method includes an expression level distribution calculation step, a distribution analysis step, and a classifiable-gene initial candidate selection step.
The expression level distribution calculation step obtains, from the transcriptome data of a plurality of cells, the distribution of expression levels across individual cells for each of a plurality of genes.
The distribution analysis step determines whether the expression level distribution across individual cells has a plurality of peaks.
The classifiable-gene initial candidate selection step selects each gene determined in the distribution analysis step to have a plurality of peaks as an initial candidate for a gene capable of classifying cells.

A preferred embodiment of the first aspect further includes a classifiable-gene candidate selection step. This step performs a quantitative classification for each initial candidate gene, determines whether a first peak and a second peak among the plurality of peaks are separable in that quantitative classification, and selects the genes whose peaks are separable as candidates for genes capable of classifying cells. The quantitative classification preferably includes principal component analysis.

In a preferred embodiment of the first aspect, the cell groups belonging to the first peak and the second peak among the plurality of peaks are designated principal component 1 and principal component 2, respectively; it is determined whether principal component 1 and principal component 2 can be separated in an MA plot of the two at positions where A is at or above a predetermined value; and the genes for which they can be separated are selected as genes capable of classifying cells.

The gene selection method described above may be carried out by a person calculating by hand, or automatically by a computer. That is, the present invention also provides a computer for executing the above method, a program for causing a computer to implement the above method, and a computer-readable information recording medium storing such a program.

The second aspect of the present invention relates to a device, including a computer, for selecting genes capable of classifying cells. The computer includes expression level distribution calculation means (11), distribution analysis means (13), and classifiable-gene initial candidate selection means (15). The computer has an input/output unit, a storage unit, a control unit, and an arithmetic unit, connected so that they can exchange information. The control unit receives instructions from a control program stored in the storage unit, reads out the various information stored in the storage unit, causes the arithmetic unit to perform predetermined calculations, stores the results in the storage unit, and outputs them from the input/output unit as appropriate.

The expression level distribution calculation means (11) obtains, from the transcriptome data of a plurality of cells, the distribution of expression levels across individual cells for each of a plurality of genes. The distribution analysis means (13) determines whether the expression level distribution across individual cells has a plurality of peaks. The classifiable-gene initial candidate selection means (15) selects each gene that the distribution analysis means has determined to have a plurality of peaks as an initial candidate for a gene capable of classifying cells.

The computer preferably further includes means (17) for determining separability in a quantitative classification and classifiable-gene candidate selection means (19). The separability determination means performs a quantitative classification for each initial candidate gene and determines whether a first peak and a second peak among the plurality of peaks are separable in that quantitative classification.
The classifiable-gene candidate selection means selects the gene as a candidate for a gene capable of classifying cells when the first peak and the second peak are separable.

The computer preferably further includes means (21) for selecting genes capable of classifying cells. With the cell groups belonging to the first peak and the second peak among the plurality of peaks designated principal component 1 and principal component 2, respectively, this means determines whether principal component 1 and principal component 2 can be separated in an MA plot of the two, and selects the genes for which they can be separated as genes capable of classifying cells.

The method for selecting genes capable of classifying cells is described below using an example that employs the above computer. The computer preferably has installed on it a control program stored on a recording medium or obtained by downloading from the Internet. The control program causes the computer to implement the various means for realizing the method for selecting genes capable of classifying cells. Examples of the recording medium are a CD-ROM, a DVD, a USB device, and a memory card; any medium capable of storing a program may be used.

The computer stores the transcriptome data of a plurality of cells in its storage unit. A transcriptome is the totality of all mRNA (or primary transcripts) present in a cell under specific conditions. Transcriptome data is known, as disclosed in Japanese Unexamined Patent Publication No. 2012-39994 (principal component calculation method and transcriptome analysis method), Japanese Unexamined Patent Publication No. 2014-075995 (method for selecting experimental groups to be subjected to differentially expressed gene extraction or pathway analysis using transcriptomes), and Japanese Patent No. 5714326 (transcriptome biomarkers for individual risk assessment in newly progressing heart failure). Analysis techniques that use transcriptome data, such as principal component analysis, are also known, as disclosed in these documents.

On receiving an execution command from the input/output unit, the computer causes the expression level distribution calculation means (11) to obtain, from the transcriptome data of a plurality of cells, the distribution of expression levels across individual cells for each of a plurality of genes (S101). In this step (S101), the transcriptome data of a plurality of cells of the target individual is read from the storage unit, and a calculation is performed to obtain the expression level distribution across individual cells for each of the plurality of genes. In one example of the calculation, the horizontal axis is the RPM value, the class width can be set as appropriate, and the number of cells falling within each class is obtained as the frequency. RPM normalizes the read counts so that the total number of reads is one million. The computer preferably has a program for RPM normalization installed. An example of such a program is one written in R, which is publicly available software. In this way, differentially expressed genes are detected (see FIG. 3, described later).
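The RPM normalization and class-width histogram described above can be sketched as follows. This is a minimal Python illustration, not the R program the description mentions; the function names, the dictionary layout of the read counts, and the gene names `GeneA` and `GeneB` are assumptions made for the example.

```python
from collections import Counter


def rpm_normalize(read_counts):
    """Scale one cell's per-gene read counts so that they sum to one million."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("cell has no mapped reads")
    return {gene: count * 1_000_000 / total for gene, count in read_counts.items()}


def expression_histogram(rpm_values, class_width):
    """Count cells per RPM class; class k covers [k * width, (k + 1) * width)."""
    bins = Counter(int(value // class_width) for value in rpm_values)
    if not bins:
        return []
    return [bins.get(k, 0) for k in range(max(bins) + 1)]


# Hypothetical read counts for two cells (illustrative values only).
cells = [{"GeneA": 30, "GeneB": 70}, {"GeneA": 10, "GeneB": 90}]
gene_a_rpm = [rpm_normalize(cell)["GeneA"] for cell in cells]
frequencies = expression_histogram(gene_a_rpm, class_width=100_000)
```

Each entry of `frequencies` is the number of cells whose RPM for the gene falls in that class, which is the quantity plotted on the vertical axis of the histogram.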

The distribution analysis step (S102) is performed by the distribution analysis means (13), which determines whether the expression level distribution across individual cells has a plurality of peaks. In one example, the distribution analysis means (13) analyzes, for a plurality of genes, whether peaks are present in the histogram showing the distribution of each gene's expression across cells. In another example, the distribution analysis means (13) reads the frequency of each class from the storage unit and compares it with the frequency of the next class to determine whether the frequency is increasing or decreasing; when the increase or decrease continues for at least a predetermined number of classes (for example, two or more classes, or three or more classes), the distribution is judged to show a continuous increase or a continuous decrease. A continuous increase followed by a continuous decrease is judged to be a peak, and a continuous decrease followed by a continuous increase is judged to be a valley. Each peak then contains a notional summit. When there are two or more summits (and hence peaks), the expression level distribution across individual cells is judged to have a plurality of peaks. The distribution analysis means (13) may instead find the inflection points of the expression distribution and derive the number of summits from them.
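The run-based peak test described above might be sketched as follows. This is one possible reading under stated assumptions: the function names and the default `min_run=2` are illustrative, flat steps and short runs are simply ignored, and a distribution that starts at its maximum is not counted as a peak because no rising run precedes it.

```python
def count_peaks(frequencies, min_run=2):
    """Count peaks: a rising run of >= min_run steps followed by a falling run."""
    # Sign of each step between neighbouring classes: +1 rise, -1 fall, 0 flat.
    steps = [(b > a) - (b < a) for a, b in zip(frequencies, frequencies[1:])]
    trends = []                      # sequence of accepted trends (+1 or -1)
    run_sign, run_len = 0, 0
    for sign in steps + [0]:         # the trailing 0 flushes the final run
        if sign == run_sign:
            run_len += 1
            continue
        long_enough = run_sign != 0 and run_len >= min_run
        if long_enough and (not trends or trends[-1] != run_sign):
            trends.append(run_sign)
        run_sign, run_len = sign, 1
    # A peak is a rising trend immediately followed by a falling trend.
    return sum(1 for a, b in zip(trends, trends[1:]) if a == 1 and b == -1)


def is_multimodal(frequencies, min_run=2):
    """True when the histogram shows two or more peaks."""
    return count_peaks(frequencies, min_run) >= 2
```

Requiring runs of at least `min_run` classes keeps single-class fluctuations from being counted as separate peaks, which is the purpose of the "predetermined number of classes" in the description.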

When the expression level distribution of a gene has a plurality of peaks, it may be possible to classify single-cell transcriptome data using that gene.

In a more preferred example, the classifiable-gene initial candidate selection step (S103) is performed by the classifiable-gene initial candidate selection means (15), which selects each gene that the distribution analysis means has determined to have a plurality of peaks as an initial candidate for a gene capable of classifying cells. When the distribution analysis means (13) judges that the expression level distribution across individual cells has a plurality of peaks, the classifiable-gene initial candidate selection means (15) selects the gene as an initial candidate for a gene capable of classifying the cells and stores it in the storage unit. An output unit such as an interface, a monitor, or a printer may output the gene as a gene capable of classifying cells.

In a more preferred example, the classifiable-gene candidate selection step (S104) is performed by the separability determination means (17) and the classifiable-gene candidate selection means (19).
The means (17) for determining separability in a quantitative classification performs a quantitative classification for each initial candidate gene and determines whether a first peak and a second peak among the plurality of peaks are separable in that quantitative classification. Examples of quantitative classification are principal component analysis, regression analysis, and factor analysis. These can be carried out easily using, for example, the software R. Principal component analysis of transcriptome data is known and can be performed with a known program. Principal component analysis is described, for example, in the following literature.
Jackson, J. Edward (1991), A User's Guide to Principal Components (New York: John Wiley & Sons, Inc.).
Shaw, Peter J. A. (2003), Multivariate Statistics for the Environmental Sciences (London: Hodder Arnold).
Principal component analysis defines new axes for representing the dimensions of a matrix. The new axes are mutually orthogonal. The first axis runs along the center of the set of elements, and the second axis runs along the center of the residual not represented by the first axis. Each newly defined axis thus approximates the data efficiently in fewer dimensions than the original matrix.
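The construction of orthogonal axes described above can be illustrated with a small pure-Python sketch. The description performs principal component analysis in R; the power-iteration approach, the function name `principal_components`, and its parameters below are assumptions chosen to keep the example self-contained, and rank-deficient input is not handled robustly.

```python
import math
import random


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def principal_components(rows, n_components=2, iterations=200, seed=0):
    """Return (axes, scores); rows are cells, columns are gene expression values."""
    n_genes = len(rows[0])
    means = [sum(col) / len(rows) for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    rng = random.Random(seed)        # fixed seed keeps the sketch reproducible
    axes = []
    for _component in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for _step in range(iterations):
            # w = X^T (X v), proportional to the covariance matrix applied to v.
            proj = [_dot(row, v) for row in centered]
            w = [_dot(proj, col) for col in zip(*centered)]
            # Remove directions already found so the new axis is orthogonal.
            for axis in axes:
                k = _dot(w, axis)
                w = [x - k * a for x, a in zip(w, axis)]
            norm = math.sqrt(_dot(w, w))
            if norm < 1e-12:         # no variance left in this direction
                break
            v = [x / norm for x in w]
        axes.append(v)
    # Coordinates of every cell along each principal axis.
    scores = [[_dot(row, axis) for axis in axes] for row in centered]
    return axes, scores
```

The `scores` rows give each cell's coordinates along principal components 1 and 2, which is what a principal component plot displays. The sign of each axis is arbitrary, so only relative positions are meaningful.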

The computer preferably stores in its storage unit a threshold for determining whether the first peak and the second peak among the plurality of peaks are separable in the quantitative classification. When principal component analysis is used, it is determined whether the cells belonging to each peak of the expression level distribution are separated in the principal component plot. A value indicating the degree of separation is compared with the threshold stored in the storage unit to judge whether the cells in the first peak and the cells in the second peak are separable. For example, the area of the region occupied in the principal component plot by the cell group derived from the first peak (or by a cell group containing a predetermined proportion of those cells) and the area of the region occupied by the cell group derived from the second peak (or by a cell group containing a predetermined proportion of those cells) may be obtained together with the area where the two regions overlap, and the ratio of the overlapping area to the area of the first or second region may be used to determine whether the first peak and the second peak are separable in the quantitative classification.
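One way to turn the overlap-area comparison above into code is sketched below. The description does not say how a group's region is drawn, so the axis-aligned bounding rectangle, the choice of the smaller region as the denominator, and the default threshold `max_overlap=0.1` are all assumptions made for illustration.

```python
def bounding_box(points):
    """Axis-aligned rectangle (x_min, y_min, x_max, y_max) enclosing the points."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


def box_area(box):
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def overlap_ratio(points_a, points_b):
    """Overlapping area divided by the smaller of the two region areas."""
    a, b = bounding_box(points_a), bounding_box(points_b)
    smaller = min(box_area(a), box_area(b))
    if smaller == 0:
        raise ValueError("a group occupies a zero-area region")
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box_area(inter) / smaller


def is_separable(points_a, points_b, max_overlap=0.1):
    """Treat the groups as separable when the overlap ratio is within the threshold."""
    return overlap_ratio(points_a, points_b) <= max_overlap
```

Here `points_a` and `points_b` would be the (PC1, PC2) coordinates of the cells assigned to the first and second peak. A convex hull or confidence ellipse could replace the rectangle if a tighter region is wanted.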

For example, when the first peak and the second peak are separable, the classifiable-gene candidate selection means (19) selects the gene as a candidate for a gene capable of classifying cells. As described above, when the first peak and the second peak are separable in the quantitative classification, the computer may output the gene as a gene capable of classifying cells.

The classifiable-gene selection step (S105) is performed by the means (21) for selecting genes capable of classifying cells. With the cell groups belonging to the first peak and the second peak among the plurality of peaks designated principal component 1 and principal component 2, respectively, the means (21) determines whether principal component 1 and principal component 2 can be separated in an MA plot of the two, and selects the genes for which they can be separated as genes capable of classifying cells. An MA plot is a scatter plot in which the horizontal axis is the average expression level of the two groups and the vertical axis is the ratio of the expression levels of the two groups. Whether principal component 1 and principal component 2 are separable in the MA plot may be determined in the same way as described above. The MA plot itself is known and can be produced using, for example, the software R. Alternatively, a threshold may be set in advance and compared, using R, with a value derived from the MA plot to determine whether principal component 1 and principal component 2 are separable.
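The MA plot coordinates can be computed as in the sketch below. The description defines the axes only as the average expression and the expression ratio of the two groups; taking both on a log2 scale follows the usual MA plot convention, and the pseudocount, the function names, and the thresholds `a_min` and `m_min` are assumptions for illustration rather than the criterion used in the patent.

```python
import math


def ma_values(mean_a, mean_b, pseudocount=1.0):
    """Return (M, A) for one gene from its mean expression in two cell groups."""
    log_a = math.log2(mean_a + pseudocount)   # pseudocount avoids log2(0)
    log_b = math.log2(mean_b + pseudocount)
    return log_a - log_b, (log_a + log_b) / 2


def genes_separated_on_ma(group_a_means, group_b_means, a_min, m_min):
    """Genes whose A is at least a_min and whose |M| is at least m_min."""
    selected = []
    for gene, mean_a in group_a_means.items():
        m, a = ma_values(mean_a, group_b_means[gene])
        if a >= a_min and abs(m) >= m_min:
            selected.append(gene)
    return selected
```

Restricting the test to genes with A at or above a chosen value mirrors the "predetermined value of A" in the first aspect; it excludes weakly expressed genes, whose ratios are dominated by noise.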

CHO-K1 cells adapted to serum-free suspension culture were cultured with agitation in a T-25 flask. Using a Sony SH800 cell sorter, single CHO-K1 cells were sorted into a 96-well plate. Each cell was placed in 0.4 μL of medium. Sequencing libraries were prepared according to the Quartz-seq protocol. Sequencing was performed on an Illumina NextSeq 500. The resulting reads were mapped with the Bowtie program to the CHO-K1 cell transcriptome reference registered in RefSeq (bowtie options: -l 75 -n 2). Using a shell program, transcriptome data was obtained as the relative expression level of every messenger RNA in each sample. Principal component analysis was performed using R and its stats package. RPM (normalization by reads per million mapped reads) was calculated in R for every messenger RNA. The distribution of expression levels across individual cells was plotted for every messenger RNA, and these distributions were examined to identify the messenger RNAs with multimodal distributions.

FIG. 3 is a histogram showing the distribution of expression levels of one mRNA (enolase) across individual cells. The horizontal axis is the RPM value, and the vertical axis is the frequency (the number of cells falling in each RPM class). The class width can be set freely; in FIG. 3 it is 170. The horizontal axis of FIG. 3 runs from a minimum of 300 to a maximum of 4200. FIG. 3 shows that there is one group of cells in which the enolase RPM is below 1800 and another in which it is 1800 or above.
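
The binning behind such a histogram can be sketched as below. This is a minimal Python sketch using the class width (170), axis minimum (300), and split point (1800) stated for FIG. 3. The RPM values are hypothetical and are not the data of FIG. 3, and the function name `histogram` and the choice of 23 classes (enough to cover the axis up to 4200) are assumptions.

```python
def histogram(values, start, width, n_bins):
    """Count how many values fall in each class [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:  # values outside the axis range are ignored
            counts[idx] += 1
    return counts


# Hypothetical per-cell RPM values for one gene.
rpm_values = [350, 900, 1000, 1100, 1700, 2500, 2600, 3900]
counts = histogram(rpm_values, start=300, width=170, n_bins=23)

# Split the cells at the boundary between the two groups read off the histogram.
low = [v for v in rpm_values if v < 1800]
high = [v for v in rpm_values if v >= 1800]
print(counts, len(low), len(high))
```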

The histogram shown in FIG. 3 has two peaks, one where the RPM is below 1800 and one where it is 1800 or above. Such a histogram is treated as representing a messenger RNA expression level with multimodality. In this example, the distribution may be judged to be in an increasing or decreasing trend when, for example, the frequency keeps increasing or decreasing over at least a predetermined number of classes (for example, two or more, or three or more). An increasing trend followed by a decreasing trend constitutes one peak, whose summit lies somewhere within it. A decreasing trend followed by an increasing trend constitutes a valley, and if a further decreasing trend follows, there is an additional peak.
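
The trend rule described above can be sketched as follows. This is a minimal Python sketch rather than the implementation used in the example: the function names `trends` and `count_peaks` are hypothetical, and treating a flat step as interrupting a run is an assumption, since the text does not say how ties are handled.

```python
def trends(counts, min_run=2):
    """Return the sequence of sustained trends ('up' or 'down') in a histogram.

    A trend is recorded once the frequency has moved in the same direction
    for min_run consecutive classes. Repeats of the same trend are merged.
    """
    result = []
    run_dir, run_len = None, 0
    for prev, cur in zip(counts, counts[1:]):
        step = "up" if cur > prev else "down" if cur < prev else None
        if step is not None and step == run_dir:
            run_len += 1
        else:
            # Direction changed (or a flat step occurred): start a new run.
            run_dir, run_len = step, (1 if step else 0)
        if run_dir and run_len == min_run and (not result or result[-1] != run_dir):
            result.append(run_dir)
    return result


def count_peaks(counts, min_run=2):
    """Count peaks: each increasing trend directly followed by a decreasing trend."""
    seq = trends(counts, min_run)
    return sum(1 for a, b in zip(seq, seq[1:]) if a == "up" and b == "down")


# Hypothetical class frequencies.
bimodal = [1, 3, 6, 4, 2, 1, 2, 5, 7, 3, 1]
unimodal = [1, 4, 8, 5, 2]
print(count_peaks(bimodal), count_peaks(unimodal))
```

A gene would then be treated as multimodal when `count_peaks` returns two or more. Raising `min_run` makes the rule less sensitive to single-class fluctuations.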

Individual cells were classified based on the expression level of the multimodal messenger RNA. Using this classification, the points for the individual cells were identified in the plot of principal component 1 against principal component 2 obtained from the principal component analysis. The result is shown in FIG. 4. The cells whose enolase gene RPM in FIG. 3 is below 1800 are designated cell group A, and the cells whose RPM in FIG. 3 is 1800 or above are designated cell group B.

FIG. 4 shows the results of the principal component analysis in the example. The horizontal axis shows the principal component score for principal component 1, and the vertical axis shows the principal component score for principal component 2. The principal component analysis shows that cell group A and cell group B are candidates that can be classified. Each point in FIG. 4 corresponds to one cell. In this way, the points for the individual cells in the plot of principal component 1 against principal component 2 were identified according to the classification of individual cells obtained from the expression level of the multimodal messenger RNA, and enolase was obtained as multimodal messenger RNA information for which the plot of principal component 1 and principal component 2 can be divided into two or more clusters.
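
The idea of checking whether two labelled cell groups fall apart in a principal component plot can be sketched as below. This is a minimal pure-Python sketch under stated assumptions: the example used R and its stats package, whereas this sketch approximates the leading components by power iteration. The expression matrix, the group labels, and the separation criterion in `separated_on_pc1` are hypothetical.

```python
import math


def center(rows):
    """Subtract each column (gene) mean from the cell-by-gene matrix."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, iterations=500):
    """Approximate the dominant eigenpair by power iteration."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d))
    return value, v


def pca_scores(rows, n_components=2):
    """Return one tuple of principal component scores per cell."""
    x = center(rows)
    cov = covariance(x)
    d = len(cov)
    columns = []
    for _ in range(n_components):
        value, vec = top_eigenvector(cov)
        columns.append([sum(a * b for a, b in zip(row, vec)) for row in x])
        # Deflate: remove the component just found before looking for the next.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)] for i in range(d)]
    return list(zip(*columns))


def separated_on_pc1(scores, labels):
    """Hypothetical criterion: groups A and B do not overlap along PC1."""
    a = [s[0] for s, g in zip(scores, labels) if g == "A"]
    b = [s[0] for s, g in zip(scores, labels) if g == "B"]
    return max(a) < min(b) or max(b) < min(a)


# Hypothetical expression matrix: 6 cells (rows) by 3 genes (columns).
cells = [[1.0, 2.0, 5.0], [1.2, 1.8, 5.1], [0.9, 2.1, 4.9],
         [4.0, 6.0, 5.0], [4.2, 5.9, 5.2], [3.9, 6.1, 4.8]]
labels = ["A", "A", "A", "B", "B", "B"]
scores = pca_scores(cells)
print(separated_on_pc1(scores, labels))
```

The labels here stand in for the grouping obtained from the multimodal gene. The check asks only whether that grouping also appears as distinct clusters in the score plot.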

To confirm that individual cells are classified by enolase expression level, differential expression analysis was performed between the individual-cell clusters defined by enolase expression level (cell group A and cell group B). R and the TCC package were used for the differential expression analysis. Statistical processing yielded 199 messenger RNAs with a false discovery rate of 5% or less. These 199 genes formed clear clusters in the MA plot (FIG. 5). An MA plot is a scatter plot of two-group comparison data in which the horizontal axis is the logarithm of the product of the expression levels in the two groups and the vertical axis is the logarithm of the ratio of the expression levels between the two groups. In an MA plot, each gene is plotted as one point. In the example of FIG. 5, group G1 and group G2 represent cell group A and cell group B, respectively. Genes whose expression does not differ between group G1 and group G2 appear near 0 on the vertical axis. Genes that are highly expressed in only group G1 or only group G2 appear away from 0 on the vertical axis. From FIG. 5 and the results above, a subpopulation contained in the CHO-K1 cells was detected.
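
Selecting genes at a false discovery rate of 5% or less can be illustrated with the Benjamini-Hochberg adjustment. This is a minimal Python sketch under a stated assumption: the source does not say how the TCC package derives its FDR values, so Benjamini-Hochberg is used here only as a common FDR procedure. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original gene order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a two-group comparison.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
q_values = benjamini_hochberg(p_values)
selected = [i for i, q in enumerate(q_values) if q <= 0.05]
print(q_values, selected)
```

Genes whose adjusted value is at or below 0.05 would be the ones carried forward to the MA plot.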

The present invention can be used in the biotechnology and pharmaceutical industries.

11 Expression level distribution calculation means
13 Distribution analysis means
15 Classifiable gene initial candidate selection means
17 Means for determining whether separation is possible
19 Classifiable gene candidate selection means
21 Classifiable gene selection means

Claims

A method for selecting genes capable of classifying cells, comprising:
an expression level distribution calculation step of obtaining, from transcriptome data of a plurality of cells, the expression level distribution across a plurality of individual cells for each of a plurality of genes;
a distribution analysis step of determining whether the expression level distribution across the plurality of individual cells has a plurality of distribution peaks; and
a classifiable gene initial candidate selection step of selecting a gene determined in the distribution analysis step to have a plurality of distribution peaks as an initial candidate for a gene capable of classifying cells,
the method further comprising, when the cell groups contained in a first peak and a second peak among the plurality of distribution peaks are taken as principal component 1 and principal component 2, respectively, determining whether principal component 1 and principal component 2 can be separated, in an MA plot of principal component 1 and principal component 2, at positions where A is equal to or greater than a predetermined value, and selecting the separable genes as genes capable of classifying cells.

The method according to claim 1,
further comprising a classifiable gene candidate selection step of performing a quantitative classification on the initial candidates for genes capable of classifying cells, determining whether the first peak and the second peak among the plurality of distribution peaks can be separated in the quantitative classification, and selecting the separable genes as candidates for genes capable of classifying cells.

The method according to claim 2,
wherein the quantitative classification includes principal component analysis.

A device for selecting genes capable of classifying cells, the device including a computer,
wherein the computer comprises:
an expression level distribution calculation means (11) for obtaining, from transcriptome data of a plurality of cells, the expression level distribution across a plurality of individual cells for each of a plurality of genes;
a distribution analysis means (13) for determining whether the expression level distribution across the plurality of individual cells has a plurality of distribution peaks; and
a classifiable gene initial candidate selection means (15) for selecting a gene determined by the distribution analysis means (13) to have a plurality of distribution peaks as an initial candidate for a gene capable of classifying cells,
the device further comprising a means (21) for determining, when the cell groups contained in a first peak and a second peak among the plurality of distribution peaks are taken as principal component 1 and principal component 2, respectively, whether principal component 1 and principal component 2 can be separated in an MA plot of principal component 1 and principal component 2, and for selecting the separable genes as genes capable of classifying cells.

The device according to claim 4, further comprising:
a means (17) for performing a quantitative classification on the initial candidates for genes capable of classifying cells and determining whether the first peak and the second peak among the plurality of distribution peaks can be separated in the quantitative classification; and
a classifiable gene candidate selection means (19) for selecting the gene as a candidate for a gene capable of classifying cells when the first peak and the second peak are separable.

A program that causes a computer to function as:
an expression level distribution calculation means (11) for obtaining, from transcriptome data of a plurality of cells, the expression level distribution across a plurality of individual cells for each of a plurality of genes;
a distribution analysis means (13) for determining whether the expression level distribution across the plurality of individual cells has a plurality of distribution peaks;
a classifiable gene initial candidate selection means (15) for selecting a gene determined by the distribution analysis means to have a plurality of distribution peaks as an initial candidate for a gene capable of classifying cells; and
a means (21) for determining, when the cell groups contained in a first peak and a second peak among the plurality of distribution peaks are taken as principal component 1 and principal component 2, respectively, whether principal component 1 and principal component 2 can be separated in an MA plot of principal component 1 and principal component 2, and for selecting the separable genes as genes capable of classifying cells.