JP2008021260A

JP2008021260A - System for identifying RNA sequence on genome by mass spectrometry

Info

Publication number: JP2008021260A
Application number: JP2006194780A
Authority: JP
Inventors: Tsutomu Suzuki; Taketsune Miyauchi; Hiroo Ueda; Takeo Suzuki; Yuriko Sakaguchi
Original assignee: University of Tokyo NUC
Current assignee: University of Tokyo NUC
Priority date: 2006-07-14
Filing date: 2006-07-14
Publication date: 2008-01-31
Also published as: WO2008007662A1

Abstract

PROBLEM TO BE SOLVED: To provide an apparatus, a method, and the like for identifying a trace RNA molecule, in particular for identifying an RNA molecule on a genome sequence in silico from RNA molecular weight information.

SOLUTION: An RNA molecule search apparatus marks and identifies, on an arbitrary genome sequence, an arbitrary RNA molecule containing at least one target RNA fragment. It comprises: storage means 10 for storing an arbitrary genome sequence of an arbitrary species together with data on the cleavage mechanism of a DNase, an RNase, or both, capable of cleaving the sequence; input means 20 for reading a target RNA fragment molecular weight obtained by measuring at least one target RNA fragment that can be produced by cleavage with the same kind of enzyme; and calculation means for collating the at least one target RNA fragment molecular weight thus read with the sequence data and the cleavage-mechanism data in the storage means 10, and calculating a candidate region in which the target RNA fragment lies on the sequence held in the storage means 10.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to an RNA molecule search apparatus and search method for marking and identifying an arbitrary RNA molecule containing a target RNA fragment on an arbitrary genome sequence of an arbitrary species, and to a target RNA search program that causes a computer to realize that function, together with a computer-readable recording medium on which the program is recorded.

Recently, the discovery of RNA interference and microRNA has drawn attention to new functions carried out by RNAs that do not encode proteins (functional RNAs). A functional RNA is itself the final product of a gene. It is becoming increasingly clear that these RNAs behave as functional macromolecules and play important roles ranging from the regulation of gene expression to higher-order biological phenomena such as development and differentiation. Cases in which abnormalities of functional RNA cause disease are also being reported, so RNA abnormalities as well as protein abnormalities must be considered as causes of disease. To advance functional RNA research vigorously, the conventional approach of treating RNA as "information" is insufficient, and a new methodology that treats RNA as a "molecule" is indispensable.

However, the mainstream RNA analysis methods to date amplify cDNA by reverse-transcription PCR and determine its sequence, and such methods read out only the sequence information carried by the RNA. They are insufficient for obtaining qualitative information such as post-transcriptional processing and modification. Given the bias introduced by PCR, they can hardly be called quantitative either. Other methods are also in use, such as labeling with a radioisotope and analyzing the sequence with several base-specific ribonucleases (the Donis-Keller method), or the post-labeling method of Kuchino et al., which can also analyze modified bases. All of these, however, demand skill, time, and labor and are not suited to general use.

Meanwhile, MALDI, one of the two major ionization methods for biopolymers, invented by Koichi Tanaka of Shimadzu Corporation, who received the Nobel Prize in Chemistry, contributed greatly to protein research by mass spectrometry. This led to the establishment of the peptide mass fingerprint (PMF) method, a mass-measurement method for proteins.

Behind the dramatic progress in identifying trace proteins lies, in addition to advances in mass spectrometry, the enrichment of gene databases through genome analysis. It is no longer necessary to sequence the N-terminus or internal peptides to identify a protein. A protein separated by SDS-PAGE or the like can be identified simply by digesting it with an amino-acid-residue-specific protease such as trypsin and measuring the masses of the resulting peptides. In the PMF method, the sequences of all proteins from the species under analysis are digested in silico as if by trypsin, that is, cleaved at lysine (K) and arginine (R). The resulting peptides are listed, and the molecular weight of each peptide is used as a virtual database.

A protein can be identified by searching this database with the set of molecular weights of the peptides actually measured and finding the most similar protein. Because it is unlikely that multiple peptides fall within the sequence of a single protein by chance, the rate of correct identification is high even when not every peptide can be assigned, and the PMF method is now an indispensable technique in proteomics research.

The peptide mass fingerprint method made it possible to identify trace proteins easily, but no comparably simple method exists for trace RNA. RNA molecules have been harder to ionize than proteins, and their sensitive detection by mass spectrometry has been difficult. The present inventors, however, have made high-sensitivity mass spectrometry of RNA molecules possible, so that mass spectrometric data for identifying trace RNA molecules can now be obtained. Nevertheless, because peptides and RNA differ in the number of monomer types and in the database that must be searched, the data-processing part of the peptide mass fingerprint method cannot be used as it is to identify an RNA molecule from a list of molecular weights. A new method is therefore desired for identifying trace RNA molecules, in particular for identifying an RNA molecule on a genome sequence in silico from its molecular weight information.

As a result of intensive efforts to make an identification method equivalent to the peptide mass fingerprint method available for RNA molecules, the present inventors focused on the molecular weight differences between RNA fragments and arrived at the present invention, the RNA mass fingerprint method (RMF method). It identifies an RNA molecule by evaluating and scoring the similarity between a list of measured molecular weights and a virtual molecular weight list derived from a genome sequence database.

The present invention (the RNA mass fingerprint method) rapidly identifies an RNA gene from a genome database using molecular weight data of trace RNA analyzed by high-sensitivity mass spectrometry, and is based on the differences in molecular weight between RNA fragments. As shown in FIG. 4, the fragments produced when the genome base sequences of E. coli and yeast are cleaved by a given RNase were grouped by base composition, and their molecular weights were sorted and examined. When the genome sequence was cleaved at G, the smallest molecular weight difference between fragments was 0.21 Da for both E. coli and yeast. Since the error of the mass spectrometer is 0.1 Da, fragment molecular weight can be used to narrow down the composition in E. coli and yeast.

By contrast, when the molecular weights of fragments produced by cleaving proteins with trypsin (cleavage at R and K) are calculated, many compositions differ only slightly in molecular weight, and more than 80% differ by 0.05 Da or less, so narrowing down the composition from molecular weight alone is very difficult. Genome fragments are thus characteristically easier than protein fragments to assign from molecular weight to composition. For higher organisms with large genomes, such as mouse and human, the number of composition patterns increases and is expected to approach the theoretical number of base combinations, so the molecular weight differences between compositions of up to 35 bases were also calculated. The minimum difference between fragments was 0.17 Da. This shows that, with a sufficiently accurate mass spectrometer, the composition can be narrowed down from molecular weight even for higher organisms with large genomes such as mouse and human.
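The calculation of mass differences between compositions can be sketched as follows. This is a minimal illustration, not the inventors' procedure: the residue masses are approximate monoisotopic values, the end-group term `WATER` assumes a 5'-hydroxyl / 3'-phosphate fragment, and the names `composition_mass` and `min_mass_gap` are hypothetical. The source does not state which mass table or end-group convention was used, so the result need not reproduce the 0.17 Da figure.

```python
from itertools import product

# Approximate monoisotopic masses (Da) of nucleotide residues inside an RNA
# chain (nucleoside monophosphate minus water). Illustrative values only.
RESIDUE_MASS = {"A": 329.05252, "U": 306.02530, "C": 305.04129, "G": 345.04744}
WATER = 18.01056  # assumed end-group term (5'-OH, 3'-phosphate fragment)


def composition_mass(a, u, c, g):
    """Mass of a linear fragment with the given base counts."""
    return (a * RESIDUE_MASS["A"] + u * RESIDUE_MASS["U"]
            + c * RESIDUE_MASS["C"] + g * RESIDUE_MASS["G"] + WATER)


def min_mass_gap(max_len):
    """Smallest mass difference between distinct compositions of fragments
    that contain exactly one G (cut at G) and are 1..max_len bases long.
    Requires max_len >= 2 so that at least two compositions exist."""
    masses = []
    for a, u, c in product(range(max_len), repeat=3):
        if a + u + c <= max_len - 1:
            masses.append(composition_mass(a, u, c, 1))
    masses.sort()
    return min(hi - lo for lo, hi in zip(masses, masses[1:]))


if __name__ == "__main__":
    print(min_mass_gap(35))
```

Comparing the returned gap with the instrument error (0.1 Da in the text) indicates whether mass alone can separate compositions under these assumptions.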

Accordingly, the present invention relates to an RNA molecule search apparatus for marking and identifying, on an arbitrary genome sequence, an arbitrary RNA molecule containing at least one target RNA fragment, the apparatus comprising: storage means (10) for storing an arbitrary genome sequence of an arbitrary species together with data on the cleavage mechanism of a DNase, an RNase, or both, capable of cleaving that sequence; input means (20) for reading a target RNA fragment molecular weight obtained by measuring at least one target RNA fragment that can be produced by cleavage with the same kind of enzyme; and calculation means (30) for collating the at least one target RNA fragment molecular weight thus read with the sequence data and the cleavage-mechanism data in the storage means (10), and calculating a candidate region in which the target RNA fragment lies on the sequence held in the storage means (10).

The present invention also relates to an RNA molecule search method for marking and identifying, on an arbitrary genome sequence, an arbitrary RNA molecule containing at least one target RNA fragment, the method comprising: a storage step (10) of storing an arbitrary genome sequence of an arbitrary species together with data on the cleavage mechanism of a DNase, an RNase, or both, capable of cleaving that sequence; an input step (20) of reading a target RNA fragment molecular weight obtained by measuring at least one target RNA fragment that can be produced by cleavage with the same kind of enzyme; and a calculation step (30) of collating the at least one target RNA fragment molecular weight thus read with the sequence data and the cleavage-mechanism data of the storage step (10), and calculating a candidate region in which the target RNA fragment lies on the sequence of the storage step (10).

The present invention further relates to an RNA molecule search program for marking and identifying, on an arbitrary genome sequence, an arbitrary RNA molecule containing at least one target RNA fragment, or to a computer-readable recording medium on which the program is recorded, the program causing a computer to realize: a storage function (10) of storing an arbitrary genome sequence of an arbitrary species together with data on the cleavage mechanism of a DNase, an RNase, or both, capable of cleaving that sequence; an input function (20) of reading a target RNA fragment molecular weight obtained by measuring at least one target RNA fragment that can be produced by cleavage with the same kind of enzyme; and a calculation function (30) of collating the at least one target RNA fragment molecular weight thus read with the sequence and the cleavage-mechanism data of the storage function (10), and calculating a candidate region in which the target RNA fragment lies on the sequence of the storage function (10).

By measuring trace RNA directly with high-sensitivity mass spectrometry, without PCR amplification or radioisotope labeling, the present invention can identify the sequence of an RNA gene in silico from molecular weight information. Because the RNA contained in trace RNA-protein complexes (RNPs) immunoprecipitated from cells with antibodies can be measured rapidly and quantitatively, the invention can serve as an entirely new platform technology for RNA-protein interaction analysis, and it is expected to contribute greatly to building RNA-protein interaction networks in the future.

With the method and apparatus of the present invention, including the program, RNA mass spectrometry can become an important platform technology supporting next-generation RNA research, and RMF is indispensable for exploiting it. The technology could be deployed on a large scale, involving instrument manufacturers, the bioinformatics industry, drug-discovery ventures, national projects, and others.

[Definition of terms]
To make the content of the present invention easier to understand, the terms used in this specification are defined here. "Composition" in the present invention denotes the kinds of bases contained in a fragment and the number of each, regardless of their order in the sequence. For example, a fragment whose composition is written A1U0C2G1 contains one adenine residue, zero uracil residues, two cytosine residues, and one guanine residue, irrespective of the order in which they occur. "Molecular weight" in the present invention denotes either the actual molecular weight of the substance being measured or the molecular weight calculated by a known method from the mass-to-charge ratio (m/z) and charge (z) obtained from a mass spectrometer. That is, it denotes the molecular weight or data equivalent to it.
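The two definitions above can be illustrated with a short sketch. The function names `composition_label` and `neutral_mass` are hypothetical, and the conversion assumes the simplest case, in which an ion carries charge z only through loss (negative mode) or gain (positive mode) of z protons. The text says only that a known method is used.

```python
from collections import Counter

PROTON = 1.007276  # mass of a proton in Da


def composition_label(fragment):
    """Return the order-independent composition label, e.g. 'A1U0C2G1'."""
    counts = Counter(fragment.upper())
    return "".join(f"{base}{counts[base]}" for base in "AUCG")


def neutral_mass(mz, z, mode="negative"):
    """Neutral mass from m/z and charge state z (z as a positive integer).

    Assumes ions formed purely by loss (negative mode) or gain (positive
    mode) of z protons; other adducts are not handled.
    """
    if mode == "negative":
        return z * (mz + PROTON)
    if mode == "positive":
        return z * (mz - PROTON)
    raise ValueError("mode must be 'negative' or 'positive'")


if __name__ == "__main__":
    print(composition_label("CACG"))  # same label as for "GCCA"
    print(neutral_mass(500.0, 2))
```

Two fragments with the same bases in a different order, such as CACG and GCCA, receive the same label, which is what lets a measured mass be mapped to a composition without knowing the sequence.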

"Genome sequence" in the present invention denotes the single-stranded RNA sequences corresponding to each of the two strands of an arbitrary genome of an arbitrary species known at the time of filing of this patent application. For a virus with an RNA genome, for example, it also includes the sequence of the double-stranded or single-stranded RNA, and for a single-stranded DNA genome it denotes the corresponding RNA sequence. "Genomic fragment" denotes a fragment produced when the genome sequence is virtually cleaved according to the cleavage mechanism of any RNase, DNase, or both, that actually exists at the time of filing. "Genomic fragment molecular weight" denotes the molecular weight of a genomic fragment. "Genomic fragment composition" denotes the composition of a fragment of the virtually cleaved genome sequence. "Genomic fragment position" denotes positional data indicating where the genomic fragment lies on the genome sequence. "Genomic fragment number" denotes the number of genomic fragments on the genome sequence that share the same genomic fragment composition.

"Target RNA" in the present invention denotes a specific RNA molecule, in particular a functional RNA molecule, that is to be identified on the genome sequence. "Target RNA fragment" denotes a fragment obtained by actually cleaving the target RNA with the same RNase that was used to obtain the genomic fragments. "Target RNA fragment number" denotes the number assigned to a cleaved target RNA fragment. "Target RNA fragment molecular weight" denotes the molecular weight of a target RNA fragment. "Target RNA fragment composition" denotes the genomic fragment composition whose molecular weight equals the target RNA fragment molecular weight. "Target RNA fragment count" denotes the number of genomic fragments on the genome sequence that have the same composition as the target RNA fragment composition. "Target RNA fragment position" denotes the position of a genomic fragment on the genome sequence that has the same composition as the target RNA fragment composition.

[Embodiments of the invention]
The RNA molecule search apparatus of the present invention, which marks and identifies on an arbitrary genome sequence an arbitrary RNA molecule containing at least one target RNA fragment, comprises: storage means (10) for storing an arbitrary genome sequence of an arbitrary species together with data on the cleavage mechanism of a DNase, an RNase, or both, capable of cleaving that sequence; input means (20) for reading a target RNA fragment molecular weight obtained by measuring at least one target RNA fragment that can be produced by cleavage with the same kind of enzyme; and calculation means (30) for collating the at least one target RNA fragment molecular weight thus read with the sequence data and the cleavage-mechanism data in the storage means (10), and calculating a candidate region in which the target RNA fragment lies on the sequence held in the storage means (10).

Taking RNases as an example, the cleavage-mechanism data stored in the storage means (10) for a DNase, an RNase, or both, capable of cleaving the sequence include data for RNase T1, which cleaves specifically at guanine (G); RNase CL3, which cleaves specifically at cytosine (C); RNase A, which cleaves specifically at U or C; and RNase U2, which cleaves specifically at A or G. In addition, using the cleavage-mechanism data held in a storage area (for example, in memory), the storage means (10) can store data on the fragments obtained by virtually cleaving the stored genome sequence of the arbitrary species, expanded in the storage area (for example, memory).
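A virtual digest driven by such cleavage-mechanism data might look like the sketch below. The table `CLEAVAGE_RULES` and the function `digest` are hypothetical names. The base specificities are those listed in the text, while cutting on the 3' side of the recognized base is an assumption of this sketch.

```python
# Base specificities as listed in the text. Cutting on the 3' side of the
# recognized base is an assumption of this sketch.
CLEAVAGE_RULES = {
    "RNaseT1": {"G"},
    "RNaseCL3": {"C"},
    "RNaseA": {"U", "C"},
    "RNaseU2": {"A", "G"},
}


def digest(sequence, enzyme):
    """Virtually cleave an RNA sequence.

    Returns a list of (start, fragment) pairs with 0-based start positions.
    """
    cut_after = CLEAVAGE_RULES[enzyme]
    fragments, start = [], 0
    for i, base in enumerate(sequence):
        if base in cut_after:
            fragments.append((start, sequence[start:i + 1]))
            start = i + 1
    if start < len(sequence):  # trailing piece that ends without a cut site
        fragments.append((start, sequence[start:]))
    return fragments


if __name__ == "__main__":
    for enzyme in CLEAVAGE_RULES:
        print(enzyme, digest("AUGCCGAUAG", enzyme))
```

Keeping the specificities in a table rather than in code mirrors the text's separation of sequence data from cleavage-mechanism data: supporting another enzyme only requires another table entry.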

One example of the data on fragments obtained by virtually cleaving the genome sequence is a set consisting of the genomic fragment molecular weight, the genomic fragment composition, the genomic fragment number, and the genomic fragment position. To save space in the storage area (for example, memory), the present invention can further include storage means (11) that stores at least two of the items in this set, as shown in FIG. 2. As another example, data for the correction means (22) described below, which corrects errors, can also be stored, as shown in Table E of the same figure.
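One possible in-memory layout for this set of data is sketched below. It is not the layout of FIG. 2, which is not reproduced here. The function `build_fragment_table` is a hypothetical name, the digest cuts after G only, and the residue masses and end-group term are approximate values assumed for illustration.

```python
from collections import Counter, defaultdict

# Approximate monoisotopic residue masses (Da); illustrative values only.
RESIDUE_MASS = {"A": 329.05252, "U": 306.02530, "C": 305.04129, "G": 345.04744}
WATER = 18.01056  # assumed end-group term (5'-OH, 3'-phosphate fragment)


def build_fragment_table(genome, cut_after="G"):
    """Map each genomic fragment composition to its mass, count and positions.

    The genome is virtually cleaved after every base in `cut_after`;
    positions are 0-based fragment start coordinates.
    """
    table = defaultdict(lambda: {"mass": 0.0, "count": 0, "positions": []})
    start = 0
    for i, base in enumerate(genome):
        if base in cut_after or i == len(genome) - 1:
            fragment = genome[start:i + 1]
            counts = Counter(fragment)
            label = "".join(f"{b}{counts[b]}" for b in "AUCG")
            entry = table[label]
            entry["mass"] = sum(RESIDUE_MASS[b] for b in fragment) + WATER
            entry["count"] += 1
            entry["positions"].append(start)
            start = i + 1
    return dict(table)


if __name__ == "__main__":
    for label, row in build_fragment_table("AUGCCGAUGG").items():
        print(label, round(row["mass"], 3), row["count"], row["positions"])
```

Keying the table by composition stores the mass once per composition rather than once per fragment, which is one way to realize the space saving the text mentions.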

The input means (20) in the present invention reads a target RNA fragment molecular weight obtained by measuring at least one target RNA fragment that can be produced by cleavage with the same kind of enzyme. The target RNA fragment need not actually be produced by cleavage with an RNase. If its molecular weight is known, that value can be entered. If it is unknown, the molecular weight can be measured directly, for example by LC/MS (liquid chromatography/mass spectrometry) or MALDI-TOF MS (matrix-assisted laser desorption/ionization time-of-flight mass spectrometry), and then entered.

To identify the target RNA fragment more accurately, the present invention can further include input means (21) that reads target RNA fragment molecular weights obtained by measuring at least one target RNA fragment produced by actually cleaving the target RNA with the same kind of enzyme. For example, the molecular weight of each actually cleaved target RNA fragment can be entered as an array I(n), where n is an integer of 1 or more denoting the target RNA fragment number.

The calculation means (30) in the present invention collates the at least one target RNA fragment molecular weight that has been read with the sequence data and the cleavage-mechanism data in the storage means (10), and calculates a candidate region in which the target RNA fragment lies on the sequence held in the storage means (10). For example, it can collate the target RNA fragment molecular weight with the data on virtually cleaved genome sequence fragments held in the storage area and calculate the candidate region in which the target RNA fragment lies on the genome sequence.

To calculate more accurately the candidate region in which the target RNA lies on the genome sequence, the present invention can further include extraction means (31) that, after collating the at least one target RNA fragment molecular weight that has been read with the data in the storage means (10), the storage means (11), or both, extracts the target RNA fragment composition. Specifically, the genomic fragment composition belonging to the same data set as a genomic fragment molecular weight that matches the target RNA fragment molecular weight is taken as the target RNA fragment composition. It is defined, for example, in the form of a matrix H(n), where n is an integer of 1 or more denoting the target RNA fragment number, and stored in a storage area (including a medium).
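The extraction of H(n) can be sketched as a tolerance lookup in a sorted mass list. The function name `match_compositions`, the list-of-pairs table format, and the default tolerance (taken from the 0.1 Da instrument error quoted earlier in the text) are assumptions of this sketch. The example masses were computed from approximate monoisotopic residue masses and are illustrative only.

```python
from bisect import bisect_left, bisect_right


def match_compositions(observed_masses, table, tolerance=0.1):
    """Return H, where H[n] lists the genomic fragment compositions whose
    mass lies within `tolerance` Da of the n-th observed mass.

    `table` is a list of (mass, composition) pairs.
    """
    rows = sorted(table)
    masses = [mass for mass, _ in rows]
    H = []
    for observed in observed_masses:
        lo = bisect_left(masses, observed - tolerance)
        hi = bisect_right(masses, observed + tolerance)
        H.append([composition for _, composition in rows[lo:hi]])
    return H


if __name__ == "__main__":
    # Hypothetical reference rows (mass in Da, composition label).
    table = [
        (998.13582, "A1U1C0G1"),
        (997.15181, "A1U0C1G1"),
        (363.05800, "A0U0C0G1"),
    ]
    print(match_compositions([998.10, 500.0], table))
```

An empty entry in H marks a measured mass with no matching composition, which is the situation the error-correction step in the text is meant to handle.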

However, even when the molecular weight of a target RNA fragment is measured by LC/MS (liquid chromatography/mass spectrometry) or MALDI-TOF MS (matrix-assisted laser desorption/ionization time-of-flight mass spectrometry), various factors such as modifying groups cause it to deviate from the genomic fragment molecular weight. Possible causes include: (1) errors due to the form of the terminal phosphate group of the RNA fragment; (2) errors due to a wrong breakdown of the U and C counts in the composition; (3) errors because a virtual cleavage does not actually occur owing to a modification; (4) errors due to the fragments at the two ends of the original RNA; and (5) errors in which the wrong mass is extracted, for example because of natural isotopes.

To allow more accurate target RNA molecular weights to be entered, the present invention further includes correction means (22) that corrects errors in the at least one target RNA fragment molecular weight read by the input means (21), so that the handling of target RNA fragment molecular weights containing errors can be defined comprehensively across a variety of cases. For example, all molecular weight changes that can arise in an RNA base sequence, as known at the time of filing of this patent application, are stored in advance in a predetermined database. When a target RNA fragment molecular weight loaded into memory does not match any genomic fragment molecular weight, the apparatus judges that some cause of error is present in the target RNA fragment and applies a molecular weight correction to the target RNA fragment molecular weight suspected of containing the error.
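The retry-with-correction logic can be sketched as follows. The table `MASS_SHIFTS` is a hypothetical, two-entry stand-in for the database of molecular weight changes described in the text. Its entries (a 2',3'-cyclic phosphate end as loss of water relative to a 3'-phosphate, and a single methylation as one added CH2) are common examples chosen for illustration, not values taken from the source.

```python
# Hypothetical correction table: cause -> mass shift (Da) that the cause adds
# to the expected fragment mass. Illustrative entries only.
MASS_SHIFTS = {
    "2',3'-cyclic phosphate": -18.01056,  # loss of water vs. a 3'-phosphate
    "methylation": 14.01565,              # one added CH2
}


def match_with_correction(observed, reference_masses, tolerance=0.1):
    """Return (matched_reference_mass, cause) or None.

    A direct match is tried first (cause is None). If that fails, each known
    shift is undone in turn and the match is retried.
    """
    def find(mass):
        for reference in reference_masses:
            if abs(reference - mass) <= tolerance:
                return reference
        return None

    direct = find(observed)
    if direct is not None:
        return direct, None
    for cause, shift in MASS_SHIFTS.items():
        corrected = find(observed - shift)
        if corrected is not None:
            return corrected, cause
    return None


if __name__ == "__main__":
    references = [998.13582]  # hypothetical genomic fragment mass
    for observed in (998.14, 980.13, 1012.15, 500.0):
        print(observed, match_with_correction(observed, references))
```

Returning the cause alongside the match keeps a record of which correction was needed, so a later scoring step could weight corrected matches differently if desired.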

To calculate more accurately the candidate region in which the target RNA fragment lies on the genome sequence, the calculation means (30) can include extraction means (32) that collates the obtained target RNA fragment composition with the data in the storage means (10), the storage means (11), or both, and extracts at least one target RNA fragment count on the genome sequence. Specifically, the genomic fragment number belonging to the same data set as a genomic fragment composition that matches the target RNA fragment composition is taken as the target RNA fragment count. It is defined, for example, in the form of a matrix F(n), where n is an integer of 1 or more denoting the target RNA fragment number, and stored in a storage area (including a medium).

To calculate more accurately the candidate region in which the target RNA fragment lies on the genome sequence, the calculation means (30) can further include extraction means (33) that collates the obtained target RNA fragment composition with the data in the storage means (10), the storage means (11), or both, and extracts at least one target RNA fragment position on the genome sequence. Specifically, the genomic fragment position belonging to the same data set as a genomic fragment composition that matches the target RNA fragment composition is taken as the target RNA fragment position. It is defined, for example, in the form of a matrix L(n), where n is an integer of 1 or more denoting the target RNA fragment number, and stored in a storage area (including a medium). This makes it possible to identify the places on the genome where the target RNA fragment is likely to lie.

To calculate the candidate region more accurately, the calculation means (30) may further include a scanning means (34) that scans the genome sequence composition within a frame of a predetermined base length, set in a predetermined direction along the genome sequence from each of the obtained target RNA fragment positions. The presence of a target RNA fragment composition on the genome sequence indicates a high likelihood that the target RNA lies nearby. Therefore, by setting a predetermined frame from every target RNA fragment position on the genome sequence and scanning all genome sequence compositions within each frame against the target RNA fragment compositions, a frame containing all of the target RNA fragment compositions can be detected on the genome sequence.

The frame length in the present invention is not limited, but is preferably the base length of the target RNA. When the frame is set to the length of the base sequence of the target RNA, a frame that contains all of the target RNA fragment compositions is very likely to be the target RNA to be identified, so the position of the target RNA on the genome sequence is essentially located. In the present invention it is preferable to determine the frame length by measuring the base length of the target RNA by another means, such as electrophoresis. Any method known at the time of filing of the present application that can measure the length of the base sequence of the target RNA can be used in the present invention.

To quantify the calculated candidate region of the target RNA fragment molecule on the genome sequence, the calculation means (30) may further include a calculation means (35) that calculates the appearance probability of each target RNA fragment within the frame, based on the number of target RNA fragment compositions that match a composition in the obtained frame (the target RNA fragment count). Preferred methods of calculating the appearance probability in the present invention include the appearance frequency ratio method and the binomial distribution method.

In the appearance frequency ratio method used in the present invention, let F_total be the total number of RNA fragments obtained by virtually cleaving the genome, F_max the genome fragment count of the most numerous genome fragment among those of a predetermined length of two or more bases, and F_n the count of a given target RNA fragment on the genome sequence (where n is an integer of 1 or more denoting the target RNA fragment number). The appearance frequency ratio P(n) of that target RNA fragment on the genome sequence is calculated by the following formula and is used in the score calculation as the appearance probability of the composition fragment within a frame.

$$P(n) = \frac{F_n / F_{\mathrm{total}}}{F_{\max} / F_{\mathrm{total}}}$$

For example, when RNase T1 is used as the RNA-degrading enzyme and only fragments of three or more bases are reflected in the score, the count of fragments with the three-base composition A0U1C1G1, which has the highest appearance frequency among them, may be used as F_max.
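As a minimal sketch of the appearance frequency ratio method, the Python below computes P(n) for each composition from a table of genome fragment counts. The function name, the (A, U, C, G) tuple representation, and all counts are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
from typing import Dict, Tuple

Composition = Tuple[int, int, int, int]  # (A, U, C, G) base counts; representation assumed here


def frequency_ratios(counts: Dict[Composition, int], min_length: int = 3) -> Dict[Composition, float]:
    """Return P(n) = (F_n / F_total) / (F_max / F_total) for compositions of at least min_length bases."""
    f_total = sum(counts.values())  # all virtual fragments, regardless of length
    eligible = {c: f for c, f in counts.items() if sum(c) >= min_length}
    f_max = max(eligible.values())  # count of the most frequent eligible composition
    return {c: (f / f_total) / (f_max / f_total) for c, f in eligible.items()}


# Hypothetical genome fragment counts, invented for illustration only.
counts = {
    (0, 0, 0, 1): 9000,  # G alone: too short to be scored
    (0, 1, 1, 1): 4000,  # A0U1C1G1: most frequent 3-base composition
    (1, 1, 1, 1): 1000,
    (3, 1, 2, 1): 20,
}
ratios = frequency_ratios(counts)
```

Because F_total cancels, the ratio reduces to F_n / F_max, so a rare composition receives a small P(n) and contributes strongly once a negative logarithm is taken.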

Besides the appearance frequency ratio method, the present invention can also use the binomial distribution method. The binomial distribution method uses the probability p that a specific composition appears at any single point on the genome (p = appearance frequency of the specific composition / genome length). The probability that a specific composition appears a specific number of times within a frame is considered to follow a binomial distribution with p as the success probability and the frame length as the number of trials. That is, for a frame length l, the random variable X follows the binomial distribution B(p, l), written X ~ B(p, l); the theoretical appearance probability of the composition may also be used for p. The probability derived from this binomial distribution, or from the Poisson distribution that approximates it, can be used in the score calculation as the appearance probability of the composition fragment within the frame.
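The binomial distribution method can be sketched as follows, assuming the frame length is used directly as the number of trials, as the text describes. The function names and the numerical values (genome length, occurrence count, frame length) are hypothetical.

```python
import math


def binomial_pmf(k: int, trials: int, p: float) -> float:
    """P[X = k] when X counts successes in `trials` independent trials of probability p."""
    return math.comb(trials, k) * p**k * (1.0 - p) ** (trials - k)


def poisson_pmf(k: int, trials: int, p: float) -> float:
    """Poisson approximation with mean trials * p, reasonable when p is small."""
    lam = trials * p
    return math.exp(-lam) * lam**k / math.factorial(k)


# Hypothetical numbers, for illustration only.
genome_length = 4_600_000  # bases
occurrences = 46           # times the composition occurs in the genome
frame_length = 120         # l, used as the number of trials
p = occurrences / genome_length

pf_binomial = binomial_pmf(1, frame_length, p)
pf_poisson = poisson_pmf(1, frame_length, p)
```

With p this small the two values agree closely, which is why the Poisson form is a convenient substitute when the frame is long.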

To express the calculated candidate region of the target RNA on the genome sequence as a still clearer number, the calculation means (35) may further include a calculation means (36) that calculates a score from the appearance probabilities of the target RNA fragments within the frame. In the present invention, the score indicating the likelihood that the target RNA fragments are present within a frame is either the sum of the logarithms (log) of the appearance probabilities or ratios P(n), calculated by the appearance frequency ratio method or the binomial distribution method, of all target RNA fragments contained in the frame, or the logarithm (log) of the product of those P(n).

Because this score is based on the product of the appearance probabilities or ratios of the target RNA fragments within the frame (0 < P(n) < 1), the more target RNA fragments a frame contains, the smaller this product of positive numbers less than 1 becomes, and the product reaches its minimum when all fragments lie within a single frame. For ease of interpretation, the negative logarithm (-log) of the product is taken, so that a larger score corresponds to a higher occurrence of the target RNA within the frame. Taking the negative logarithm (-log) of each P(n) first and then summing is mathematically identical, so the present invention places no limitation on the order of these operations in calculating the score.

For example, let Pf denote the probability P[X = k] that a specific composition appears k times within a given frame, that is, Pf = P[X = k]. Then -log(Pf) is taken as the score for that composition. A score is calculated for each distinct composition appearing in the frame, and their sum can be used as the score of the frame.
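A short sketch of the frame score follows. It assumes a base-10 logarithm, which the text does not specify, and uses made-up Pf values; it also checks that summing -log(Pf) equals taking -log of the product.

```python
import math
from typing import Iterable


def frame_score(probabilities: Iterable[float]) -> float:
    """Sum of -log10(Pf) over the compositions found in one frame (log base is an assumption)."""
    return sum(-math.log10(p) for p in probabilities)


# Hypothetical per-composition probabilities Pf for a single frame.
pf_values = [0.1, 0.01, 0.5]
score_from_sum = frame_score(pf_values)
score_from_product = -math.log10(math.prod(pf_values))
```

Summing logarithms is usually the safer order in practice, since the product of many small probabilities can underflow to zero in floating point.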

Because the frames of the present invention are set from the target RNA fragment positions appearing on the genome sequence, there are as many frames as there are such positions, and one score is calculated per frame. The top scores can therefore be extracted by sorting the scores in descending order. When multiple target RNAs are present on the genome sequence, the top score is not necessarily unique, and several high scores may exist.
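Extracting the top scores amounts to a descending sort over one score per frame. The sketch below assumes scores are keyed by frame start position; the function name, positions, and scores are hypothetical.

```python
from typing import Dict, List, Tuple


def top_frames(scores: Dict[int, float], limit: int = 3) -> List[Tuple[int, float]]:
    """Return (frame start position, score) pairs, highest score first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:limit]


# Hypothetical frame scores keyed by frame start position on the genome.
scores = {1000: 41.2, 52000: 41.2, 300: 7.9, 98000: 38.6}
ranked = top_frames(scores)
```

Tied top scores, as in this example, are what one would expect when several copies of the same RNA gene exist in the genome.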

The present invention, which marks and identifies an arbitrary RNA molecule containing a target RNA fragment on an arbitrary genome sequence, is applicable in particular to mammals such as mice and humans. For humans especially, it can mark and identify on the human genome an arbitrary RNA molecule containing a specific target RNA fragment.

The present invention is built on treating RNA as a "molecule" and using its molecular weight and composition. DNA, which differs from RNA in only one base, can likewise be treated as a "molecule". Based on DNA molecular weights and compositions, which differ little from those of RNA, the present invention can mark and identify an arbitrary DNA molecule containing a target DNA fragment on an arbitrary genome sequence. In this case, a DNA-degrading enzyme is used in place of the RNA-degrading enzyme.

That is, the present invention also relates to a DNA molecule search apparatus and search method for marking and identifying an arbitrary DNA molecule containing a target DNA fragment on an arbitrary genome sequence of an arbitrary biological species, to a program that causes a computer to realize the function of marking and identifying such a DNA molecule on an arbitrary genome sequence of an arbitrary biological species, and to a computer-readable recording medium on which the program is recorded. The description of the present invention given above for RNA applies as is to target DNA fragments.

The present invention relates not only to an RNA molecule search apparatus for marking and identifying an arbitrary RNA molecule containing a target RNA fragment on an arbitrary genome sequence of an arbitrary biological species, but also to the corresponding search method, to an RNA molecule search program that causes a computer to realize that marking and identifying function, and to a computer-readable recording medium on which the program is recorded.

The present invention can also provide an RNA molecule search method by making the means that constitute the RNA molecule search apparatus correspond to the respective steps of the search method. These means are the storage means (10), input means (20), and calculation means (30), together with the storage means (11), input means (21), correction means (22), extraction means (31), extraction means (32), extraction means (33), scanning means (34), calculation means (35), and calculation means (36).

Likewise, the above means can be made to correspond to the respective functions of an RNA molecule search program that causes a computer to realize the function of marking and identifying an arbitrary RNA molecule containing a target RNA fragment on an arbitrary genome sequence of an arbitrary biological species, and of a computer-readable recording medium on which the program is recorded.

Although only embodiments of the RNA molecule search apparatus are described here, embodiments of the search method, of the RNA molecule search program that causes a computer to realize the marking and identifying function, and of the computer-readable recording medium on which the program is recorded can be read correspondingly from the description of the apparatus embodiments, and are therefore also disclosed herein.

The most preferred embodiment of the present invention is shown in FIG. 1.

In the figure, 11 denotes the storage means (11), included in the storage means (10), which virtually cleaves an arbitrary genome sequence of an arbitrary biological species according to the cleavage mechanism of a DNA-degrading enzyme, an RNA-degrading enzyme, or both, capable of cleaving the genome sequence, and stores at least two items of a data set consisting of genome fragment molecular weight, genome fragment composition, genome fragment count, and genome fragment position (see FIG. 2).

As shown in FIG. 2, this embodiment first has a storage means (11) that stores a data set consisting of genome fragment molecular weight, genome fragment composition, genome fragment count, and genome fragment position, obtained by virtually cleaving the RNA sequence corresponding to an arbitrary genome sequence of an arbitrary biological species according to the cleavage mechanism of a given RNA-degrading enzyme that cleaves RNA specifically.

The genome fragment database 20 of FIG. 2 contains at least a table that stores, as one data set, the genome fragment molecular weight, genome fragment composition, genome fragment position, and genome fragment count. These data are obtained by virtually cleaving, for example in silico on a computer and according to the cleavage mechanism of a specific RNA-degrading enzyme, the sequences corresponding to both strands (forward and reverse) of a known genome sequence, such as one obtainable from a commercially available database such as NCBInr or TrEMBL.

As in Table D (24) of FIG. 2, the genome fragment composition, molecular weight, count, and position of the virtually cleaved genome fragments may be stored as one data set in a single table. Alternatively, as in Tables A to C (21, 22, and 23), the genome fragment composition may be paired with each of the other data items and stored in separate tables. To keep memory usage small when the data are loaded into memory, it is preferable to store the genome fragment composition, molecular weight, count, and position in separate tables, as shown at 21, 22, and 23 in FIG. 2.
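One way to picture the separate tables is as three dictionaries keyed by composition. This is only a sketch: the mapping of molecular weight, position, and count onto tables A, B, and C follows the way those tables are used elsewhere in this description, while the function name, the composition strings as keys, and the placeholder masses are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Fragment = Tuple[int, str, float]  # (position, composition, molecular weight)


def build_tables(fragments: Iterable[Fragment]):
    """Split fragment records into composition-keyed tables (mass, positions, count)."""
    table_a: Dict[str, float] = {}                     # composition -> molecular weight
    table_b: Dict[str, List[int]] = defaultdict(list)  # composition -> positions
    table_c: Dict[str, int] = defaultdict(int)         # composition -> fragment count
    for position, composition, mass in fragments:
        table_a[composition] = mass
        table_b[composition].append(position)
        table_c[composition] += 1
    return table_a, dict(table_b), dict(table_c)


# Hypothetical records; the masses are placeholders, not computed values.
fragments = [(0, "A1U1C1G1", 1303.2), (4, "A0U0C0G1", 363.1), (5, "A1U1C1G1", 1303.2)]
table_a, table_b, table_c = build_tables(fragments)
```

Keying on composition means each molecular weight is stored once per composition rather than once per fragment, which is where the memory saving comes from.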

The specific cleavage of any RNA-degrading enzyme known at the time of filing of the present application can be used for the virtual cleavage of RNA sequences in the present invention. Examples include RNase T1, which cleaves specifically at guanine (G); RNase CL3, which cleaves specifically at cytosine; RNase A, which cleaves specifically at U or C; and RNase U2, which cleaves specifically at A or G. The specifically cleaving RNA-degrading enzymes used in the present invention are not limited to these examples.
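Virtual cleavage can be sketched as a single pass over the sequence. The sketch assumes each enzyme cuts on the 3' side of the bases it recognizes, and it writes compositions in the A/U/C/G count notation used in the examples; the function names and the test sequence are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Recognized bases per enzyme, as listed in the text; cutting after the base is an assumption.
CLEAVAGE_BASES = {"RNase T1": "G", "RNase CL3": "C", "RNase A": "UC", "RNase U2": "AG"}


def digest(rna: str, enzyme: str = "RNase T1") -> List[Tuple[int, str]]:
    """Cut after every recognized base; return (0-based start position, fragment) pairs."""
    cut_after = CLEAVAGE_BASES[enzyme]
    fragments, start = [], 0
    for index, base in enumerate(rna):
        if base in cut_after:
            fragments.append((start, rna[start:index + 1]))
            start = index + 1
    if start < len(rna):  # trailing piece with no cleavage site at its end
        fragments.append((start, rna[start:]))
    return fragments


def composition(fragment: str) -> str:
    """Format base counts in the form A1U1C1G1."""
    counts = Counter(fragment)
    return f"A{counts['A']}U{counts['U']}C{counts['C']}G{counts['G']}"


t1_fragments = digest("AUCGGCAUGUU")
```

Note that AUCG and CAUG yield the same composition string, which is why a composition can map to several genome positions.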

The genome used for the genome fragment database of this embodiment may be any genome of any biological species, from Escherichia coli and yeast to various mammals and humans, without particular limitation. The E. coli and yeast genomes are used here to make the explanation of this embodiment easier to follow, but the embodiment is not limited to them. An example of the E. coli genome used here is that of strain K12 MG1655, and an example of the yeast genome is that of the budding yeast Saccharomyces cerevisiae. Names of E. coli gene products include 5S rRNA, 6S RNA, 4.5S RNA, 23S rRNA, and 16S rRNA, and names of budding yeast genes include snR9, scR1, snR128, snR190, snR14, and snR6.

In the figure, 12 denotes the input means (21), included in the input means (20), which reads, as a matrix I(n) (where n is an integer of 1 or more denoting the target RNA fragment number), the target RNA fragment molecular weights obtained by measuring at least one target RNA fragment produced by actual cleavage with the same degrading enzyme as above (see FIG. 3).

The aim of this embodiment is to identify on the genome sequence an actually existing RNA molecule whose sequence and other properties are unknown. The target RNA is actually cleaved with the same degrading enzyme as above, and a data set is read that consists of the target RNA fragment molecular weights, obtained by measuring at least one resulting target RNA fragment, and the corresponding target RNA fragment numbers.

Specifically, a target RNA molecule of unknown sequence, such as 30 in FIG. 3, is cleaved into target RNA fragments such as 31 with RNase T1, which cleaves specifically at G. The molecular weights of all target RNA fragments are then measured, for example by LC/MS (liquid chromatography/mass spectrometry) or MALDI-TOF MS (matrix-assisted laser desorption/ionization time-of-flight mass spectrometry). As shown at 32, each target RNA fragment number and its corresponding molecular weight are stored in a storage area (including a medium), for example in Table Y in the form of an array I(n) (where n is an integer of 1 or more denoting the target RNA fragment number).

In the figure, 13 and 14 denote the correction means (22), which corrects errors in the molecular weight of at least one target RNA fragment read by the input means (21). In this embodiment, all molecular weight changes that can occur in an RNA base sequence, as known at the time of filing of the present application, are stored in advance in a predetermined database, as in Table E (25) of FIG. 2. When collation of a target RNA fragment molecular weight loaded into memory against the genome fragment molecular weights yields no match, it is judged that some cause of error exists in the target RNA fragment (see 13 in FIG. 1), and the molecular weight of that target RNA fragment is corrected (see 14 in FIG. 1).
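The correction step can be sketched as a retry loop over a table of known mass changes. This is an illustrative reading, not the patented procedure: the function name, the tolerance, the genome fragment masses, and the single-entry stand-in for Table E are all assumptions (the only real value is the mass of an added methyl group, about 14.01565 Da).

```python
from typing import Dict, List, Optional, Sequence, Tuple


def match_with_correction(observed: float, genome_masses: Sequence[float],
                          shifts: Dict[str, float], tolerance: float = 0.01
                          ) -> Tuple[float, Optional[str], List[float]]:
    """Return (mass used, name of shift removed or None, matching genome fragment masses)."""

    def hits(mass: float) -> List[float]:
        return [g for g in genome_masses if abs(g - mass) <= tolerance]

    direct = hits(observed)
    if direct:  # step 13: a direct match means no correction is needed
        return observed, None, direct
    for name, delta in shifts.items():  # step 14: undo one candidate mass change at a time
        corrected = observed - delta
        found = hits(corrected)
        if found:
            return corrected, name, found
    return observed, None, []


# Stand-in for Table E with a single entry; a real table would hold many more shifts.
table_e = {"methylation": 14.01565}
genome_masses = [1000.0, 1305.2]  # hypothetical genome fragment masses
result = match_with_correction(1014.0157, genome_masses, table_e)
```

Returning the name of the shift that produced the match keeps a record of which modification was assumed for each fragment.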

In the figure, 15 denotes the extraction means (31), included in the calculation means (30), which collates the molecular weight of at least one read target RNA fragment with the data of the storage means (10), the storage means (11), or both, and then extracts the target RNA fragment compositions as a matrix H(n) (where n is an integer of 1 or more denoting the target RNA fragment number).

In this embodiment, Table Y of FIG. 3 and Table A (21) of FIG. 2 are loaded into a computation area, for example memory, and each target RNA fragment molecular weight is collated with the genome fragment molecular weights. When they match, the genome fragment composition having the same molecular weight as the target RNA fragment is defined as the target RNA fragment composition. It is stored together with the target RNA fragment number as one data set in a storage area (including a medium), for example in Table Y in the form of a matrix H(n) (where n is an integer of 1 or more denoting the target RNA fragment number). By this means, the composition of each fragment of the actually cleaved target RNA can be obtained.
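Matching measured masses against Table A might look like the sketch below. The tolerance window, the function name, and every mass value are hypothetical; the text speaks only of masses that "match" and does not state a tolerance.

```python
from typing import Dict, List, Sequence


def extract_compositions(observed: Sequence[float], table_a: Dict[str, float],
                         tolerance: float = 0.01) -> Dict[int, List[str]]:
    """Return H(n): for fragment number n (1-based), the compositions whose mass matches."""
    return {
        n: [comp for comp, mass in table_a.items() if abs(mass - observed_mass) <= tolerance]
        for n, observed_mass in enumerate(observed, start=1)
    }


# Placeholder masses, not computed from real nucleotide masses.
table_a = {"A1U1C1G1": 1303.2, "A0U0C0G1": 363.1, "A2U0C1G1": 1326.2}
observed_masses = [363.105, 1303.195, 999.0]  # I(1), I(2), I(3)
h = extract_compositions(observed_masses, table_a)
```

Each H(n) is kept as a list because more than one composition can fall inside the tolerance window, and an empty list flags a fragment that may need mass correction.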

In the figure, 16 denotes the extraction means (32), included in the calculation means (30), which further collates the obtained target RNA fragment composition with the data of the storage means (10), the storage means (11), or both, and extracts the count of at least one target RNA fragment on the genome sequence as a matrix F(n) (where n is an integer of 1 or more denoting the target RNA fragment number). The target RNA fragment composition is collated with the genome fragment compositions in Table C of FIG. 2. When they match, the number of genome fragments on the genome sequence having the same composition as the target RNA fragment is defined as the target RNA fragment count on the genome sequence and is stored in memory together with the target RNA fragment number as one data set, for example as a matrix F(n).

In the figure, 17 denotes the extraction means (33), by which the calculation means (30) further collates the obtained target RNA fragment composition with the data of the storage means (10), the storage means (11), or both, and extracts at least one target RNA fragment position on the genome sequence as L(n) (where n is an integer of 1 or more denoting the target RNA fragment number). The target RNA fragment composition is collated with the genome fragment compositions in Table B of FIG. 2. When they match, the positions of genome fragments on the genome sequence having the same composition as the target RNA fragment are defined as the target RNA fragment positions and can be stored in memory together with the target RNA fragment number as one data set, for example as a matrix L(n). This makes it possible to identify locations on the genome where the target RNA fragment is likely to be present.
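Once compositions are known, F(n) and L(n) reduce to lookups in the count and position tables. In this sketch H(n) holds a single composition per fragment for simplicity; the function name and all table contents are hypothetical.

```python
from typing import Dict, List, Tuple


def extract_counts_and_positions(h: Dict[int, str], table_b: Dict[str, List[int]],
                                 table_c: Dict[str, int]
                                 ) -> Tuple[Dict[int, int], Dict[int, List[int]]]:
    """Return (F, L): per fragment number, the genome-wide count and the positions."""
    f = {n: table_c.get(comp, 0) for n, comp in h.items()}
    positions = {n: table_b.get(comp, []) for n, comp in h.items()}
    return f, positions


# Hypothetical tables and compositions, for illustration only.
h = {1: "A1U1C1G1", 2: "A3U1C2G1"}
table_b = {"A1U1C1G1": [0, 5, 880], "A3U1C2G1": [40]}  # composition -> positions
table_c = {"A1U1C1G1": 3, "A3U1C2G1": 1}               # composition -> count
f_counts, l_positions = extract_counts_and_positions(h, table_b, table_c)
```

A composition with a small count, such as fragment 2 here, narrows the candidate locations far more than a common one.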

In the figure, 18 denotes the calculation means (35), which expresses the presence of target RNA fragment compositions within a frame on the genome sequence as the appearance frequency of those compositions. It calculates, by the binomial distribution method, the appearance frequency P(n) of each target RNA fragment within the frame from the count F(n) of at least one target RNA fragment present in the frame.

In the figure, 19 denotes the scanning means (34), which scans the genome sequence composition within a frame of the base length of the target RNA, set in a predetermined direction along the genome sequence from at least one target RNA fragment position obtained by the extraction means (33). The presence of a target RNA fragment composition on the genome sequence indicates that a target RNA molecule may lie nearby. Therefore, by setting a predetermined frame from every target RNA fragment position at which a target RNA fragment composition is present and scanning the entire genome sequence within each frame, a frame containing all of the target RNA fragment compositions can be detected on the genome sequence (see FIG. 5).
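Frame scanning can be sketched as follows, under simplifying assumptions: a genome fragment counts as inside a frame when its start position falls within the frame, frames extend in the forward direction only, and repeated target compositions are matched as a multiset. The function name and data are hypothetical, and the quadratic loop is for clarity rather than speed.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def scan_frames(genome_fragments: Sequence[Tuple[int, str]],
                target_compositions: List[str], frame_length: int) -> Dict[int, int]:
    """For each frame anchored at a matching fragment, count the target compositions it contains."""
    wanted = Counter(target_compositions)
    results = {}
    for start, comp in genome_fragments:
        if comp not in wanted:
            continue  # frames are anchored only at target RNA fragment positions
        in_frame = Counter(c for pos, c in genome_fragments
                           if start <= pos < start + frame_length)
        matched = in_frame & wanted  # multiset intersection
        results[start] = sum(matched.values())
    return results


# Hypothetical (position, composition) records from a virtual digest.
genome_fragments = [(0, "A1U1C1G1"), (4, "A0U0C0G1"), (5, "A1U1C1G1"),
                    (9, "A0U2C0G0"), (50, "A1U1C1G1")]
targets = ["A1U1C1G1", "A1U1C1G1"]
hits_per_frame = scan_frames(genome_fragments, targets, frame_length=20)
```

Here only the frame starting at position 0 contains both target fragments, so it would be the candidate region passed on to scoring.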

In the figure, 21 or 22 denotes the calculation means (36), included in the calculation means (35), which calculates a score from the appearance probabilities of the target RNA fragments within the frame. 23 and 24 denote registering the obtained scores in a list and displaying that list (see FIGS. 6 and 7).

Example 1
Based on the above description, LC/MS measurement data were obtained for fragments prepared by treating purified E. coli 5S ribosomal RNA molecules with RNase T1. From these data it was found that some or most of the compositions corresponding to A1U2C5G1, A1U2C5G1, A2U1C4G1, A3U1C2G1, A3U0C1G1, A3U0C1G1, A2U1C1G1, A2U1C1G1, A1U0C2G1, A1U0C2G1, A0U1C2G1, A0U1C2G1, A2U0C0G1, A1U1C0G1, A1U1C0G1, and A0U0C2G1 are contained in the E. coli 5S ribosomal RNA gene regions. Using the E. coli (strain K12 MG1655) genome (downloadable from ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K12/) as the genome sequence and the above RNA fragments as the target RNA fragments, their positions in the genome were identified, and the 5S ribosomal RNA genes, which are present at eight locations in E. coli, were successfully assigned. One of the eight genes has a score different from the other seven because the composition of its least frequent fragment differs. The results are shown in FIGS. 6(a) to 6(c).

Example 2
Based on the above description, LC/MS measurement data were obtained for fragments prepared by treating purified budding yeast 5S ribosomal RNA molecules with RNase T1. From these data it was found that some or most of the compositions corresponding to A4U4C4G1, A4U3C5G1, A4U1C2G1, A3U1C3G1, A2U4C1G1, A2U3C2G1, A2U2C2G1, A2U2C2G1, A0U3C3G1, A3U1C1G1, A2U2C1G1, A1U3C1G1, A0U3C2G1, A3U1C0G1, A2U2C0G1, A1U1C2G1, A0U1C3G1, A0U2C2G1, A3U0C0G1, A2U1C0G1, A2U1C0G1, A1U1C1G1, A1U0C2G1, A0U1C2G1, A0U1C2G1, and A1U1C0G1 are contained in the 5S ribosomal RNA gene regions. Using the budding yeast genome (downloadable from ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae) as the genome sequence and the above RNA fragments as the target RNA fragments, their positions in the genome were identified, and the 5S ribosomal RNA genes, which are present at six locations in budding yeast, were successfully assigned. The results are shown in FIGS. 7(a) to 7(c).

Flowchart showing the sequence of steps for identifying a target RNA fragment on a genome sequence according to the present invention.
Schematic diagram showing that the genome fragment molecular weight, genome fragment composition, genome fragment number, and genome fragment position are stored in a database.
Schematic diagram showing the generation of target RNA fragments and the storage of target RNA fragment molecular weights in order of target RNA fragment number.
Graph showing that the molecular weight differences between genome fragments differ from those between peptide fragments.
Schematic diagram of scanning the genome sequence composition within a frame placed on the genome sequence.
Table listing, in order, the scores for the presence of the target RNA fragments of Example 1 on the E. coli genome sequence.
Table listing, in order, the scores for the presence of the target RNA fragments of Example 1 on the E. coli genome sequence.
Table listing, in order, the scores for the presence of the target RNA fragments of Example 1 on the E. coli genome sequence.
Table listing, in order, the scores for the presence of the target RNA fragments of Example 2 on the yeast genome sequence.
Table listing, in order, the scores for the presence of the target RNA fragments of Example 2 on the yeast genome sequence.
Table listing, in order, the scores for the presence of the target RNA fragments of Example 2 on the yeast genome sequence.
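The frame scan named in the figure descriptions above can be sketched as follows. This is only an illustration under stated assumptions: a frame of fixed base length is opened at each position where a target fragment was matched, and the number of distinct target compositions starting inside that frame is counted. The function name `scan_frames`, the input layout, and the simplification of testing only fragment start positions are hypothetical choices, not the procedure of the specification.

```python
from bisect import bisect_left


def scan_frames(hits, frame_len):
    """Count distinct matched compositions per frame.

    hits: iterable of (position, label) pairs for matched fragments.
    frame_len: frame length in bases (for example, the target RNA length).
    Returns (frame_start, distinct_label_count) pairs, best frames first.
    Simplification: a fragment counts if its start lies in the frame.
    """
    hits = sorted(hits)
    positions = [pos for pos, _ in hits]
    results = []
    for start in sorted(set(positions)):
        lo = bisect_left(positions, start)
        hi = bisect_left(positions, start + frame_len)
        labels = {label for _, label in hits[lo:hi]}
        results.append((start, len(labels)))
    # Highest count first; ties broken by genome position.
    return sorted(results, key=lambda r: (-r[1], r[0]))
```

Frames holding many distinct target compositions are the natural candidates to pass on to a probability or score calculation.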

Claims

1. An RNA molecule search device that marks and identifies, on an arbitrary genome sequence, an arbitrary RNA molecule containing at least one target RNA fragment, the device comprising: storage means (10) for storing data on an arbitrary genome sequence of an arbitrary species and on the cleavage mechanism of a DNA-degrading enzyme and/or an RNA-degrading enzyme capable of cleaving the sequence; input means (20) for reading the molecular weight of at least one target RNA fragment, obtained by measuring at least one target RNA fragment that can be cleaved by the same degrading enzyme as said degrading enzyme; and calculation means (30) for collating the read molecular weight of the at least one target RNA fragment with the data on the sequence and the cleavage mechanism in the storage means (10) and calculating a candidate region in which the target RNA fragment is present on the sequence in the storage means (10).

2. The RNA molecule search device according to claim 1, wherein the storage means (10) further comprises storage means (11) for virtually cleaving an arbitrary genome sequence of an arbitrary species according to the cleavage mechanism of a DNA-degrading enzyme and/or an RNA-degrading enzyme capable of cleaving the genome sequence, and for storing at least two items from the data set consisting of genome fragment molecular weight, genome fragment composition, genome fragment number, and genome fragment position.

3. The RNA molecule search device according to claim 1 or 2, wherein the input means (20) further comprises input means (21) for reading the molecular weight of the target RNA fragment obtained by measuring at least one target RNA fragment obtained by actual cleavage with the same degrading enzyme as said degrading enzyme.

4. The RNA molecule search device according to claim 3, wherein the input means (20) further comprises correction means (22) for correcting an error in the molecular weight of the at least one target RNA fragment read by the input means (21).

5. The RNA molecule search device according to claim 1, wherein the calculation means (30) further comprises extraction means (31) for extracting the composition of the at least one target RNA fragment after collating the read molecular weight of the at least one target RNA fragment with the data of the storage means (10) and/or the storage means (11).

6. The RNA molecule search device according to claim 5, wherein the calculation means (30) further comprises extraction means (32) for extracting the number of the at least one target RNA fragment after collating the obtained RNA fragment composition with the data of the storage means (10) and/or the storage means (11).

7. The RNA molecule search device according to claim 6, wherein the calculation means (30) further comprises extraction means (33) for extracting the position of the at least one target RNA fragment after collating the obtained target RNA fragment composition with the data of the storage means (10) and/or the storage means (11).

8. The RNA molecule search device according to claim 7, wherein the calculation means (30) further comprises scanning means (34) for scanning the genome sequence composition within a frame of a predetermined base length, set in a predetermined direction on the genome sequence from the obtained position of the at least one target RNA fragment.

9. The RNA molecule search device according to claim 8, wherein the scanning means (34) sets the base length of the target RNA as the predetermined base length of the frame.

10. The RNA molecule search device according to claim 8 or 9, wherein the calculation means (30) further comprises calculation means (35) for calculating the appearance probability of the target RNA fragment in the frame based on the number of control RNA fragment compositions (the number of target RNA fragments) in the frame that match the obtained composition.

11. The RNA molecule search device according to claim 10, wherein the calculation means (35) calculates the appearance probability of the target RNA fragment in the frame by an appearance frequency ratio method.

12. The RNA molecule search device according to claim 10, wherein the calculation means (35) calculates the appearance probability of the target RNA fragment in the frame by a binomial distribution method.

13. The RNA molecule search apparatus according to claim 10, wherein the calculation means (35) further includes calculation means (36) for calculating a score based on the appearance probability of the target RNA fragment in a frame.

14. The RNA molecule search apparatus according to claim 1, wherein the arbitrary genome sequence of the arbitrary biological species is an arbitrary human genome sequence.

15. The DNA molecule search apparatus according to claim 1, wherein the target RNA fragment is a DNA fragment.

16. An RNA molecule search method for marking and identifying, on an arbitrary genome sequence, an arbitrary RNA molecule containing at least one target RNA fragment, the method comprising: a storage step (10) of storing data on an arbitrary genome sequence of an arbitrary species and on the cleavage mechanism of a DNA-degrading enzyme and/or an RNA-degrading enzyme capable of cleaving the sequence; an input step (20) of reading the molecular weight of at least one target RNA fragment, obtained by measuring at least one target RNA fragment that can be cleaved by the same degrading enzyme as said degrading enzyme; and a calculation step (30) of collating the read molecular weight of the at least one target RNA fragment with the data on the sequence and the cleavage mechanism from the storage step (10) and calculating a candidate region in which the target RNA fragment is present on the sequence from the storage step (10).

17. The RNA molecule search method according to claim 16, wherein the storage step (10) further comprises a storage step (11) of virtually cleaving an arbitrary genome sequence of an arbitrary species according to the cleavage mechanism of a DNA-degrading enzyme and/or an RNA-degrading enzyme capable of cleaving the genome sequence, and storing at least two items from the data set consisting of genome fragment molecular weight, genome fragment composition, genome fragment number, and genome fragment position.

18. The RNA molecule search method according to claim 16, wherein the input step (20) further comprises an input step (21) of reading the molecular weight of the target RNA fragment obtained by measuring at least one target RNA fragment obtained by actual cleavage with the same degrading enzyme as said degrading enzyme.

19. The RNA molecule search method according to claim 18, wherein the input step (20) further comprises a correction step (22) of correcting an error in the molecular weight of the at least one target RNA fragment read in the input step (21).

20. The RNA molecule search method according to claim 16, wherein the calculation step (30) further comprises an extraction step (31) of extracting the composition of the at least one target RNA fragment after collating the read molecular weight of the at least one target RNA fragment with the data of the storage step (10) and/or the storage step (11).

21. The RNA molecule search method according to claim 20, wherein the calculation step (30) further comprises an extraction step (32) of extracting the number of the at least one target RNA fragment after collating the obtained target RNA fragment composition with the data of the storage step (10) and/or the storage step (11).

22. The RNA molecule search method according to claim 21, wherein the calculation step (30) further comprises an extraction step (33) of extracting the position of the at least one target RNA fragment after collating the obtained target RNA fragment composition with the data of the storage step (10) and/or the storage step (11).

23. The RNA molecule search method according to claim 22, wherein the calculation step (30) further comprises a scanning step (34) of scanning the genome sequence composition within a frame of a predetermined base length, set in a predetermined direction on the genome sequence from the obtained position of the at least one target RNA fragment.

24. The RNA molecule search method according to claim 23, wherein the scanning step (34) sets the base length of the target RNA as the predetermined base length of the frame.

25. The RNA molecule search method according to claim 23 or 24, wherein the calculation step (30) further comprises a calculation step (35) of calculating the appearance probability of the target RNA fragment in the frame based on the number of control RNA fragment compositions (the number of target RNA fragments) in the frame that match the obtained composition.

26. The RNA molecule search method according to claim 25, wherein the calculating step (35) calculates the appearance probability of the target RNA fragment in the frame by an appearance frequency ratio method.

27. The RNA molecule search method according to claim 25, wherein the calculation step (35) calculates the appearance probability of the target RNA fragment in the frame by a binomial distribution method.

28. The RNA molecule search method according to any one of claims 25 to 27, wherein the calculation step (35) further includes a calculation step (36) of calculating a score based on the appearance probability of the target RNA fragment in the frame.

29. The RNA molecule search method according to any one of claims 16 to 28, wherein the arbitrary genome sequence of the arbitrary biological species is an arbitrary human genome sequence.

30. The DNA molecule search method according to claim 16, wherein the target RNA fragment is a DNA fragment.

31. An RNA molecule search program for marking and identifying, on an arbitrary genome sequence, an arbitrary RNA molecule containing at least one target RNA fragment, the program causing a computer to realize: a memory function (10) of storing data on an arbitrary genome sequence of an arbitrary species and on the cleavage mechanism of a DNA-degrading enzyme and/or an RNA-degrading enzyme capable of cleaving the sequence; an input function (20) of reading the molecular weight of at least one target RNA fragment, obtained by measuring at least one target RNA fragment that can be cleaved by the same degrading enzyme as said degrading enzyme; and a calculation function (30) of collating the read molecular weight of the at least one target RNA fragment with the data on the sequence and the cleavage mechanism in the memory function (10) and calculating a candidate region in which the target RNA fragment is present on the sequence in the memory function (10).

32. The RNA molecule search program according to claim 31, wherein the memory function (10) further comprises a memory function (11) of virtually cleaving an arbitrary genome sequence of an arbitrary species according to the cleavage mechanism of a DNA-degrading enzyme and/or an RNA-degrading enzyme capable of cleaving the genome sequence, and storing at least two items from the data set consisting of genome fragment molecular weight, genome fragment composition, genome fragment number, and genome fragment position.

33. The RNA molecule search program according to claim 31 or 32, wherein the input function (20) further comprises an input function (21) of reading the molecular weight of the target RNA fragment obtained by measuring at least one target RNA fragment obtained by actual cleavage with the same degrading enzyme as said degrading enzyme.

34. The RNA molecule search program according to claim 33, wherein the input function (20) further comprises a correction function (22) of correcting an error in the molecular weight of the at least one target RNA fragment read by the input function (21).

35. The RNA molecule search program according to any one of claims 31 to 34, wherein the calculation function (30) further comprises an extraction function (31) of extracting the composition of the at least one target RNA fragment after collating the read molecular weight of the at least one target RNA fragment with the data of the memory function (10) and/or the memory function (11).

36. The RNA molecule search program according to claim 35, wherein the calculation function (30) further comprises an extraction function (32) of extracting the number of the at least one target RNA fragment after collating the obtained target RNA fragment composition with the data of the memory function (10) and/or the memory function (11).

37. The RNA molecule search program according to claim 36, wherein the calculation function (30) further comprises an extraction function (33) of extracting the position of the at least one target RNA fragment after collating the obtained RNA fragment composition with the data of the memory function (10) and/or the memory function (11).

38. The RNA molecule search program according to claim 37, wherein the calculation function (30) further comprises a scanning function (34) of scanning the genome sequence composition within a frame of a predetermined base length, set in a predetermined direction on the genome sequence from the obtained position of the at least one target RNA fragment.

39. The RNA molecule search program according to claim 38, wherein the scanning function (34) sets the base length of the target RNA as the predetermined base length of the frame.

40. The RNA molecule search program according to claim 38 or 39, wherein the calculation function (30) further comprises a calculation function (35) of calculating the appearance probability of the target RNA fragment in the frame based on the number of control RNA fragment compositions (the number of target RNA fragments) in the frame that match the obtained composition.

41. The RNA molecule search program according to claim 40, wherein the calculation function (35) calculates an appearance probability of the target RNA fragment in a frame by an appearance frequency ratio method.

42. The RNA molecule search program according to claim 40, wherein the calculation function (35) calculates the appearance probability of the target RNA fragment in the frame by a binomial distribution method.

43. The RNA molecule search program according to claim 40, wherein the calculation function (35) further includes a calculation function (36) for calculating a score based on the appearance probability of the target RNA fragment in a frame.

44. The RNA molecule search program according to any one of claims 31 to 43, wherein the arbitrary genome sequence of the arbitrary biological species is an arbitrary human genome sequence.

45. The DNA molecule search program according to claim 31, wherein the target RNA fragment is a DNA fragment.

46. A computer-readable recording medium on which the program according to any one of claims 31 to 45 is recorded.
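The claims above recite calculating an appearance probability by a binomial distribution method and deriving a score from it, without giving the formulas in this section. A minimal sketch of one way such a calculation could look follows. It assumes that each of n target fragment compositions matches inside a frame by chance with the same independent probability p, takes the probability of observing at least k matches as the appearance probability, and uses the negative base-10 logarithm as the score. These modeling choices and the names `binomial_tail` and `score` are assumptions for illustration, not the formulas of the specification.

```python
from math import comb, log10


def binomial_tail(n: int, k: int, p: float) -> float:
    """Probability of at least k successes in n independent trials,
    each succeeding with probability p (upper binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def score(n: int, k: int, p: float) -> float:
    """Assumed score: -log10 of the chance probability, so that frames
    less likely to arise by chance receive higher scores."""
    prob = binomial_tail(n, k, p)
    return float("inf") if prob == 0.0 else -log10(prob)
```

Under these assumptions, a frame in which many of the target compositions co-occur receives a much higher score than a frame with only a few chance matches, which is the ordering a ranked score table needs.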