JP3903420B2 - System for identifying functional sites of RNA from nucleotide sequences

Info

Publication number: JP3903420B2
Application number: JP2002037443A
Authority: JP
Inventors: Kenji Yamamoto (山本 健二); Junji Kitagawa (北川 純次)
Original assignee: National Institute of Biomedical Innovation NIBIO
Current assignee: National Institute of Biomedical Innovation NIBIO
Priority date: 2002-02-14
Filing date: 2002-02-14
Publication date: 2007-04-11
Anticipated expiration: 2022-02-14
Also published as: JP2003242153A

Description

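[0001]
[Technical field of the invention]
The present invention relates to a method for identifying functional sites of RNA.
[0002]
[Prior art]
RNA is a chain molecule of linked ribonucleotides. In living organisms, the bases of ribonucleotides are of four types: A (adenine), U (uracil), C (cytosine), and G (guanine). RNA in vivo plays the following important roles, mainly in protein synthesis.
mRNA (messenger RNA): in protein synthesis, specifies the amino acid sequence of a protein based on the base sequence information of the genome.
tRNA (transfer RNA): in protein synthesis, carries the amino acids that make up a protein according to the genetic code of the mRNA.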
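[0003]
In protein synthesis, the base sequence information of RNA does not merely encode an amino acid sequence; RNA itself is known to acquire complex functions by forming higher-order structures. A region of RNA required for a function other than encoding an amino acid sequence is called a functional site (functional region). For example, it has been pointed out that the secondary structure of mRNA in the translation initiation region is a factor that controls translation into protein. Such a function is distinct from the genetic code for encoding an amino acid sequence.
[0004]
In addition, RNA with a particular structure can act like an enzyme that recognizes and cleaves a base sequence. RNA with such activity is called a ribozyme. The role of ribozymes in splicing is also regarded as important. In eukaryotic cells, after the genomic base sequence is transcribed, the exons that make up a gene are selectively joined to produce mature mRNA. This process is called splicing. Both the three-dimensional structure of the RNA that constitutes the ribozyme and that of the RNA cleaved by it are important factors affecting splicing. This function, too, is distinct from the genetic code that encodes an amino acid sequence.
[0005]
Identifying, from base sequence data, what functions a sequence may have at the RNA stage is important for medical and pharmaceutical research. One conventional technique for predicting RNA functional sites is to collect and compare the corresponding base sequences from multiple lineages. For example, the base sequences are aligned as shown below and compared to extract the base positions they share. In this example, the sequences shown in capital letters are extracted as conserved among the four species.
Species 1 ...auuGCCggAUA...
Species 2 ...acgGCCauAUA...
Species 3 ...ucuGCCguAUA...
Species 4 ...uucGCCuaAUA...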
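The comparison in the four-species example can be sketched in a few lines of Python. This is only an illustration of the conventional approach, assuming the sequences are already aligned to equal length; the function names `conserved_columns` and `mark_conserved` are hypothetical and do not come from the specification.

```python
def conserved_columns(aligned):
    """For each alignment column, return the fraction of sequences that
    share the most common base (case is ignored)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be pre-aligned to equal length")
    fractions = []
    for column in zip(*aligned):
        bases = [base.lower() for base in column]
        most_common = max(set(bases), key=bases.count)
        fractions.append(bases.count(most_common) / len(bases))
    return fractions


def mark_conserved(aligned, threshold=1.0):
    """Rewrite the first sequence with conserved columns in capitals."""
    fractions = conserved_columns(aligned)
    reference = aligned[0].lower()
    return "".join(
        base.upper() if fraction >= threshold else base
        for base, fraction in zip(reference, fractions)
    )


species = ["auuGCCggAUA", "acgGCCauAUA", "ucuGCCguAUA", "uucGCCuaAUA"]
print(mark_conserved(species))  # auuGCCggAUA
```

The per-column fraction is the same kind of quantity as a conservation rate between species, and it makes the drawback plain: the result is only as good as the number of species whose sequences are available.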
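[0006]
A site containing the extracted base sequence is evolutionarily conserved and is therefore considered likely to be a functional site. Because of codon degeneracy, a translated amino acid sequence may be conserved even where the base sequences differ. Conversely, a region whose RNA base sequence is highly conserved irrespective of the amino acid sequence is likely to be important as an RNA functional site.
This method requires accumulating a large amount of RNA base sequence information across many species. The biological experiments needed to obtain those sequences therefore take a great deal of time and labor.
[0007]
Attempts have been made to find RNA functional sites not by simple sequence comparison but by predicting the higher-order structure of RNA from base sequence information. For example, Zuker established an algorithm for predicting the three-dimensional structure of RNA from base sequence information (M. Zuker & P. Stiegler, Nucleic Acids Research 9, 133-148, 1981).
[0008]
Shapiro represented RNA as a tree structure composed of stems and loops, and expressed the difference between two tree structures as a numerical difference (Shapiro, B.A. and Zhang, K., Comparing multiple RNA secondary structures using tree comparisons, CABIOS, 6(4), 309-318, 1990). Hofacker et al. further used a method that evaluates the distance between RNA secondary structures based on the difference between graphs (Automatic detection of conserved RNA structure elements in complete RNA virus genomes. Nucleic Acids Res. Aug 15;26(16):3825-36, 1998).
[0009]
Comparing RNA three-dimensional structures with such techniques should allow functional sites to be inferred more specifically than analysis based on simple sequence comparison. However, comparing structures still requires actually obtaining base sequence information. The known methods for inferring RNA functional sites are, in the end, methods that find a common structure by comparing base sequences that actually exist and identify it as a functional site. In other words, they share the problem of sequence comparison methods: inferring functional sites requires accumulating RNA base sequence information.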
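[0010]
[Problems to be solved by the invention]
An object of the present invention is to provide a novel method for identifying RNA functional sites that does not require accumulating RNA base sequence information.
[0011]
[Means for solving the problems]
The present inventors focused on the relationship between the function of RNA and its higher-order structure. RNA is known to form characteristic higher-order structures through complementary base pairing, and the function of an RNA is thought to be maintained by the higher-order structure it forms.
On this basis, the present inventors reasoned that by introducing mutations into the base sequence to be analyzed and using the degree of change in the higher-order structure of the resulting mutant base sequences as an index, the functional importance of the mutated site could be predicted. They then confirmed that the RNA functional sites predicted on this principle agree with functional sites confirmed by known methods, and completed the present invention.
[0012]
The present inventors further found that this analysis method can be applied to evaluating the effect that a mutation caused by a single-base substitution has on an RNA functional site. More specifically, by restricting mutations to single-base substitutions and measuring the degree to which the higher-order structure changes, the effect of a single base on the structure and function of RNA can be determined. Single-base substitutions can also be evaluated on the basis of such analysis.
[0013]
That is, the present invention relates to the following method for identifying RNA functional sites, a computer program, and an apparatus therefor. The present invention also relates to a method, applying the method for identifying RNA functional sites according to the invention, for identifying single nucleotide polymorphisms that affect RNA function.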
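[0014]
The present invention [1] is:
A system for identifying functional sites of RNA, comprising at least: a memory unit that stores the base sequence to be analyzed of the RNA whose functional sites are to be identified and a numerical sequence quantifying the secondary structure of that RNA, and that also stores, so as to be independently searchable, the set of all mutant base sequences that can be artificially generated from the base sequence to be analyzed and the set of numerical sequences quantifying the secondary structures of the RNAs having those mutant base sequences; a control unit; and a display unit that displays the position i of the mutated base of an identified functional site and the difference C(i) in secondary structure between the RNA whose functional sites are to be identified and the RNA having the mutant base sequence, the system comprising:
a) means for artificially generating, from the base sequence to be analyzed read out from the memory unit, single nucleotide polymorphism mutant base sequences in which each constituent base is replaced in turn by each of the three bases different from the original base;
b) means for quantifying the secondary structure of the RNA having the base sequence to be analyzed or a mutant base sequence, which reads the base sequence to be analyzed and the mutant base sequences from the memory unit and uses a functional module that predicts the secondary structure of each RNA having the base sequence to be analyzed or a mutant base sequence, and a functional module that represents each predicted secondary structure as a tree structure composed of loop structures and stem structures and converts it uniquely into a numerical sequence based on the nodes of the tree structure and their ranks;
c) means for calculating the inter-structure distance d(n,m), which sequentially compares the numerical sequence Sa(n) for the base sequence to be analyzed obtained in b) with the numerical sequence Sb(m) for each mutant base sequence, aligns them so that homologous parts coincide, and, using the formulas below, calculates the distance of the non-matching parts as the difference in node ranks, thereby obtaining each difference in secondary structure as a numerical difference:
(Formula 2)
d(n,m) = min{ d(n-1,m-1) + Costs(n,m), d(n,m-1) + Costs(0), d(n-1,m) + Costs(0) }
where, in Formula 2,
(Formula 1)
Costs(n,m) = |Sa(n) - Sb(m)|
Costs(0) = 1;
d) means for identifying the position i of the mutated base in a mutant base sequence as part of a functional site of the base sequence to be analyzed when the difference in secondary structure C(i) between the RNA of the base sequence to be analyzed and the RNA of each mutant base sequence (where the variable i denotes the position of the mutated base in the mutant base sequence), defined by the inter-structure distance d(n,m) obtained as the difference in secondary structure in c), exceeds the mean value Ca of C(i) by at least the standard deviation σ; and
e) means for outputting to the display unit the base position i of the functional site identified in d) and the difference in secondary structure C(i);
wherein the control unit includes each of the means a) to e).
The present invention [2] is:
The system for identifying functional sites of RNA according to claim 1, wherein the means d) identifies the position i of the mutated base in the mutant base sequence as part of a functional site of the base sequence to be analyzed when the numerical difference C(i) exceeds the mean value Ca of C(i) by at least 1.65σ, σ being the standard deviation.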
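[0015]
[Embodiments of the invention]
The present invention relates to a method for identifying RNA functional sites that includes steps a) to d) above.
In the present invention, a functional site (functional region) means a region of RNA required for a function other than encoding an amino acid sequence. For example, cases are known in which the mechanism controlling expression of genetic information is supported by the higher-order structure that RNA forms.
[0016]
Specifically, it has been shown that the heat shock response of Escherichia coli is controlled by the stability of the secondary structure of the translation initiation region in the mRNA of σ³², one of the σ subunits of RNA polymerase. This control mechanism is an example in which the secondary structure of transcribed mRNA affects the AUG translation initiation codon and the ribosome binding site.
[0017]
Furthermore, mitochondrial diseases, whose symptoms appear in skeletal muscle and the central nervous system, are in many cases said to be caused by base substitutions at specific positions in tRNA genes on mitochondrial DNA. Why a DNA point mutation causes mitochondrial abnormalities, and why the clinical symptoms vary with the type of point mutation, remain largely unexplained. It has been suggested that point mutations cause structural changes in tRNA that reduce its codon-pairing ability on the ribosome, and the present invention can be a useful means of predicting cases in which such a structural change in RNA is the direct cause of a phenotype.
[0018]
The stability of the higher-order structure of mRNA is also closely related to the efficiency of translation into protein. Therefore, by using the present invention to predict which parts of an mRNA base sequence matter, when mutated, for maintaining the higher-order structure of the mRNA, it may be possible to control the translation efficiency of a specific gene. The invention could thus be applied as a very important technology, for example for suppressing translation in RNA viruses.
[0019]
Whereas the genetic code of RNA depends on its primary structure (that is, its base sequence), the functions of RNA other than the genetic code depend on its higher-order structure. A functional site can therefore be regarded as a site needed to maintain a particular higher-order structure. In other words, a functional site can also be defined as a region containing the base sequence that the RNA needs in order to maintain a particular higher-order structure.
[0020]
The functional site of the present invention can also be defined as follows. RNA is thought to play important roles in vivo by binding specifically to low-molecular-weight substances such as nucleosides and to macromolecules such as proteins, and by catalyzing specific reactions. An RNA functional site can therefore also be defined as a structural site indispensable for such roles. RNA with a defect (mutation) in a functional site cannot perform its native function in vivo and may therefore cause some abnormality.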
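[0021]
In step a) above, the RNA whose functional sites are to be identified may be any RNA. In the present invention, the base sequence of that RNA is called the base sequence to be analyzed. For example, the following kinds of base sequence information can be used as the base sequence to be analyzed:
base sequence information based on the base sequence of an actually obtained cDNA,
base sequence information constituting a hypothetical gene predicted by genome analysis,
base sequence information of an artificially constructed gene,
base sequence information of a gene encoding a fusion protein, and
base sequence information of a gene encoding a chimeric protein.
[0022]
When the base sequence is taken from DNA, the analysis may proceed on it as is, or after conversion to an RNA base sequence. An RNA base sequence is commonly written using the corresponding DNA base sequence. This specification follows that practice and sometimes represents an RNA base sequence by the base sequence of the DNA from which it is transcribed. As long as the base sequence information derives from mature RNA, it can be converted to an RNA base sequence and used as the base sequence to be analyzed. A DNA base sequence is converted to an RNA base sequence by replacing base t with base u.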
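The t-to-u conversion is simple enough to show directly. The following is a minimal sketch; the helper names `looks_like_dna` and `to_rna` are hypothetical, and treating any sequence that contains t as DNA is an assumption made here for illustration.

```python
def looks_like_dna(sequence):
    """Treat a sequence as DNA if it contains base t (assumption: the
    input contains only the letters a, c, g, t or u in either case)."""
    return "t" in sequence.lower()


def to_rna(sequence):
    """Replace base t with base u, preserving letter case."""
    return sequence.replace("t", "u").replace("T", "U")


print(to_rna("atcgcttc"))  # aucgcuuc
```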
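[0023]
The invention includes a step of artificially generating, from the base sequence to be analyzed, at least one mutant base sequence carrying a mutation. In the present invention, a mutant base sequence is a base sequence in which the base sequence to be analyzed contains any mutation. Neither the method of introducing the mutation nor the type of mutation is limited. More specifically, a mutant base sequence can be generated by introducing into the base sequence to be analyzed a mutation by substitution, deletion, insertion, or addition of bases.
[0024]
The mutation may be introduced at a single base or at multiple bases. Moreover, a mutant base sequence may carry not just one of substitution, deletion, insertion, or addition but several kinds of mutation together. When a mutation is introduced by substitution, insertion, or addition, any of one to four kinds of base can be used. For substitution, for example, up to three different replacements are possible for a given base. The mutation in the present invention may generate all three substitutions, or substitute only one or two of the bases.
[0025]
In particular, the three base sequences generated by replacing one specific base with each of the three other bases are preferred mutant base sequences in the present invention. Generating such mutant base sequences for every base of the base sequence to be analyzed makes it possible to compare all the structures that the sequence can adopt through mutation. Substitution does not change the length of the RNA, so the effect of each base on the structure can be compared free of other factors.
[0026]
In the present invention, a preferred mutant base sequence is one whose mutations are confined to a smaller region. The purpose of the analysis is to identify RNA functional sites. Because the invention uses the change in three-dimensional structure caused by mutation as its index, the mutated sites in a mutant base sequence are preferably confined to a limited region. Confining the mutation sites can be expected to yield more accurate identification of RNA functional sites.
[0027]
For example, the multiple mutant base sequences obtainable by mutating one or a few specific bases of the base sequence to be analyzed are preferred mutant base sequences in the present invention. The number of bases mutated is at least 1, for example 1 to 10, or 1 to 5, or even 1 to 2.
[0028]
In the present invention, a base known to be a single nucleotide polymorphism (hereinafter SNPs) can be selected as the specific single base at which to mutate the base sequence to be analyzed. SNPs are polymorphisms arising from insertion, deletion, or substitution of a single base. Because they are the most frequent polymorphisms in the human genome, SNPs are attracting attention as subjects of research into disease diagnosis and treatment.
In genetics, a difference in nucleic acid base sequence that occurs at a frequency of 1% or more within a population is specifically called a polymorphism. A population here means a group distinguished by geographic isolation or subspecies. For example, a variant found in fewer than 1% of Japanese people is still a polymorphism rather than a mutation if it is found at a frequency of 1% or more in another ethnic group.
[0029]
Mutations in the base sequence of a gene region can lead to abnormal translation of the gene through aberrant splicing, changes in the three-dimensional structure of mRNA, and so on. SNPs are estimated to occur about once every 300 to 600 bases in the human genome, and enormous numbers of SNPs are in fact being identified one after another. These SNPs should include some that affect the three-dimensional structure of RNA, some that change the translated amino acid sequence, and some that lead to no genetic change. Using the functional site identification method of the present invention, SNPs that may affect RNA functional sites can be readily identified from among this enormous number. A base known to be an SNP is therefore one of the preferred bases at which to introduce a mutation in the present invention.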
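[0030]
Next, for the base sequence to be analyzed and each mutant base sequence generated from it, the higher-order structure of the RNA having that base sequence is predicted and quantified. Higher-order structure includes secondary structure and above. Secondary structure generally refers to a structure that can be laid out in a plane, composed of stems and loops formed by pairing between complementary bases in the RNA base sequence. Tertiary structure refers to the three-dimensional structure formed when the parts of the RNA secondary structure interact in space. Tertiary structure depends on secondary structure. Not every cause of a change in tertiary structure can be explained by changes in secondary structure alone. In many cases, however, a change in secondary structure is accompanied by a change in tertiary structure. Predicting a change in secondary structure therefore amounts to predicting a change in tertiary structure.
[0031]
Methods for predicting structure from an RNA base sequence are known. For example, Zuker's secondary structure prediction program uses an algorithm that, given a base sequence and energy parameters, exhaustively searches for the RNA secondary structure that minimizes the energy.
[0032]
In Shapiro's method, RNA is represented as a tree structure composed of stems and loops, and the length of base-paired regions is also taken into account. Shapiro and Zhang defined an operation called a tree edit operation and defined the distance between two structures as the minimum number of tree edit operations needed to transform one tree structure into the other (Shapiro, B.A. and Zhang, K., Comparing multiple RNA secondary structures using tree comparisons, CABIOS, 6(4), 309-318, 1990).
Hofacker et al. scored each base of an RNA secondary structure by the number of base pairs enclosing it, represented the whole structure as a two-dimensional graph called a mountain plot, and evaluated the distance between structures as the difference between those graphs (Automatic detection of conserved RNA structure elements in complete RNA virus genomes. Nucleic Acids Res. Aug 15;26(16):3825-36, 1998).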
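The mountain representation can be illustrated with a short sketch. Two assumptions are made here that the specification does not state: structures are written in dot-bracket notation, and the "difference between graphs" is taken as the sum of absolute height differences for structures of equal length. The function names are hypothetical.

```python
def mountain(dot_bracket):
    """Height at each position: the number of base pairs enclosing that
    base (a paired base does not count its own pair)."""
    heights = []
    depth = 0
    for symbol in dot_bracket:
        if symbol == "(":
            heights.append(depth)
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced structure")
            heights.append(depth)
        elif symbol == ".":
            heights.append(depth)
        else:
            raise ValueError(f"unexpected symbol: {symbol!r}")
    if depth != 0:
        raise ValueError("unbalanced structure")
    return heights


def mountain_distance(a, b):
    """Sum of absolute height differences (assumed distance measure)."""
    heights_a, heights_b = mountain(a), mountain(b)
    if len(heights_a) != len(heights_b):
        raise ValueError("structures must have the same length")
    return sum(abs(x - y) for x, y in zip(heights_a, heights_b))


print(mountain("((..))"))                     # [0, 1, 2, 2, 1, 0]
print(mountain_distance("((..))", "(....)"))  # 2
```

Because substitutions leave the sequence length unchanged, an equal-length comparison of this kind would be applicable to the mutant sequences discussed in this specification.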
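[0033]
Besides Zuker's method, the method of the Vienna RNA Package is also known as an algorithm for predicting RNA secondary structure (Hofacker et al., Monatsh. Chem. 125: 167-188, 1994). Like Zuker's, it uses dynamic programming, and it seeks better prediction accuracy by taking into account the probability that a given pair of bases will pair.
For short RNA molecules of up to a few hundred bases, these secondary structure prediction algorithms can predict the secondary structure with high reliability. The computation usually takes only a few seconds, so they can be used in the present invention.
[0034]
Software for calculating the tertiary structure of RNA is also known. For example, Insight and Amber are software packages that compute three-dimensional structure by molecular dynamics (MD) (Pearlman et al., AMBER 4.0, University of California, San Francisco, 1991). MD is a technique that numerically solves the equations of motion of a many-body system by a finite difference method. Tertiary structure analysis with this software involves first measuring the three-dimensional structure of the target substance by an experimental technique such as X-ray crystallography or NMR to serve as the initial structure, and then computing by MD while adding atoms and molecules to it.
To apply these tertiary structure prediction methods to the present invention, the target RNA must be crystallized and its tertiary structure confirmed experimentally in advance, for example by NMR. If this is possible, the tertiary structure with a mutation introduced can be computed and the structural change quantified, so that these methods can be used in the present invention in the same way as secondary structure prediction.
[0035]
In the present invention, the predicted higher-order structures are quantified and compared. Higher-order structures can be compared as numerical values by methods such as those above. More specifically, the comparative analysis shown in the examples, which focuses on the tree structure of the RNA secondary structure, is preferred as the method of quantifying and comparing higher-order structures in the present invention. This method is based on an algorithm that converts a secondary structure represented as a tree structure into a sequence, then aligns and compares such sequences to compute the distance between secondary structures.
[0036]
The numerical values obtained in step b) are compared to obtain the difference in higher-order structure as a numerical difference (step c). When the values obtained in step c) show a significant difference, the mutated position in the mutant base sequence can be identified as part of a functional site of the base sequence to be analyzed (step d).
[0037]
In the present invention, when the degree of change from the structure of the base sequence to be analyzed caused by a mutation exceeds a significant difference, the mutated site is judged to be a functional site. Significance is judged, for example, as follows.
First, the distribution of inter-structure distances produced by mutating the base sequence to be analyzed is assumed to be normal. By testing for significance based on its mean and standard deviation, the mutated positions that produce a significant difference can be found. The step of identifying mutations that cause a significant difference in this way is shown in the examples. In the examples, a difference was judged significant at a significance level of 30% or 10%. Because the invention expresses structural differences numerically, the structural differences caused by mutating each base can be compared quantitatively.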
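[0038]
SNPs in a functional site are predicted to cause some functional abnormality. Mutations that cause functional abnormality in RNA include mutations that are not tolerated in evolution. Such mutations may in many cases cause lethal abnormalities. Finding the functional sites of an RNA therefore provides important information for classifying SNPs by their effect on physiological function. Conversely, SNPs at sites where the change does not exceed a significant difference are predicted not to cause a harmful phenotype.
[0039]
The present invention provides a program for carrying out the above method for identifying RNA functional sites. The program consists of source code that causes a computer to execute the following steps. The program can be stored on a computer-readable recording medium.
a) a step of artificially generating, from the base sequence to be analyzed of the RNA whose functional sites are to be identified, at least one mutant base sequence carrying a mutation;
b) a step of quantifying the higher-order structure of the RNA having the base sequence to be analyzed and that of the RNA having the mutant base sequence;
c) a step of comparing the numerical values obtained in b) to obtain the difference in higher-order structure as a numerical difference; and
d) a step of identifying the mutated position in the mutant base sequence as part of a functional site of the base sequence to be analyzed when the values show a significant difference in c).
[0040]
In step a), mutant base sequences are generated from the base sequence entered as the base sequence to be analyzed. Mutant base sequences are as described above. For example, for every base of the entered sequence, base sequences in which that base is replaced by each of the three bases different from the original are generated as mutant base sequences. A program for generating such mutant base sequences is shown in the examples.
[0041]
FIG. 3 shows a flowchart of the mutant sequence generation program used in the examples for introducing single-base substitutions. This program treats every base of the sequence as a potential SNP and, for each base, creates the sequences in which that base is replaced by each of the other three bases and stores them in memory.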
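The flowchart of FIG. 3 is not reproduced in this text, so the following is only a sketch of what such a mutant sequence generator might look like, not the program used in the examples. The name `single_substitutions` and the lowercase a/c/g/u alphabet are assumptions.

```python
BASES = "acgu"


def single_substitutions(sequence):
    """Yield (position, original, substitute, mutant) for every
    single-base substitution; position is 1-based from the 5' end."""
    sequence = sequence.lower()
    for index, original in enumerate(sequence):
        if original not in BASES:
            raise ValueError(f"unexpected base {original!r} at {index + 1}")
        for substitute in BASES:
            if substitute != original:
                mutant = sequence[:index] + substitute + sequence[index + 1:]
                yield index + 1, original, substitute, mutant


mutants = list(single_substitutions("gca"))
print(len(mutants))  # 9
print(mutants[0])    # (1, 'g', 'a', 'aca')
```

A sequence of length L always yields 3 × L mutants, which is the 187 × 3 = 561 count for the 187-base U2 snRNA sequence.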
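[0042]
Known algorithms such as those described above can be applied to steps b) and c). Programs that actually analyze base sequences on a computer based on these algorithms are also known (M. Zuker & P. Stiegler, Nucleic Acids Research 9, 133-148, 1981). Many programs for computing RNA higher-order structure are known, and any of them can be used in the present invention. A distance function suited to quantifying the similarity of higher-order structures is defined according to the structure computation program used.
[0043]
The present invention additionally provides a system for implementing the above method for identifying RNA functional sites. That is, the invention relates to a system for identifying RNA functional sites comprising the following means.
a) means for artificially generating, from the base sequence to be analyzed of the RNA whose functional sites are to be identified, at least one mutant base sequence carrying a mutation;
b) means for quantifying the higher-order structure of the RNA having the base sequence to be analyzed and that of the RNA having the mutant base sequence;
c) means for comparing the numerical values obtained in b) to obtain the difference in higher-order structure as a numerical difference; and
d) means for identifying the mutated position in the mutant base sequence as part of a functional site of the base sequence to be analyzed when the values show a significant difference in c).
[0044]
The system can be realized, for example, by having a computer execute the program of the invention. For example, means a) consists of means for entering the base sequence to be analyzed into the computer running the program, means for mutating the given sequence according to the program to generate mutant base sequences, and means for storing the generated mutant base sequences.
[0045]
The base sequence to be analyzed can usually be read into the system from a keyboard or from various recording media. It can also be read over a network from another computer or from a networked storage device. The base sequence information produced by a sequencer can even be read directly into the system as the base sequence to be analyzed.
[0046]
If the sequence read in is DNA information, it can be converted automatically to RNA base sequence information. That a sequence is DNA can be confirmed by detecting the presence of base t. Zuker's structure computation program, for example, includes a step that converts a DNA base sequence to an RNA base sequence before computing the structure. However, conversion to RNA base sequence information is not essential as long as it does not affect the results of the subsequent higher-order structure analysis.
[0047]
The control unit of the computer running the program mutates the given base sequence to be analyzed according to the program to generate mutant base sequences. The way mutations are introduced can be set freely as a parameter. In the examples, for instance, every base of the base sequence to be analyzed is mutated. Alternatively, the program can be modified to mutate only specific bases, for example to evaluate the effect of particular SNPs in the sequence on functional sites. The mutant base sequences generated by the control unit are stored in the memory unit.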
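[0048]
In the system, means b), the means for quantifying the higher-order structure of the RNA having the base sequence to be analyzed and that of the RNA having the mutant base sequence, can be configured, for example, as follows. The control unit of the computer running the program reads the base sequence information generated by means a) from the memory unit and, based on the algorithm given as the program, functions as means for predicting and quantifying the higher-order structure. The resulting values are stored in the memory unit.
[0049]
Means c), the means for comparing the values obtained by means b) to obtain the difference in higher-order structure as a numerical difference, can be configured, for example, as follows. The control unit reads in turn the values corresponding to the higher-order structures stored in the memory unit and functions as means for comparing them with the value generated from the base sequence to be analyzed. The comparison results are stored in the memory unit as numerical values.
[0050]
Means d), the means for identifying the mutated position in the mutant base sequence as part of a functional site of the base sequence to be analyzed when the values show a significant difference in means c), can be configured, for example, as follows. The control unit reads in turn the comparison values stored in the memory unit, compares them with the criterion given as the program, and functions as means for judging whether a significant difference has been found. The mutated site in a mutant base sequence of a pair judged to differ significantly is identified as a functional site in the RNA having the base sequence to be analyzed.
[0051]
Based on the judgment results, the system can output information such as the following.
1. Bases of the base sequence to be analyzed identified as functional sites, and their significance levels (Table 1).
2. Bases of the base sequence to be analyzed estimated to have little effect on higher-order structure, and their significance levels.
3. The level of change in higher-order structure when each base of the base sequence to be analyzed is mutated (for example, FIG. 7; this information can be output either as tabular numerical data or as a graph).
4. The higher-order structures themselves formed by the base sequence to be analyzed and by the mutant base sequences mutated at each base (for example, FIG. 6 and FIG. 10). For example, secondary structures are obtained when Zuker's algorithm is used, and tertiary structures when Insight is used.
5. The representation of higher-order structure obtained in the step of quantifying the higher-order structures formed by the base sequence to be analyzed and by the mutant base sequences (for example, FIG. 6 and FIG. 10). For example, with the method used in the present invention, which quantifies a secondary structure by representing it as a tree structure, the tree representation of the higher-order structure formed by each mutant base sequence can be output. Consulting this kind of information makes it possible to see how a mutation affected the structure.
[0052]
This information can be output in any form, such as on a screen or on paper. When the analysis system is operated on a network, results can be output to the screen of a different computer over the network, or transmitted to a different computer as a data file.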
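[0053]
FIG. 1 shows the hardware configuration of the system of the invention, and FIG. 2 is a flowchart of the method of the invention.
(1) An RNA base sequence is entered from the input unit (step 1).
(2) Based on the mutant sequence generation program on the hard disk, the control unit creates the set of sequences in which a mutation has been introduced at each base of the sequence and stores it in memory (step 2).
(3) Based on any higher-order structure computation program, the control unit computes the higher-order structures of the set of sequences stored in memory and stores them in memory in turn (step 3).
(4) Based on the structure distance computation program, the control unit uses a distance function for quantifying the similarity of higher-order structures to compute the distances between the higher-order structures stored in memory (step 4).
(5) In this way, the degree to which the original higher-order structure changes when each base of the entered sequence is mutated is quantified, and the result is shown on the display unit (step 5).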
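Steps 1 to 5 can be summarized as a small driver function. The sketch below is hypothetical: `scan_sequence` is not a name from the specification, and the structure predictor and the distance function are passed in as parameters because the text allows any higher-order structure program and a matching distance function. The specification does not say how the three substitutions at one position are combined into a single value, so the sketch returns all three distances per position.

```python
def scan_sequence(sequence, predict_structure, structure_distance):
    """Return {position: [distances for the three substitutions]},
    with positions counted from 1 at the 5' end."""
    bases = "acgu"
    sequence = sequence.lower()                     # step 1: input
    reference = predict_structure(sequence)         # step 3 (original)
    changes = {}
    for index, original in enumerate(sequence):     # step 2: mutate
        distances = []
        for substitute in bases:
            if substitute == original:
                continue
            mutant = sequence[:index] + substitute + sequence[index + 1:]
            structure = predict_structure(mutant)   # step 3 (mutant)
            distances.append(structure_distance(reference, structure))  # step 4
        changes[index + 1] = distances              # step 5: report
    return changes


# Toy stand-ins so the sketch runs; these are NOT a folding algorithm.
def toy_structure(seq):
    return seq.count("g") + seq.count("c")


def toy_distance(a, b):
    return abs(a - b)


print(scan_sequence("gau", toy_structure, toy_distance))
# {1: [1, 0, 1], 2: [1, 1, 0], 3: [0, 1, 1]}
```

In real use, `predict_structure` would wrap a secondary structure prediction program and `structure_distance` would compare the resulting structures.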
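[0054]
[Examples]
The present invention is described concretely below by way of examples, but is not limited to them. All documents cited in this specification are incorporated herein as part of this specification.
[0055]
As an example of the invention, the base sequence of human U2 snRNA was used to predict its functional sites and the importance of SNPs without any comparison with base sequence information from other species. U2 snRNA is an RNA that carries no protein-coding information and is thought to play an important role in intron splicing.
SEQ ID NO: 5 shows the base sequence of human U2 snRNA. From this sequence, sequences were first created in which base g at position 12 was artificially replaced with a (SEQ ID NO: 6) and base g at position 157 was replaced with a (SEQ ID NO: 7).
FIG. 4 shows the secondary structures computed from these three base sequences with Zuker's algorithm. As FIG. 4 makes clear, the mutation at position 12 (2) greatly changes the overall secondary structure, whereas the mutation at position 157 (3) causes almost no structural change.
In the examples below, this process is carried out for every substitution at every base, the structural change in each case is quantified, and functional sites and SNPs are evaluated on that basis.
[0056]
FIG. 2 is a flowchart of the RNA functional site prediction method according to the invention.
In step 1, the base sequence to be analyzed is entered. Here the base sequence of human U2 snRNA (SEQ ID NO: 1) is given as input.
In step 2, mutated base sequences are created. In this example the mutations introduced are limited to single-base substitutions. There are three possible single-base substitutions for each base, so, as shown in FIG. 5, sequences in which each base is replaced by each of the three other bases are created for every base. Because the U2 snRNA sequence consists of 187 bases, a total of 187 × 3 = 561 base sequences are created. The flowchart of the mutant sequence generation program used is shown in FIG. 3.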
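[0057]
In step 3, the higher-order structure is computed for every base sequence created. In this example, secondary structure is taken as the higher-order structure, and Michael Zuker's secondary structure prediction algorithm was used as the structure computation program (M. Zuker & P. Stiegler, Nucleic Acids Research 9, 133-148, 1981). Zuker's secondary structure prediction program, given a base sequence and energy parameters, exhaustively searches for the RNA secondary structure that minimizes the energy.
[0058]
In step 4, the program for computing the distance between secondary structures used an algorithm, illustrated in FIG. 6 and FIG. 10, that converts secondary structures represented as tree structures into sequences and then aligns and compares them to compute the distance between secondary structures. Here a tree structure is defined recursively as either a single node or a node that has multiple tree structures as children ordered from left to right. The root node of a tree structure is the only node in the tree that is not a child of any node. The rank of a node is the number of child nodes directly below it. Each node of the tree corresponds to a loop structure of the RNA secondary structure, and each path connecting nodes corresponds to a stem structure.
A tree structure can be represented uniquely by a sequence listing the ranks of all its nodes in order, starting from the root node, from parent to child and from left to right. This sequence representation carries the structural features of the RNA secondary structure as a numerical sequence, and comparison of secondary structures reduces to comparison of these sequences.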
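The conversion from structure to rank sequence might be sketched as follows. Several points are assumptions rather than statements of the specification: the input is in dot-bracket notation, directly stacked base pairs are merged into one stem, the unpaired exterior is taken as the root loop, and "parent to child, left to right" is read as a depth-first preorder. The names `loop_tree` and `rank_sequence` are hypothetical.

```python
def loop_tree(dot_bracket):
    """Build a tree of loops (nested lists); each stem becomes an edge."""
    root = {"children": [], "unpaired": 0}
    stack = [root]
    for symbol in dot_bracket:
        if symbol == "(":
            pair = {"children": [], "unpaired": 0}
            stack[-1]["children"].append(pair)
            stack.append(pair)
        elif symbol == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced structure")
            stack.pop()
        elif symbol == ".":
            stack[-1]["unpaired"] += 1
        else:
            raise ValueError(f"unexpected symbol: {symbol!r}")
    if len(stack) != 1:
        raise ValueError("unbalanced structure")

    def as_loop(loop):
        children = []
        for pair in loop["children"]:
            # Walk down directly stacked pairs to the innermost pair of
            # the stem; the loop it closes is the child node.
            while len(pair["children"]) == 1 and pair["unpaired"] == 0:
                pair = pair["children"][0]
            children.append(as_loop(pair))
        return children

    return as_loop(root)


def rank_sequence(tree):
    """List node ranks (number of direct children) in preorder."""
    ranks = [len(tree)]
    for child in tree:
        ranks.extend(rank_sequence(child))
    return ranks


print(rank_sequence(loop_tree("((..))")))            # [1, 0]
print(rank_sequence(loop_tree("(((..))..((..)))")))  # [1, 2, 0, 0]
```

A single hairpin gives the root loop one child, while a structure whose stem opens into a loop with two further hairpins gives a node of rank 2, which is how the numerical sequence carries the branching pattern.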
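[0059]
The sequences are compared by dynamic programming.
The distance d(n,m) between sequences Sa(n) and Sb(m) is defined as follows. As in Formula 1, the cost between two nodes is the absolute value of the difference in their ranks, and the cost when only one of the nodes is present is 1. As in Formula 2, the distance is then the total cost when the sequences are aligned so as to minimize the cost.
[0060]
[Math 1]
(Formula 1)
Costs(n,m) = |Sa(n) - Sb(m)|
Costs(0) = 1

[Math 2]
(Formula 2)
d(n,m) = min{ d(n-1,m-1) + Costs(n,m), d(n,m-1) + Costs(0), d(n-1,m) + Costs(0) }

[0061]
Through this process, the sequences are aligned so that the homologous parts of the two tree structures coincide, and the distance between the non-matching structures is computed as the difference in node ranks, which gives the distance between the two tree structures.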
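The recurrence of Formula 2 with the costs of Formula 1 translates directly into a dynamic programming table. The sketch below assumes boundary conditions that the text does not state, namely d(0,0) = 0 with each unmatched leading element costing Costs(0) = 1; the function name `structure_distance` is hypothetical.

```python
def structure_distance(sa, sb):
    """Alignment distance between two rank sequences.

    d(n,m) = min{ d(n-1,m-1) + |Sa(n) - Sb(m)|,
                  d(n,m-1) + 1,
                  d(n-1,m) + 1 }
    """
    gap = 1  # Costs(0): only one of the two nodes is present
    n, m = len(sa), len(sb)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):      # assumed boundary: all gaps
        d[i][0] = d[i - 1][0] + gap
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + abs(sa[i - 1] - sb[j - 1]),  # Costs(n,m)
                d[i][j - 1] + gap,
                d[i - 1][j] + gap,
            )
    return d[n][m]


print(structure_distance([1, 0], [1, 2, 0, 0]))  # 2
```

In this example the ranks 1 and 0 align at no cost and the two extra nodes of the second sequence each cost 1, so the distance is 2. Identical sequences give a distance of 0, consistent with the statement elsewhere in the specification that a mutation which leaves the structure unchanged has a change distance of zero.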
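[0062]
FIG. 7 is a graph showing, for U2 snRNA, each base position and the degree of structural change when that base is mutated, obtained by applying the method described here. At step 5 of the flowchart in FIG. 2, functional sites are predicted and SNPs evaluated on the basis of this graph. The evaluation procedure is explained below.
[0063]
Let c(i) denote the distance by which the structure changes from the original when a mutation is artificially introduced at each base. The variable i denotes the base position; for U2 snRNA, i = 1, 2, 3, ..., 187. Assuming that the values of c(i) with c(i) ≠ 0 follow a normal distribution, and letting N be the total number of bases i with c(i) ≠ 0, their mean Ca and standard deviation σ are defined as follows.
[Math 3]
(Formula 3)

[0064]
[Math 4]
(Formula 4)

A base i whose c(i) takes one of the following values is judged to be a site that greatly changes the structure, or a site that does not affect the structure.
[0065]
[Math 5]
(Formula 5)

At the 10% significance level, base i does not affect the structure.
[0066]
[Math 6]
(Formula 6)

At the 30% significance level, base i does not affect the structure.
[0067]
[Math 7]
(Formula 7)

At the 30% significance level, base i changes the structure.
[0068]
[Math 8]
(Formula 8)

At the 10% significance level, base i changes the structure.
Of these, sites containing bases that change the structure are judged to be functional sites, and SNPs at such sites are evaluated as likely to interfere with RNA function. SNPs at bases that do not change the structure, on the other hand, are evaluated as having little effect on function.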
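Formulas 3 to 8 are equation images that are not reproduced in this text, so the sketch below rests on stated assumptions. The mean Ca and standard deviation σ are computed over the nonzero c(i) only, dividing by N (whether the original divides by N or N-1 cannot be told from the text). The two upper thresholds, Ca + σ and Ca + 1.65σ, are taken from claims [1] and [2] and mapped to the 30% and 10% levels respectively. The lower thresholds for "does not affect the structure" are not stated anywhere in the text and are left out. The function name `classify_changes` is hypothetical.

```python
import math


def classify_changes(changes, weak=1.0, strong=1.65):
    """changes[k] is c(i) for position i = k + 1.
    Returns (Ca, sigma, {position: level}) for structure-changing bases."""
    nonzero = [c for c in changes if c != 0]
    if not nonzero:
        return 0.0, 0.0, {}
    mean = sum(nonzero) / len(nonzero)
    sigma = math.sqrt(sum((c - mean) ** 2 for c in nonzero) / len(nonzero))
    calls = {}
    if sigma > 0:  # with no spread, no base stands out
        for position, c in enumerate(changes, start=1):
            if c >= mean + strong * sigma:
                calls[position] = "10%"
            elif c >= mean + weak * sigma:
                calls[position] = "30%"
    return mean, sigma, calls


mean, sigma, calls = classify_changes([1, 1, 1, 1, 1, 1, 4, 5])
print(calls)  # {7: '30%', 8: '10%'}
```

With these toy values Ca is 1.875 and σ is about 1.54, so the base with c(i) = 4 passes only the Ca + σ threshold while the base with c(i) = 5 also passes Ca + 1.65σ.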
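[0069]
FIG. 7 plots the values of c(i) together with the mean Ca and the threshold values used as criteria. FIG. 7 shows that mutations at bases 10 to 30, 50 to 60, and 100 to 110, counting from the 5' end, change the structure. These regions are circled in FIG. 7. By contrast, the region from base 150 to 180 undergoes comparatively little structural change when mutated. These results agree with the structural changes observed for the mutations at positions 12 and 157 shown in FIG. 4, indicating that the quantitative evaluation of structural change works correctly.
[0070]
Alternatively, bases showing a given significant difference on mutation can be extracted from the c(i) values for each significance level. Table 1 shows the positions of the bases extracted in this way.
[Table 1]

[0071]
That is, reading the graph in FIG. 7 predicts that mutations (SNPs) at base positions such as 10-30, 50-60, and 100-110 affect function more than mutations (SNPs) at positions 150-180 and are probably more likely to be lethal. Likewise, the region at bases 10-30 appears to be a comparatively flexible structural region, whereas the region at bases 150-180 forms a stable, rigid structure.
Next, these predictions are compared with information obtained from observation of actual RNA base sequences and structures to examine their validity.
[0072]
FIG. 8 shows the interspecies conservation rate at each base position, computed by homology comparison among the base sequences of natural U2 snRNAs from multiple species (25 species including Drosophila, nematode, Arabidopsis thaliana, and Xenopus laevis). The closer the conservation rate is to 1, the more strongly the base tends to be conserved in all species. The figure shows that base positions such as 10-30, 50-60, and 100-110 are very highly conserved, whereas positions 150-180 are not so well conserved. This indicates that mutations at the functional sites inferred from the graph in FIG. 7 have not been tolerated in evolution, and the prediction of the invention that SNPs at these sites would be lethal can be considered valid.
[0073]
Furthermore, biological experiments (Madhani HD, Guthrie C, Annu. Rev. Genet. 28: 1-26, 1994) have shown that, of the four hairpins that make up U2 snRNA (shown in FIG. 4(1)), hairpin I retains its hairpin structure in the early stage of splicing but later opens widely and forms complementary base pairs with U6 snRNA. This agrees with the prediction of the invention that bases 10-30, which correspond to hairpin I, form a flexible structure. Other experiments have shown that an RNA-binding protein binds to hairpin IV (Scherly, D. et al., Nature, 345, 502-506, 1990). This is considered consistent with bases 150-180, which correspond to hairpin IV, forming a comparatively stable structure, and supports the validity of the prediction.
[0074]
Genes encoding human U2 snRNA are known to exist at multiple locations in the DNA. Most of these sequences are homologous, but one gene lacks base U at position 47 from the 5' end (Hausner, T.P., Giglio, L.M. and Weiner, A.M., Evidence for base-pairing between mammalian U2 and U6 small nuclear ribonucleoprotein particles, Genes Dev. 4(12A), 2146-2156, 1990).
Meanwhile, FIG. 7 obtained by this method shows that the structural change distance at base 47 is c(47) = 0. That is, an RNA exists that actually carries a mutation at a position where this method predicts an SNP would not be harmful and that nevertheless retains normal function, so the method is considered useful for evaluating SNPs and other mutations as well.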
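[0075]
Next, an example is shown in which the method was applied to human mitochondrial disease, in which SNPs are thought to be the actual cause of disease.
Mitochondrial disease is one of the more frequently seen hereditary diseases; symptoms appear in muscle and the central nervous system, often with further symptoms such as convulsions and dementia. Patients often carry mutations in mitochondrial tRNA genes, and the symptoms are known to differ according to which tRNA carries the mutation (Schon EA et al., J. Bioenerg. Biomembr. 29: 131-149, 1997).
[0076]
This example deals with the lysine tRNA (tRNA-Lys). The base sequence of tRNA-Lys is shown in SEQ ID NO: 8. In tRNA-Lys, a g-to-a mutation at position 19 from the 5' end and a g-to-a mutation at position 48 have each been identified in patients with mitochondrial disease (Verma, A. et al., Pediatric Research 42(4): 448-454, 1997; Tiranti, V. et al., Neuromuscular Disorders 9(2): 66-71, 1999).
FIG. 9 is a graph of the degree of structural change when each base is mutated, computed by the process of FIG. 2 exactly as for U2 snRNA. According to FIG. 9, the structural change distance at each of the mutation positions identified in patients with mitochondrial disease is 3, and SNPs at these positions are judged not to be safe.
The method was thus confirmed to be effective also for predicting SNPs that actually cause disease.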
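[0077]
[Effects of the invention]
The present invention can identify the functional sites of a base sequence on the basis of that single base sequence to be analyzed. With the known method of identifying sequences conserved between species as functional sites, for example, the base sequences of homologous genes from multiple species had to be determined. Moreover, identifying functional sites also calls for sequences that differ to some degree, so accumulating the base sequence information needed for the analysis is difficult. By contrast, the method of the invention, which identifies functional sites from a single base sequence, is far easier to carry out.
[0078]
The invention is also useful as a method for evaluating the effect of single-base substitutions on functional sites. Now that the structure of the genome is being elucidated, predicting the effect of SNPs on RNA functional sites is an important research topic. Evaluating the effect on RNA functional sites of human SNPs, which amount to an enormous volume of information, requires a method with high analytical capacity. According to the invention, the effect of a specific SNP on RNA functional sites can be evaluated easily, regardless of whether information on homologous genes from other species is available. Furthermore, SNPs can be classified by the magnitude of their effect on RNA function on the basis of the evaluation results.
[0079]
Alternatively, by substituting each of the bases of a gene in turn and evaluating the effect of each substitution on higher-order structure, functional sites on the RNA can be located at high resolution. By overlaying information on newly identified SNPs onto the RNA functional sites revealed by the analysis, SNPs that may affect functional sites can be readily identified.
[0080]
[Sequence listing]
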
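[Brief description of the drawings]
[FIG. 1] Schematic diagram of the hardware configuration of the invention.
[FIG. 2] Flowchart of RNA functional site prediction according to the invention.
[FIG. 3] Flowchart showing an example of the program used in the invention to introduce mutations at bases.
[FIG. 4] Secondary structures computed from the base sequences of human U2 snRNA (1) and of the sequences with base 12 changed from g to a (2) and base 157 changed from g to a (3). The positions of hairpins I to IV of human U2 snRNA are also shown.
[FIG. 5] Diagram of the process of artificially introducing single-base substitution mutations in the invention.
[FIG. 6] Schematic diagram of an example of the method used in the invention for computing the distance between higher-order structures.
[FIG. 7] Graph showing the result of predicting the functional sites of human U2 snRNA by the invention. The vertical axis shows the distance of secondary structure change caused by mutation, and the horizontal axis shows base position with the 5' end as 1. Plots corresponding to bases where mutation produced a significant difference, and to bases where it did not, are circled.
[FIG. 8] Graph showing the conservation rate between species at each base position, obtained by homology comparison among U2 snRNA base sequences from multiple species.
[FIG. 9] Graph showing the result of applying the invention to human mitochondrial tRNA-Lys. The vertical axis shows the distance of secondary structure change caused by mutation, and the horizontal axis shows base position with the 5' end as 1.
[FIG. 10] Secondary structures obtained when bases 12 and 157 of human U2 snRNA are mutated, and the tree structures corresponding to them.

[0001]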
[Technical field of the invention]
The present invention relates to a method for identifying a functional site of RNA.
[0002]
[Prior art]
RNA is a chain molecule of linked ribonucleotides. In living organisms, the bases of ribonucleotides are of four types: A (adenine), U (uracil), C (cytosine), and G (guanine). RNA in vivo plays the following important roles, mainly in protein synthesis.
mRNA (messenger RNA): in protein synthesis, specifies the amino acid sequence of a protein based on the base sequence information of the genome.
tRNA (transfer RNA): in protein synthesis, carries the amino acids that make up a protein according to the genetic code of the mRNA.
[0003]
In protein synthesis, it is known that RNA base sequence information not only encodes an amino acid sequence, but RNA itself has a complex function by forming a higher order structure. A region necessary for a function other than the function of coding an amino acid sequence in RNA is called a functional region. For example, it has been pointed out that the secondary structure of mRNA in the translation initiation region is a factor controlling the translation process into protein. Such a function is different from the genetic code for encoding an amino acid sequence.
[0004]
Moreover, RNA having a specific structure has an action like an enzyme that recognizes and cleaves a base sequence. RNA having such an action is called a ribozyme. The role played by ribozymes in splicing is also emphasized. In eukaryotic cells, exons constituting a gene are selectively linked after the base sequence of the genome is transcribed to generate mature mRNA. This process is called splicing. Both the three-dimensional structure of the RNA constituting the ribozyme and the three-dimensional structure of the RNA cleaved by the ribozyme are important factors affecting splicing. Such a function is also different from the genetic code encoding the amino acid sequence.
[0005]
Identifying, from base sequence data, what functions a sequence may have at the RNA stage is important for medical and pharmaceutical research. One conventional technique for predicting RNA functional sites is to collect and compare the corresponding base sequences from multiple lineages. For example, the base sequences are aligned as shown below and compared to extract the base positions they share. In the following example, the sequences shown in capital letters are extracted as conserved among the four species.
Species 1 ... auuGCCggAUA ...
Species 2 ... acgGCCauAUA ...
Species 3 ... ucuGCCguAUA ...
Species 4 ... uucGCCuaAUA ...
[0006]
A site containing the extracted base sequence is evolutionarily conserved and is therefore considered likely to be a functional site. Because of codon degeneracy, a translated amino acid sequence may be conserved even where the base sequences differ. Conversely, a region whose RNA base sequence is highly conserved irrespective of the amino acid sequence is likely to be important as an RNA functional site.
This method requires the accumulation of a large amount of RNA base sequence information across a large number of species. Therefore, the biological experiment for obtaining these sequences requires a great deal of time and effort.
[0007]
Attempts have been made to find RNA functional sites not by simple sequence comparison but by predicting the higher-order structure of RNA from base sequence information. For example, Zuker established an algorithm for predicting the three-dimensional structure of RNA from base sequence information (M. Zuker & P. Stiegler, Nucleic Acids Research 9, 133-148, 1981).
[0008]
Shapiro represented RNA as a tree structure composed of stems and loops, and expressed the difference between two tree structures as a numerical difference (Shapiro, B.A. and Zhang, K., Comparing multiple RNA secondary structures using tree comparisons, CABIOS, 6(4), 309-318, 1990). Hofacker et al. further used a method that evaluates the distance between RNA secondary structures based on the difference between graphs (Automatic detection of conserved RNA structure elements in complete RNA virus genomes. Nucleic Acids Res. Aug 15;26(16):3825-36, 1998).
[0009]
If the three-dimensional structure of RNA is compared using such a technique, the functional site should be able to be estimated more specifically than in the analysis method based on simple comparison of base sequences. However, in order to compare three-dimensional structures, it is still necessary to actually obtain base sequence information. The known method for estimating the functional site of RNA is a method for finding a common structure by comparing between actually existing base sequences and identifying it as a functional site. That is, it has the same problem as the method of comparing base sequences in that it requires accumulation of RNA base sequence information for estimation of functional sites.
[0010]
[Problems to be solved by the invention]
An object of the present invention is to provide a novel method for identifying a functional site of RNA that does not require accumulation of RNA base sequence information.
[0011]
[Means for Solving the Problems]
The present inventors paid attention to the relationship between RNA functions and higher-order structures. RNA is known to form a characteristic higher order structure based on complementary binding between bases. And it is thought that the function of RNA is maintained by the higher order structure which RNA forms.
On this basis, the present inventors reasoned that by introducing mutations into the base sequence to be analyzed and using the degree of change in the higher-order structure of the resulting mutant base sequences as an index, the functional importance of the mutated site could be predicted. They then confirmed that the RNA functional sites predicted on this principle agree with functional sites confirmed by known methods, and completed the present invention.
[0012]
Furthermore, the present inventors have found that this analysis method can be applied to the evaluation of the effect of mutation caused by single base substitution on the functional site of RNA. More specifically, it is possible to know the influence of a single base on the structure and function of RNA by limiting the mutation to a single base substitution and measuring the degree of change in the higher order structure at that time. Based on such analysis, single base substitution can be evaluated.
[0013]
That is, the present invention relates to the following method for identifying RNA functional sites, a computer program, and an apparatus therefor. The present invention also relates to a method, applying the method for identifying RNA functional sites according to the invention, for identifying single nucleotide polymorphisms that affect RNA function.
[0014]
  The present invention[1]
A base sequence for analysis of RNA to identify a functional site and a numerical sequence obtained by quantifying the secondary structure of the RNA, and all mutant base sequences that can be artificially generated for the base sequence for analysis A memory unit, a control unit, and a position i of a mutated base in an identified functional site And at least a display unit that displays the difference C (i) in the secondary structure between the RNA whose functional site should be identified and the RNA of the mutant base sequence,System for identifying functional sites of RNABecause
a)A single nucleotide polymorphism obtained by substituting each of the constituent bases with three different bases in order from the base sequence to be analyzed read out from the memory unit.Means for artificially generating mutant nucleotide sequences,
b)Reading the base sequence to be analyzed and the mutant base sequence from the memory unit,Base sequence to be analyzedOr the mutant base sequenceHaveeachOf RNA2Next structureA functional module that expresses each predicted secondary structure as a tree structure composed of a loop structure and a stem structure, and uniquely converts it into a numeric string based on the nodes and ranks of the tree structure; Using the analysis target nucleotide sequence or theRNA with a mutated base sequence2Means for quantifying the following structure,
c) obtained in b)Numeric string for the base sequence to be analyzed ( Sa (n) ) And each mutation base sequenceNumericColumn ( Sb (m) )TheSequentiallyCompared to,Align homologous parts so that they match, and use the following formula to calculate the distance of the non-matching parts by taking the difference in the ranks of the nodes and calculating 2Differences in the next structure as numerical differencesRespectivelyget,Distance between structures (d (n, m)) Calculatemeans,
(Formula 2)
  d (n, m) =
    min {d (n-1, m-1) + Costs (n, m), d (n, m-1) + Costs (0), d (n-1, m) + Costs (0)}
However, in the above formula 2,
(Formula 1)
  Costs (n, m) = | Sa (n) −Sb (m) |
  Costs (0) = 1
It is.
d) Inter-structure distance obtained as the difference in secondary structure in c)(d (n, m))Defined byThe difference in secondary structure C (i) between the RNA of the base sequence to be analyzed and the RNA of each mutant base sequence (where variable i represents the position i of the base having a mutation in the mutant base sequence). The standard deviation σ is larger than the average value Ca of the difference C (i) in the secondary structureIn some cases, it has a mutation in the mutant base sequenceBase position iMeans for identifying as a part of the functional site of the base sequence to be analyzed,and
e) means for outputting to the display unit the base position i of the functional site identified in d) and the difference C (i) in the secondary structure;
The system for identifying a functional site of RNA, wherein the control unit includes each of the means a) to e).
  In addition, the present invention provides:
[2] The system for identifying a functional site of RNA according to [1], wherein the means d) is a means for identifying the position i of the base having a mutation in the mutant base sequence as a part of the functional site of the base sequence to be analyzed when the difference C(i) is larger than the average value Ca of C(i) by 1.65σ or more.
[0015]
DETAILED DESCRIPTION OF THE INVENTION
The present invention relates to a method for identifying a functional site of RNA comprising the above steps a) to d).
In the present invention, a functional region refers to a region of RNA that is necessary for a function other than encoding an amino acid sequence. For example, cases are known in which the mechanism controlling the expression of genetic information is supported by a higher-order structure formed by RNA.
[0016]
Specifically, it has been shown that the heat shock response of E. coli is controlled by the stability of the secondary structure of the translation initiation region in the mRNA of σ³², one of the σ subunits of RNA polymerase. This control mechanism is an example in which the secondary structure of the transcribed mRNA affects the AUG translation initiation codon and the ribosome binding site.
[0017]
Furthermore, mitochondrial diseases, in which symptoms appear in skeletal muscle and the central nervous system, are often attributed to a base substitution at a specific position in a tRNA gene on mitochondrial DNA. Why a point mutation in DNA causes mitochondrial abnormalities, and why the clinical symptoms differ depending on the type of point mutation, has not been clarified. It has been suggested that point mutations cause structural changes in tRNA and reduce its codon-pairing ability on the ribosome. The present invention can be a useful tool for predicting cases such as these, in which a structural change in RNA directly gives rise to a trait.
[0018]
In addition, the stability of the higher order structure of mRNA is closely related to the efficiency of translation into proteins. Therefore, it is possible to control the translation efficiency of a specific gene by predicting according to the present invention which mutation in the base sequence of mRNA is important for maintaining the higher order structure of mRNA. For example, there is a possibility that the present invention can be applied as a very important technique such as suppression of RNA virus translation.
[0019]
Whereas the RNA genetic code depends on the primary structure (ie, base sequence) of RNA, functions other than the RNA genetic code depend on the higher order structure of RNA. Therefore, the functional site can be considered as a site necessary for maintaining a specific higher order structure. That is, a functional site can also be defined as a region containing a base sequence necessary for RNA to maintain a specific higher order structure.
[0020]
The functional site of the present invention can also be defined as follows. RNA is considered to play an important role in vivo through specifically binding to low molecular weight substances such as nucleosides and high molecular weight substances such as proteins and catalyzing specific reactions. Therefore, a functional site of RNA can also be defined as a structural site indispensable for such a role. RNA having a disorder (mutation) at a functional site cannot exhibit its original in vivo function and may cause some abnormality.
[0021]
In step a) above, the RNA whose functional site is to be identified can be any RNA. In the present invention, the base sequence of RNA whose functional site is to be identified is referred to as a base sequence to be analyzed. Therefore, for example, base sequence information as shown below can be used as an analysis target base sequence.
Base sequence information based on the actual base sequence of cDNA obtained,
Information on the base sequences that make up the hypothetical gene predicted by genome analysis,
Base sequence information of artificially constructed genes,
Base sequence information of the gene encoding the fusion protein, and
Nucleotide sequence information of genes encoding chimeric proteins
[0022]
When obtaining a base sequence from DNA, the analysis of the present invention may be carried out as it is, or the analysis may be carried out after making the base sequence of RNA. In general, when expressing the base sequence of RNA, the corresponding base sequence of DNA is often used. Also in this specification, the base sequence of RNA may be expressed by the base sequence of DNA that transcribes it. As long as the base sequence information is derived from mature RNA, the base sequence information can be converted into the base sequence of RNA to be the analysis target base sequence of the present invention. The base sequence of DNA is converted to the base sequence of RNA by replacing base t with base u.
[0023]
The present invention includes a step of artificially generating at least one mutant base sequence having a mutation in the base sequence with respect to the base sequence to be analyzed. In the present invention, the mutant base sequence refers to a base sequence containing an arbitrary mutation in the base sequence to be analyzed. The method for imparting mutation and the kind of mutation in the present invention are not limited. More specifically, a mutated base sequence can be generated by giving a mutation by base substitution, deletion, insertion or addition to the base sequence to be analyzed.
[0024]
The mutation may be introduced into a single base, or a plurality of bases may be targeted. Furthermore, the mutant base sequence is not limited to one containing a single mutation by base substitution, deletion, insertion, or addition; a mutant base sequence into which several types of mutation are introduced can also be generated. When a mutation is given by substitution, insertion, or addition of a base, any of one to four types of bases can be used. For example, up to three substitutions are possible for a given base. In the present invention, all three substitutions can be generated, or the base can be substituted with only one or two of the other bases.
[0025]
In particular, three types of base sequences generated by substituting one specific base with three different bases are desirable mutant base sequences in the present invention. By generating such a mutant base sequence for all bases constituting the base sequence to be analyzed, all structures that can be taken by the base sequence to be analyzed can be compared. Mutation by substitution is not accompanied by a change in RNA length. Therefore, the influence which each base has on the structure can be compared under the condition without the influence of other factors.
[0026]
In the present invention, a preferable mutated base sequence is a base sequence containing a mutation limited to a smaller region. The purpose of the analysis of the present invention is to identify functional sites of RNA. Since the present invention uses the change in the three-dimensional structure caused by mutation as an index, it is desirable to arrange the mutation site in the mutant base sequence in a more limited region. By limiting the mutation site, as a result, it can be expected to identify the functional site of RNA more accurately.
[0027]
For example, a plurality of mutant base sequences obtained by mutating one or several specific bases in the base sequence to be analyzed are preferable mutant base sequences in the present invention. The number of bases to be mutated is at least 1, for example 1 to 10, alternatively 1 to 5, or further 1 to 2.
[0028]
In the present invention, a base known to be the site of a single nucleotide polymorphism (hereinafter abbreviated as SNPs) can be selected as the specific base to be mutated in the base sequence to be analyzed. SNPs are polymorphisms arising from single-base insertions, deletions, and substitutions. Because they are the most frequent type of polymorphism in the human genome, SNPs have attracted attention as research targets for the diagnosis and treatment of disease.
In genetics, among differences in nucleic acid base sequence, a variation present at a frequency of 1% or more in a given population is specifically called a polymorphism. A population here means a group distinguished by geographical isolation or subspecies. For example, even if a variation occurs in less than 1% of Japanese people, it is regarded as a polymorphism rather than a mutation if it is found at a frequency of 1% or more in another race.
[0029]
A mutation in the base sequence of a gene region may lead to abnormal translation of the gene through splicing abnormalities or changes in the three-dimensional structure of the mRNA. SNPs are presumed to occur at a rate of about one per 300 to 600 bases in the human genome, and a huge number of SNPs are in fact being revealed one after another. These SNPs should include those that affect the conformation of RNA, those that change the translated amino acid sequence, and those that lead to no genetic change. The functional site identification method of the present invention makes it possible to easily pick out, from a large number of SNPs, those that may affect a functional site of RNA. Bases known to be SNPs are therefore among the desirable bases to mutate in the present invention.
[0030]
Next, for the base sequence to be analyzed and for each mutant base sequence generated from it, the higher-order structure of the RNA having that base sequence is predicted and quantified. Here, higher-order structure means secondary structure and structures of higher order. In general, secondary structure refers to a structure that can be drawn in a plane and is composed of stems and loops formed by pairing between complementary bases in the base sequence of the RNA. Tertiary structure, on the other hand, refers to the three-dimensional structure formed when the parts of the RNA secondary structure interact in space. The tertiary structure depends on the secondary structure. Not every change in tertiary structure can be explained by a change in secondary structure alone, but a change in secondary structure is in most cases accompanied by a change in tertiary structure. Predicting a change in secondary structure therefore amounts to predicting a change in tertiary structure.
[0031]
A method for predicting a three-dimensional structure based on the base sequence of RNA is known. For example, Zucker's secondary structure prediction program uses an algorithm that comprehensively finds the secondary structure of RNA that minimizes energy based on a given base sequence and energy parameters.
[0032]
In the method of Shapiro et al., RNA is expressed as a tree structure composed of stems and loops, taking the length of the base-paired regions into account. They define operations called tree edit operations, and take as the distance between two structures the minimum number of tree edit operations needed to convert one tree structure into the other (Shapiro, B.A. and Zhang, K., Comparing multiple RNA secondary structures using tree comparisons, CABIOS, 6 (4), 309-318, 1990).
In addition, Hofacker et al. evaluated each base of an RNA secondary structure by the number of base pairs enclosing it, expressed the whole structure as a two-dimensional graph called a mountain plot, and evaluated the distance between structures by the difference between the graphs (Automatic detection of conserved RNA structure elements in complete RNA virus genomes. Nucleic Acids Res. Aug 15; 26 (16): 3825-36, 1998).
[0033]
As an algorithm for predicting the secondary structure of RNA, the method of the Vienna RNA package is also known in addition to the method of Zucker (Hofacker et al., Monatsh. Chem. 125: 167-188, 1994). Like Zucker's method, this method uses dynamic programming, and it improves prediction accuracy by taking into account the probability of base-pair formation.
These secondary structure prediction algorithms can predict the secondary structure with high reliability if the RNA molecule is a short RNA molecule of about several hundred bases. The calculation time required for the prediction is usually only about a few seconds and can be used in the present invention.
[0034]
Furthermore, software for calculating the tertiary structure of RNA is also known. For example, Insight and Amber calculate three-dimensional structures using molecular dynamics (MD) (Pearlman et al., AMBER 4.0, University of California, San Francisco, 1991). MD is a technique for numerically solving the many-body equations of motion by a finite-difference method. Tertiary structure analysis using such software includes a step in which the three-dimensional structure of the target substance is first measured by an experimental method such as X-ray crystallography or NMR to serve as the initial structure, atoms or molecules are added to it, and the calculation is then carried out using MD.
To apply these tertiary structure prediction methods to the present invention, the tertiary structure of the target RNA must be determined experimentally, for example by crystallizing the RNA or by NMR. Where this is possible, the method of the present invention can be carried out in the same way as with a secondary structure prediction method, by calculating the tertiary structure when a mutation is introduced and expressing the change in structure numerically.
[0035]
The higher order structures predicted in the present invention are quantified and compared. For example, higher order structures can be compared numerically by the method described above. More specifically, the comparative analysis method focusing on the tree structure constituting the secondary structure of RNA as shown in the examples is preferred as a method for quantification and comparison of higher order structures in the present invention. This method is based on an algorithm that calculates the distance between secondary structures by converting secondary structures expressed as a tree structure into arrays, aligning them, and comparing them.
[0036]
By comparing the numerical values obtained in the above step b), the difference in higher order structure can be obtained as the difference in numerical values (step c). Further, when a significant difference is found in the numerical value obtained in step c), a part having a mutation in the mutant base sequence can be identified as a part of the functional site of the base sequence to be analyzed. (Step d).
[0037]
In the present invention, when the degree of change from the structure of the base sequence to be analyzed exceeds a significant difference when a mutation is added, the site that has been mutated is determined to be a functional site. In the present invention, the significance is determined, for example, as follows.
First, it is considered that the distribution of the distance between structures caused by giving a mutation to the base sequence to be analyzed is a normal distribution. Then, based on the mean and standard deviation, by testing for a significant difference, it is possible to find a mutation site that gives a significant difference. The process of identifying mutations that cause significant differences in this way is shown in the Examples. In Examples, it was judged that there was a significant difference at a significance level of 30% or 10%. In the present invention, since the difference in structure is expressed as a numerical value, the difference in structure caused by the mutation of each base can be quantitatively compared.
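As a non-limiting illustration, the significance test described above might be sketched in Python as follows. The sketch assumes that a structural distance has already been obtained for each base position and that positions with a distance of 0 are excluded from the statistics. The function name `functional_positions`, the default factor 1.65 (taken from embodiment [2]), and the use of the population standard deviation are assumptions made for illustration; the specification does not state which estimator is used.

```python
from statistics import mean, pstdev

def functional_positions(c, k=1.65):
    """Return 1-based positions whose structural change is significantly large.

    c : list of structural distances, one per base position (c[0] is position 1).
    k : number of standard deviations above the mean used as the threshold
        (assumed value; 1.65 corresponds to the 1.65-sigma criterion).
    """
    # Only positions whose mutation changed the structure enter the statistics.
    nonzero = [v for v in c if v != 0]
    if len(nonzero) < 2:
        return []
    ca = mean(nonzero)          # average value Ca
    sigma = pstdev(nonzero)     # standard deviation (population form: an assumption)
    if sigma == 0:
        return []               # all changes identical: nothing stands out
    threshold = ca + k * sigma
    return [i for i, v in enumerate(c, start=1) if v != 0 and v >= threshold]
```

For example, with distances of 1 at eight positions, 10 at one position, and 0 elsewhere, only the position with distance 10 exceeds the threshold.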
[0038]
SNPs at functional sites are expected to cause some functional abnormality. Mutations that cause functional abnormalities in RNA include mutations that are not evolutionarily acceptable. Mutations that are not evolutionarily acceptable can often cause fatal abnormalities. Thus, finding functional sites in RNA provides important information in classifying SNPs based on their effects on physiological functions. Conversely, SNPs at sites that do not exceed a significant difference are predicted not to cause a harmful phenotype.
[0039]
The present invention provides a program for carrying out the above-described method for identifying a functional site of RNA. The program of the present invention is composed of source code for causing a computer to execute the following steps. The program of the present invention can be stored in a computer-readable recording means that records the program.
a) a step of artificially generating at least one mutant base sequence having a mutation in the base sequence with respect to a base sequence to be analyzed of RNA whose functional site is to be identified;
b) quantifying the higher order structure of RNA having the base sequence to be analyzed and the higher order structure of RNA having the mutant base sequence;
c) comparing the numerical values obtained in b) to obtain a difference in higher order structure as a numerical difference; and
d) a step of identifying a part having a mutation in the mutant base sequence as a part of the functional site of the base sequence to be analyzed when a significant difference is found in the numerical value in c)
[0040]
In the present invention, in step a), a mutant base sequence is generated based on the base sequence input as the analysis target base sequence. The mutated base sequence is as described above. For example, for all bases input as the analysis target base sequence, a base sequence having bases substituted with three types of bases different from the original base is generated as a mutant base sequence. A program for generating such a mutant base sequence is shown in the Examples.
[0041]
FIG. 3 shows a flowchart of the mutant sequence creation program in the case of adding the single base substitution used in the example. In this program, SNPs are assumed for all bases in the base sequence, and for each base, a sequence in which the base is replaced with the other three bases is created and stored in the memory.
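As a non-limiting illustration, the creation of single-base substitution mutants described above might be sketched in Python as follows. The function name `single_substitution_mutants` and the lowercase base alphabet `acgu` are assumptions made for illustration and do not reproduce the program of FIG. 3.

```python
BASES = "acgu"  # assumed RNA alphabet, written in lowercase

def single_substitution_mutants(seq):
    """Yield (position, new_base, mutant_sequence) for every single-base substitution.

    Positions are 1-based. Each base is replaced in turn by each of the
    three bases other than itself, so a sequence of length L yields 3 * L mutants.
    """
    seq = seq.lower()
    for i, original in enumerate(seq):
        for base in BASES:
            if base != original:
                yield i + 1, base, seq[:i] + base + seq[i + 1:]
```

A sequence of 187 bases yields 187 × 3 = 561 mutant sequences with this generator.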
[0042]
In the present invention, a known algorithm as described above can be applied to the above steps b) and c). A program for actually analyzing a base sequence on a computer based on these algorithms is also known (M. Zucker & P. Stiegler Nucleic Acids Research 9, 133-148, 1981). In addition, although many programs for RNA higher-order structure calculation are known, any higher-order structure calculation program can be applied in the present invention. In accordance with the higher-order structure calculation program to be used, an appropriate distance function is defined for quantifying the degree of similarity of the structure of the higher-order structure.
[0043]
In addition, the present invention provides a system for carrying out the above method for identifying a functional site of RNA. That is, the present invention relates to a system for identifying a functional site of RNA comprising the following means.
a) means for artificially generating at least one mutated base sequence having a mutation in the base sequence with respect to the base sequence to be analyzed of RNA whose functional site should be identified;
b) Means for quantifying the higher-order structure of RNA having a base sequence to be analyzed and the higher-order structure of RNA having a mutated base sequence,
c) means for comparing the numerical values obtained in b) to obtain a difference in higher order structure as a numerical difference; and
d) Means for identifying a portion having a mutation in the mutated base sequence as a part of the functional site of the base sequence to be analyzed when a significant difference is found in the numerical value in c)
[0044]
The system for identifying a functional site of RNA according to the present invention can be realized, for example, by causing a computer to execute the program of the present invention. For example, means a) comprises a means for inputting the base sequence to be analyzed into a computer running the program, a means for generating mutant base sequences by giving mutations to the input base sequence according to the program, and a means for storing the generated mutant base sequences.
[0045]
The base sequence to be analyzed can usually be read into the system of the present invention from a keyboard or various recording media. Alternatively, it can be read from another computer or an information storage device on the network via the network. Furthermore, the base sequence information generated by the sequencer can be directly read into the system of the present invention as the base sequence to be analyzed.
[0046]
When the read base sequence is DNA information, it can be automatically converted into RNA base sequence information. Whether the base sequence is DNA can be confirmed by detecting the presence of the base t. For example, Zucker's structural calculation program is equipped with a step for converting the base sequence of DNA into the base sequence of RNA and then subjecting it to structural calculation. However, it does not necessarily need to be converted into RNA base sequence information as long as it does not affect the analysis result of the higher-order structure performed thereafter.
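As a non-limiting illustration, the automatic conversion described above might be sketched in Python as follows. The function name `to_rna` and the normalization to lowercase are assumptions made for illustration.

```python
def to_rna(seq):
    """Return the sequence as RNA, replacing base t with base u.

    The input is treated as DNA when it contains the base t;
    otherwise it is returned unchanged apart from lowercasing.
    """
    seq = seq.lower()
    if "t" in seq:
        return seq.replace("t", "u")
    return seq
```

A sequence that contains no t, such as a stretch of only a, c, and g, passes through unchanged, which is consistent with the statement that conversion is unnecessary where it does not affect the structural analysis.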
[0047]
The control unit of the computer on which the program for the present invention operates generates a mutated base sequence by mutating the given base sequence to be analyzed according to the program. The method for imparting mutation can be arbitrarily set as a parameter. For example, in the embodiment, mutations are given to all the bases constituting the analysis target base sequence. In addition to this, for example, in order to evaluate the influence of specific SNPs in the analysis target base sequence on the functional site, the program can be modified so that only a specific base is mutated. The mutated base sequence generated by the control unit is stored in the memory unit.
[0048]
In the system of the present invention, means b), the means for quantifying the higher-order structure of RNA having the base sequence to be analyzed and the higher-order structure of RNA having the mutant base sequence, can be constituted, for example, as follows. The control unit of the computer on which the program of the present invention runs reads the base sequence information generated by means a) from the memory unit and functions as a means for predicting and quantifying the higher-order structure based on the algorithm given as the program. The resulting numerical values are stored in the memory unit.
[0049]
Further, in the system of the present invention, means c), the means for comparing the numerical values obtained by means b) and obtaining the difference in higher-order structure as a numerical difference, can be constituted, for example, as follows. The control unit of the computer on which the program of the present invention runs functions as a means for sequentially reading the numerical values corresponding to the higher-order structures stored in the memory unit and comparing them with the numerical value generated from the base sequence to be analyzed. The result of the comparison is stored as a numerical value in the memory unit.
[0050]
In the system of the present invention, means d), the means for identifying the part having a mutation in the mutant base sequence as a part of the functional site of the base sequence to be analyzed when a significant difference is found in the numerical values of means c), can be constituted, for example, as follows. The control unit of the computer on which the program of the present invention runs functions as a means for sequentially reading the numerical values of the comparison results stored in the memory unit and comparing them with the criterion given as the program to determine whether a significant difference has been found. The mutation site in each mutant base sequence for which a significant difference was determined is identified as a functional site in the RNA having the base sequence to be analyzed.
[0051]
The system of the present invention can output the following information based on the determination result.
1. The base identified as a functional site in the base sequence to be analyzed, and its significance level. (Table 1)
2. In the base sequence to be analyzed, the base that is assumed to have little effect on the higher order structure, and its significance level.
3. The level of change in higher-order structure when a mutation is given to each base constituting the base sequence to be analyzed (for example, FIG. 7). This information can be output as numerical data in tabular form or as a graph.
4. The higher-order structures themselves formed by the base sequence to be analyzed and by the mutant base sequences having a mutation at each base (for example, FIG. 6 and FIG. 10). For example, a secondary structure is obtained when the Zucker algorithm is used, and a tertiary structure when Insight is used.
5. The representation of the higher-order structure obtained in the step of quantifying the higher-order structures formed by the base sequence to be analyzed and by the mutant base sequences having a mutation at each base (for example, FIG. 6 or FIG. 10). For example, with the method used in the present invention, which quantifies the secondary structure by expressing it as a tree structure, the tree-structure representation of the higher-order structure formed by each mutant base sequence can be output. Referring to this information shows how a mutation affected the structure.
[0052]
Such information can be output in any form such as display on a screen or output on paper. When the analysis system of the present invention is operated on a network, the result can be output to a screen of a different computer via the network. Alternatively, the output result can be transmitted as a data file to different computers.
[0053]
FIG. 1 is a hardware configuration of the system of the present invention, and FIG. 2 is a flowchart of the method of the present invention.
(1) An RNA base sequence is input from the input unit. (Step 1)
(2) The control unit creates a set of sequences in which a mutation is added to each base of the input sequence, based on the mutant sequence creation program stored on the hard disk, and stores the set in the memory. (Step 2)
(3) The control unit calculates the higher-order structures of the set of sequences stored in the memory, based on an arbitrary higher-order structure calculation program, and stores them in the memory. (Step 3)
(4) Based on the inter-structure distance calculation program, the control unit calculates the distances between the higher-order structures stored in the memory, using a distance function that quantifies the degree of similarity between higher-order structures. (Step 4)
(5) In this way, the degree to which the original higher-order structure changes when a mutation is given to each base of the input base sequence is calculated as a numerical value, and the result is displayed on the display unit. (Step 5)
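As a non-limiting illustration, steps 2 to 5 above might be combined as in the following Python sketch. The structure prediction, the conversion into a numeric sequence, and the distance function are passed in as the hypothetical callables `fold`, `encode`, and `distance`, so that any higher-order structure calculation program can be plugged in. Taking the maximum over the three substitutions at each position is an assumption made for illustration; the specification does not state how the three values for one position are combined.

```python
BASES = "acgu"  # assumed RNA alphabet, written in lowercase

def structural_change_profile(seq, fold, encode, distance):
    """Return one structural-change value per base position of seq.

    fold(sequence)  -> predicted structure (any representation)
    encode(struct)  -> numeric sequence describing the structure
    distance(a, b)  -> non-negative number comparing two numeric sequences
    """
    seq = seq.lower()
    reference = encode(fold(seq))  # structure of the unmutated sequence
    profile = []
    for i, original in enumerate(seq):
        changes = []
        for base in BASES:
            if base == original:
                continue
            mutant = seq[:i] + base + seq[i + 1:]
            changes.append(distance(reference, encode(fold(mutant))))
        # Assumption: summarize the substitutions at this position by the largest change.
        profile.append(max(changes))
    return profile
```

The returned list corresponds to the kind of per-position data that is plotted against base position in step 5.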
[0054]
【Examples】
The present invention will be specifically described below with reference to examples, but the present invention is not limited to these examples. All documents cited in this specification are incorporated herein as a part of this specification.
[0055]
As an example of the present invention, an example is shown in which the importance of the functional site and SNPs is predicted using the base sequence of human U2 snRNA without comparing with the base sequence information of other species. U2 snRNA is an RNA without protein information and is thought to play an important function in intron splicing.
SEQ ID NO: 5 shows the base sequence of human U2 snRNA. First, two sequences were artificially created from this sequence: one in which the 12th base g was substituted with a (SEQ ID NO: 6), and one in which the 157th base g was substituted with a (SEQ ID NO: 7).
FIG. 4 shows the secondary structures calculated from these three base sequences using the Zucker algorithm. As is clear from FIG. 4, the mutation at the 12th base (2) greatly changes the overall secondary structure, whereas the mutation at the 157th base (3) causes almost no structural change.
In the following example, this process is carried out for every possible substitution at every base, the structural change in each case is quantified, and functional sites and SNPs are evaluated on that basis.
[0056]
FIG. 2 is a flowchart of the RNA functional site prediction method according to the present invention.
In step 1, the analysis target base sequence is input. Here, the base sequence of human U2 snRNA (SEQ ID NO: 1) is given as an input.
In step 2, base sequences with a mutation added are prepared. In this example, the mutations added are limited to single-base substitutions. Since three single-base substitutions are possible for each base, sequences are prepared, as shown in FIG. 5, in which each base in turn is substituted with each of the three bases other than itself. The sequence of U2 snRNA consists of 187 bases, so a total of 187 × 3 = 561 base sequences are prepared. A flowchart of the mutant sequence creation program used for this purpose is shown in FIG. 3.
[0057]
In step 3, higher-order structures are calculated for all the generated base sequences. In this example, the secondary structure is taken as the higher-order structure, and Michael Zucker's secondary structure prediction algorithm is used as the higher-order structure calculation program (M. Zucker & P. Stiegler, Nucleic Acids Research 9, 133-148, 1981). Zucker's secondary structure prediction program comprehensively searches for the secondary structure of RNA that minimizes the energy, based on a given base sequence and energy parameters.
[0058]
In step 4, as the program for calculating the distance between secondary structures, an algorithm was used that converts the secondary structures expressed as tree structures, as shown in FIG. 6 and FIG. 10, into arrays, aligns them, and compares them to calculate the distance between the secondary structures. Here, a tree structure is defined recursively as a structure consisting either of a single node or of a node having a plurality of tree structures as children ordered from left to right. The root node of a tree structure is the only node in the tree that has no parent. The rank of a node is the number of child nodes immediately below it. Each node of the tree structure corresponds to a loop structure in the secondary structure of the RNA, and each path connecting two nodes corresponds to a stem structure.
A tree structure can be represented uniquely by listing the ranks of all its nodes in order, starting from the root node, from parent to child and from left to right. This array representation carries the structural features of the RNA secondary structure as a numeric sequence, so that comparison between secondary structures reduces to comparison between arrays.
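As a non-limiting illustration, the conversion of a secondary structure into a sequence of node ranks might be sketched in Python as follows. The sketch makes several assumptions that are not stated in the specification: the predicted structure is given in dot-bracket notation, consecutive stacked base pairs are treated as one stem, the unpaired exterior region is treated as the root loop, and "from parent to child and from left to right" is read as a depth-first, left-to-right traversal. The function names are hypothetical.

```python
def pair_table(dot_bracket):
    """Map each paired position to its partner (0-based indices)."""
    stack, partner = [], {}
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner

def rank_sequence(dot_bracket):
    """Return the ranks of all loop nodes, root first, depth-first, left to right."""
    partner = pair_table(dot_bracket)
    ranks = []

    def visit_loop(start, end):
        # Collect the stems that open directly inside this loop (positions start..end-1).
        children = []
        i = start
        while i < end:
            if i in partner and partner[i] > i:
                children.append((i, partner[i]))
                i = partner[i] + 1
            else:
                i += 1
        ranks.append(len(children))  # rank = number of child loops
        for a, b in children:
            # Follow stacked pairs to the inner end of the stem.
            while a + 1 < b - 1 and partner.get(a + 1) == b - 1:
                a, b = a + 1, b - 1
            visit_loop(a + 1, b)

    visit_loop(0, len(dot_bracket))
    return ranks
```

For example, a single hairpin `((..))` gives `[1, 0]`: the root loop has one child, the hairpin loop, which has none.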
[0059]
The arrays are compared by dynamic programming.
The distance d (n, m) between the arrays Sa (n) and Sb (m) is defined as follows. As shown in Equation 1, the cost between two nodes is the absolute value of the difference between their ranks, and the cost when a node has no counterpart in the other array is 1. As shown in Equation 2, the distance is then the sum of the costs when the arrays are aligned so as to minimize the total cost.
[0060]
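[Expression 1]
(Formula 1)
$$\mathrm{Costs}(n, m) = \lvert S_a(n) - S_b(m) \rvert, \qquad \mathrm{Costs}(0) = 1$$

[Expression 2]
(Formula 2)
$$d(n, m) = \min\{\, d(n-1, m-1) + \mathrm{Costs}(n, m),\; d(n, m-1) + \mathrm{Costs}(0),\; d(n-1, m) + \mathrm{Costs}(0) \,\}$$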

[0061]
This process aligns the arrays so that the homologous parts of the two tree structures match, and calculates the distance between the structures by taking the difference in node rank for the parts that do not match.
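The dynamic programming comparison can be sketched as follows, using the recurrence given in the claims (d(n, m) as the minimum of a match step costing |Sa(n) − Sb(m)| and two gap steps costing 1). The boundary values d(n, 0) = n and d(0, m) = m are an assumption, since the text does not state them, and the function name is hypothetical.

```python
def structure_distance(sa, sb, gap_cost=1):
    """Distance between two rank arrays by dynamic programming.
    Matching two nodes costs |rank difference|; leaving a node
    unmatched costs gap_cost (Costs(0) = 1 in the text)."""
    n, m = len(sa), len(sb)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary: aligning against an empty array is all gaps.
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + gap_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + abs(sa[i - 1] - sb[j - 1]),  # match
                d[i][j - 1] + gap_cost,                        # gap in sa
                d[i - 1][j] + gap_cost,                        # gap in sb
            )
    return d[n][m]

print(structure_distance([1, 2, 0, 0], [1, 2, 0, 0]))  # 0
print(structure_distance([1, 0], [2, 0, 0]))           # 2
```

The second example aligns 1 with 2 (cost 1) and one 0 with 0 (cost 0), and leaves the remaining 0 unmatched (cost 1).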
[0062]
FIG. 7 is a graph obtained by applying the method of the present invention to U2 snRNA, showing the position of each base and the degree of structural change when that base is mutated. In step 5 of the flowchart of this method (FIG. 2), functional sites are predicted and SNPs are evaluated on the basis of this graph. These evaluation methods are explained below.
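How the per-base profile plotted in such a graph might be assembled from the preceding steps can be sketched as follows. This is only an outline under stated assumptions: `fold` (a structure predictor returning a rank array) and `distance` are passed in as callables rather than implemented here, and combining the three substitution distances at each position with `max` is a choice made for illustration, since this section does not say how the three values are combined.

```python
def structural_change_profile(seq, fold, distance, combine=max):
    """Return one structural-change value per base position.

    fold:     hypothetical callable, sequence -> structure representation
    distance: hypothetical callable, (structure, structure) -> number
    combine:  how the three per-substitution distances are merged
              (max is an assumption, not taken from the text)
    """
    seq = seq.lower()
    original = fold(seq)
    profile = []
    for i, base in enumerate(seq):
        distances = [
            distance(original, fold(seq[:i] + alt + seq[i + 1:]))
            for alt in "acgu"
            if alt != base
        ]
        profile.append(combine(distances))
    return profile

# Toy stand-ins so the sketch runs: the "structure" is just the g count.
toy_fold = lambda s: [s.count("g")]
toy_distance = lambda a, b: abs(a[0] - b[0])
print(structural_change_profile("gca", toy_fold, toy_distance))  # [1, 1, 1]
```

Passing the predictor and the distance function as parameters keeps the loop independent of any particular folding program.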
[0063]
Let c (i) denote the distance by which the structure changes from the original structure when a mutation is artificially added to each base. Here, the variable i represents the position of the base and, in the case of U2 snRNA, takes the values i = 1, 2, 3, and so on up to the sequence length (187 bases). Assuming that the values of c (i) with c (i) ≠ 0 follow a normal distribution, and letting N be the total number of bases i with c (i) ≠ 0, the average value Ca and the standard deviation σ are determined as follows.
[Equation 3]
(Formula 3)

[0064]
[Expression 4]
(Formula 4)
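Formulas 3 and 4 are not reproduced in this text. Under the definitions above, the mean and standard deviation over the bases with non-zero c (i) would presumably take the usual form shown below. Dividing by N in the standard deviation is an assumption made here; the original formula may divide by N − 1 instead.

```latex
C_a = \frac{1}{N} \sum_{i \,:\, c(i) \neq 0} c(i),
\qquad
\sigma = \sqrt{\frac{1}{N} \sum_{i \,:\, c(i) \neq 0} \bigl( c(i) - C_a \bigr)^2 }
```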

At this time, a base i whose c (i) takes one of the following values is judged to be either a site that greatly changes the structure or a site that does not affect the structure.
[0065]
[Expression 5]
(Formula 5)

At a significance level of 10%, base i does not affect the structure.
[0066]
[Expression 6]
(Formula 6)

At a significance level of 30%, base i does not affect the structure.
[0067]
[Expression 7]
(Formula 7)

At a significance level of 30%, base i changes the structure.
[0068]
[Expression 8]
(Formula 8)

At a significance level of 10%, base i changes the structure.
Among these, a site containing bases that change the structure is judged to be a functional site, and SNPs at such a site are evaluated as likely to interfere with RNA function. Conversely, SNPs at bases that do not change the structure are evaluated as having little effect on function.
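The classification can be sketched as follows. Formulas 5 to 8 are not reproduced in this text, so the thresholds here are assumptions: the upper thresholds Ca + σ and Ca + 1.65σ are taken from the claims and are assumed to correspond to the 30% and 10% levels respectively, the lower thresholds are mirrored by symmetry, and the population standard deviation is used. The function name and label strings are hypothetical.

```python
from statistics import fmean, pstdev

def classify_positions(c, z_strict=1.65, z_loose=1.0):
    """Label each base position (1-based) from its structural change c(i).
    Mean and standard deviation are taken over non-zero values only."""
    nonzero = [v for v in c if v != 0]
    if not nonzero:
        raise ValueError("no base changes the structure")
    ca = fmean(nonzero)
    sigma = pstdev(nonzero)  # assumed population form (divide by N)
    labels = {}
    for pos, v in enumerate(c, start=1):
        if v >= ca + z_strict * sigma:
            labels[pos] = "changes structure (10%)"
        elif v >= ca + z_loose * sigma:
            labels[pos] = "changes structure (30%)"
        elif v <= ca - z_strict * sigma:
            labels[pos] = "no effect (10%)"
        elif v <= ca - z_loose * sigma:
            labels[pos] = "no effect (30%)"
        else:
            labels[pos] = "not significant"
    return labels

profile = [0, 1, 1, 1, 1, 1, 1, 1, 1, 10]
print(classify_positions(profile)[10])  # changes structure (10%)
```

In the toy profile the non-zero values have mean 2 and standard deviation of about 2.83, so only the value 10 exceeds the stricter upper threshold of about 6.67.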
[0069]
FIG. 7 is a graph of the values of c (i), together with the average value Ca and the values used as determination criteria. FIG. 7 shows that the structure changes when mutations are introduced at bases 10 to 30, 50 to 60, and 100 to 110 from the 5 ′ end. In FIG. 7, these regions are indicated by circles. By contrast, the structure changes relatively little when mutations are introduced at bases 150 to 180. These results are consistent with the structural changes for the mutations at positions 12 and 157 shown in FIG. 4, which indicates that the structural changes are being evaluated quantitatively and correctly.
[0070]
Alternatively, from the values of c (i), the bases whose mutation produces a significant difference can be extracted for each significance level. Table 1 shows the positional information of the bases extracted in this way.
[Table 1]

[0071]
That is, reading the graph of FIG. 7, mutations (SNPs) at base sites such as 10-30, 50-60, and 100-110 are expected to affect function more strongly than mutations (SNPs) at the 150-180 base site, and are likely to be lethal. Similarly, the 10-30 base site is seen to be a relatively flexible structural site, whereas the 150-180 base site forms a stable and rigid structure.
Next, these prediction results are compared with information obtained from observation of actual RNA base sequences and structures, and the validity of the prediction results is examined.
[0072]
FIG. 8 shows the conservation ratio between species at each base site, calculated by comparing the homology of the nucleotide sequences of natural U2 snRNAs from multiple species (25 species, including Drosophila, nematode, Arabidopsis thaliana and Xenopus laevis). The closer the conservation ratio is to 1, the more strongly the base tends to be conserved across all species. FIG. 8 shows that base sites such as 10-30, 50-60, and 100-110 are highly conserved, whereas the 150-180 base site is not well conserved. This indicates that mutations at the functional sites judged from the graph of FIG. 7 have not been tolerated in evolution, and that SNPs at these sites are likely to be lethal. The prediction result according to the present invention can therefore be considered valid.
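One way such a per-site conservation ratio might be computed from aligned sequences is sketched below. The text does not define the ratio, so taking it as the fraction of species that share the most common base at each aligned position is an assumption, as are the function name and the toy alignment.

```python
from collections import Counter

def conservation_ratios(aligned):
    """Return, for each column of an alignment, the fraction of sequences
    carrying the most common character (assumed definition).
    Gap characters, if present, are counted like any other character."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to equal length")
    ratios = []
    for column in zip(*aligned):
        top_count = Counter(column).most_common(1)[0][1]
        ratios.append(top_count / len(aligned))
    return ratios

# Toy alignment of four sequences (illustrative, not U2 snRNA data).
alignment = ["auugccggaua", "acggccauaua", "ucugccguaua", "uucgccuaaua"]
print(conservation_ratios(alignment)[3:6])  # [1.0, 1.0, 1.0]
```

A ratio of 1 means that every sequence has the same base at that position, matching the reading of the conservation ratio given above.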
[0073]
Furthermore, according to biological experiments (Madhani HD, Guthrie C, Annu. Rev. Genet. 28: 1-26, 1994), it is known that, among the four hairpin sites constituting U2 snRNA (shown in FIG. 4 (1)), hairpin I maintains its hairpin structure in the early stage of the splicing process, but later opens widely and forms complementary base pairs with U6 snRNA. This is consistent with the prediction of the present invention that the 10-30 base site, which corresponds to hairpin I, forms a flexible structure. Other experiments have shown that an RNA binding protein binds to hairpin IV (Scherly, D. et al., Nature, 345, 502-506, 1990). This is considered to correspond to the prediction that the 150-180 base site, which corresponds to hairpin IV, forms a relatively stable structure, and it supports the validity of the prediction result according to the present invention.
[0074]
Incidentally, genes encoding human U2 snRNA are known to exist at multiple positions on the DNA. Most of these sequences are homologous, but one gene lacks the base U located at position 47 from the 5 'end (Hausner, TP, Giglio, LM and Weiner, AM, Evidence for base-pairing between mammalian U2 and U6 small nuclear ribonucleoprotein particles, Genes Dev. 4 (12A), 2146-2156, 1990).
Meanwhile, FIG. 7 obtained by this method shows that the structural change distance of the 47th base is c (47) = 0. In other words, an RNA exists that actually carries a mutation at a position where this method predicts SNPs to be harmless, and that RNA still functions normally. This method is therefore considered useful for evaluating SNPs and other mutations.
[0075]
Next, an example in which this method is applied to human mitochondrial diseases in which SNPs are considered to cause actual diseases will be described.
Mitochondrial disease is one of the most common genetic diseases. Its symptoms appear in the muscles and central nervous system and often include convulsions and dementia. Patients often carry mutations in mitochondrial tRNA genes, and the symptoms are known to vary depending on the type of tRNA that carries the mutation (Schon EA et al., J. Bioenerg. Biomembr. 29: 131-149, 1997).
[0076]
In this example, lysine tRNA (tRNA-Lys) is targeted. The nucleotide sequence of tRNA-Lys is shown in SEQ ID NO: 8. In tRNA-Lys, a mutation of the base g at position 19 from the 5 'end to a, and a mutation of the base g at position 48 to a, have been identified in patients with mitochondrial disease (Verma, A. et al., Pediatric Research 42 (4): 448-454, 1997; Tiranti, V. et al., Neuromuscular Disorders 9 (2): 66-71, 1999).
Just as in the case of U2 snRNA, FIG. 9 shows a graph showing the degree of structural change in the case where each base is mutated, calculated by the process of FIG. From FIG. 9, the structural change distances at the positions of the mutations identified in patients with mitochondrial disease are both 3, and it is judged that SNPs to these positions are not safe.
Thus, it was confirmed that this technique is effective as a technique for predicting SNPs that actually cause diseases.
[0077]
[Effects of the Invention]
The present invention can identify a functional site in a base sequence based on one base sequence to be analyzed. For example, in a known method for identifying a base sequence conserved among species as a functional site, it is necessary to clarify the base sequences of homologous genes of a plurality of species. In addition, since identification of functional sites requires that the base sequences differ to some extent, it is difficult to accumulate base sequence information necessary for analysis. On the other hand, the method of the present invention that can identify the functional site based on a single base sequence can be carried out much more easily.
[0078]
The present invention is also useful as a method for evaluating the influence of single base substitution on functional sites. As the structure of the genome is being elucidated, the prediction of the effect of SNPs on the functional site of RNA is one of the important research subjects. In order to evaluate the influence of human SNPs, which have an enormous amount of information, on the functional site of RNA, a method with excellent analysis ability is essential. According to the present invention, regardless of the presence / absence of information on homologous genes of other species, the influence of specific SNPs on the functional site of RNA can be easily evaluated. Furthermore, based on the evaluation results according to the present invention, SNPs can be classified according to the magnitude of the effect on RNA function.
[0079]
Alternatively, functional sites on RNA can be identified with high resolution by substituting every base constituting a gene and evaluating the effect of each substitution on the higher-order structure. By superimposing information on SNPs, which are being discovered one after another, onto the functional sites of RNA revealed by the analysis of the present invention, SNPs that may affect a functional site can be easily identified.
[0080]
[Sequence Listing]

[Brief description of the drawings]
FIG. 1 is a schematic diagram showing a hardware configuration of the present invention.
FIG. 2 is a flowchart of RNA functional site prediction according to the present invention.
FIG. 3 is a flowchart showing an example of a mutation addition program for bases used in the present invention.
FIG. 4 is a diagram showing the secondary structures calculated from the base sequences of human U2 snRNA (1), of the sequence in which the 12th base is changed from g to a (2), and of the sequence in which the 157th base is changed from g to a (3). The positions of hairpins I-IV of human U2 snRNA are also shown.
FIG. 5 is a diagram showing a process of artificially adding a single base substitution mutation in the present invention.
FIG. 6 is a schematic diagram showing one embodiment of a method for calculating a distance between higher order structures used in the present invention.
FIG. 7 is a graph showing the results of predicting the functional site of human U2 snRNA according to the present invention. The vertical axis indicates the distance of change in secondary structure caused by mutation, and the horizontal axis indicates the position of the base with 1 at the 5 ′ end. Plots corresponding to bases that produced significant differences due to mutations and bases that did not produce significant differences were circled.
FIG. 8 is a graph showing the conservation ratio between species at each base position, obtained by comparing homology between nucleotide sequences of multiple types of U2 snRNA.
FIG. 9 is a graph showing the results obtained when the present invention was applied to human mitochondrial tRNA-Lys. The vertical axis indicates the distance of change in secondary structure caused by mutation, and the horizontal axis indicates the position of the base with 1 at the 5 ′ end.
FIG. 10 shows a secondary structure obtained when mutations are made to the 12th and 157th bases among the bases constituting human U2 snRNA, and a tree structure corresponding to this structure.

Claims

A system for identifying a functional site of RNA, comprising at least: a memory unit that stores the base sequence to be analyzed of the RNA whose functional site is to be identified, a numerical sequence obtained by quantifying the secondary structure of the RNA, and all mutant base sequences that can be artificially generated from the base sequence to be analyzed; a control unit; and a display unit that displays the position i of the mutated base in an identified functional site and the difference C (i) in secondary structure between the RNA whose functional site is to be identified and the RNA of the mutant base sequence; and further comprising:
a) means for generating single-base-substitution mutant base sequences by artificially substituting, in the base sequence to be analyzed read from the memory unit, each constituent base with each of the three bases different from that base,
b) means for digitizing the secondary structure of the RNA having the base sequence to be analyzed or a mutant base sequence, by reading the base sequence to be analyzed and the mutant base sequences from the memory unit and using a functional module that predicts the secondary structure of the RNA having the base sequence to be analyzed or the mutant base sequence, and a functional module that represents each predicted secondary structure as a tree structure consisting of loop structures and stem structures and converts it uniquely into a numerical sequence based on the nodes and ranks of the tree structure,
c) means for calculating the inter-structure distance (d (n, m)), so as to obtain the difference in secondary structure as a numerical difference, by sequentially comparing the numerical sequence ( Sa (n) ) for the base sequence to be analyzed obtained in b) with the numerical sequence ( Sb (m) ) for each mutant base sequence, aligning them so that the homologous parts match, and calculating the distance between the non-matching parts as the difference in the rank of the nodes, using the following formulas,
(Formula 2)
d (n, m) =
min {d (n-1, m-1) + Costs (n, m), d (n, m-1) + Costs (0), d (n-1, m) + Costs (0)}
where, in Formula 2 above,
(Formula 1)
Costs (n, m) = | Sa (n) − Sb (m) |
Costs (0) = 1.
d) means for identifying the position i of the mutated base in a mutant base sequence as part of the functional site of the base sequence to be analyzed when the difference C (i) in secondary structure between the RNA of the base sequence to be analyzed and the RNA of that mutant base sequence, defined by the inter-structure distance (d (n, m)) obtained as the difference in secondary structure in c) (where the variable i represents the position of the mutated base in the mutant base sequence), is larger than the average value Ca of the difference C (i) by the standard deviation σ or more; and
e) means for outputting to the display unit the base position i of the functional site identified in d) and the difference C (i) in the secondary structure;
The system for identifying a functional site of RNA, wherein the control unit includes each of the means a) to e).

The system for identifying a functional site of RNA according to claim 1, wherein the means d) identifies the position i of the mutated base in the mutant base sequence as part of the functional site of the base sequence to be analyzed when the difference C (i) is larger than the average value Ca of C (i) by 1.65σ or more.