JP4557609B2

JP4557609B2 - Method for displaying mapping of splice variant sequences

Info

Publication number: JP4557609B2
Application number: JP2004170276A
Authority: JP
Inventors: 宏一木村; 哲夫西川; 啓一永井; 知弘安田; 潤一上地
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2004-06-08
Filing date: 2004-06-08
Publication date: 2010-10-06
Anticipated expiration: 2024-06-08
Also published as: JP2005352590A

Description

The present invention relates to the informatic analysis of gene sequences, and in particular to analyzing the correlation among the exon-intron structures of large numbers of splice variant sequences, their expression information, and their transcription start positions.

In eukaryotes, including humans, the same primary transcript from a single gene region on the genome can undergo different splicing processes, producing various mRNAs with different sequences. These are called splice variants. They give rise to proteins with different sequences, which therefore perform different functions in vivo. In addition, promoter and regulatory regions lie upstream of a gene region on the genome sequence, and they control when, under what conditions, and from which base position transcription of the gene starts. By changing its transcription start position, a single gene may produce proteins with different sequences that perform different functions in vivo. These are important in vivo mechanisms that give a gene diversity according to differences in the tissue in which it is expressed, the developmental stage, and so on.

Sequence data for mRNA expressed in vivo are obtained by preparing a cDNA library and reading the base sequences with a sequencer. Because the number of bases a sequencer can read at one time is limited, obtaining a sequence that spans the full length of an mRNA usually requires repeated sequencing, which is costly. Therefore, to obtain information easily on which genes are expressed in different tissues, developmental stages, and so on, large numbers of EST (Expressed Sequence Tag) sequences, also called single-pass sequences, have been collected; each is obtained by reading only part of an mRNA sequence once with a sequencer.

mRNA sequences that differ from one another as splice variants have different exon-intron structures and, in some cases, different transcription start positions. The exon-intron structure and transcription start position of an mRNA sequence are determined by mapping the sequence onto the genome sequence with a sequence similarity search program (Non-Patent Document 4). Exons are identified as the segments that are homologous between the genome sequence and the mRNA sequence, and introns as the genomic segments flanked by exons. The transcription start position is identified as the position on the genome to which the 5' end of the mRNA sequence maps.
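The relationship between mapped exon blocks, introns, and the transcription start position can be sketched as follows. This is a minimal illustration, not the mapping program of Non-Patent Document 4: it assumes the alignment has already produced a list of exon intervals, and the half-open `(start, end)` convention, the `strand` argument, and the function names are assumptions introduced here.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # half-open (start, end) genome interval; an assumed convention


def introns_from_exons(exons: List[Exon]) -> List[Exon]:
    """Introns are the genomic gaps flanked by consecutive exon blocks."""
    ordered = sorted(exons)
    return [(left_end, right_start)
            for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])]


def transcription_start(exons: List[Exon], strand: str = "+") -> int:
    """Genomic position to which the 5' end of the mapped sequence aligns."""
    ordered = sorted(exons)
    if strand == "+":
        return ordered[0][0]       # leftmost aligned base
    return ordered[-1][1] - 1      # rightmost aligned base on the reverse strand


exons = [(100, 200), (500, 650), (900, 1000)]
print(introns_from_exons(exons))
print(transcription_start(exons))
```

The start position is meaningful only when the sequence is known to contain a complete 5' end, a condition the description addresses separately.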

The result of mapping mRNA sequences to the genome sequence is usually displayed by arranging the exons along the genome sequence (Non-Patent Documents 1, 2, and 3). This display helps in understanding how splice variants differ. Because introns are generally much longer than exons, there is also a display method that compresses the common introns before arranging the exons along the genome sequence, so that differences among splice variants are visualized more effectively (Patent Document 1).

Patent Document 1: JP 2003-256434 A
Non-Patent Document 1: The NCBI Handbook, 2002 Nov., Part 3, chapter 20, p. 20-16, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books
Non-Patent Document 2: Wolfsberg, T.G., Wetterstrand, K.A., Guyer, M.S., Collins, F.S., and Baxevanis, A.D. (2004). A User's Guide to the Human Genome. Nature Genetics 35 (supplement): 1-79.
Non-Patent Document 3: Yutaka Suzuki, Riu Yamashita, Sumio Sugano, and Kenta Nakai, DBTSS, DataBase of Transcriptional Start Sites: progress report 2004, Nucl. Acids. Res. 2004 32: D78-D81.
Non-Patent Document 4: Florea, L. et al., A Computer Program for Aligning a cDNA Sequence with a Genomic DNA Sequence, Genome Research 8: 967-974, 1998.

A large amount of EST sequence data, already several million sequences or more, has been collected to investigate how gene expression changes with tissue, presence or absence of disease, developmental stage, elapsed time after an external stimulus, and so on. When these sequences are mapped onto the genome, thousands to tens of thousands of EST sequences may map to a particular location. At such locations, conventional methods of visualizing mapping results must draw a number of line segments proportional to the number of EST sequences. This raises the display cost and makes it difficult for a person viewing the mass of displayed data to understand which types of splice variants exist. Moreover, for biological interpretation it is important to examine how splicing or the transcription start position differs with the conditions of gene expression. Conventional visualization methods do not make explicit how splice variant classification or transcription start position relates to expression information, so such relationships have been difficult to discover.

An object of the present invention is to provide a splice variant display method that shows, in an easily understood form, how splice variant types and transcription start positions relate to expression information.

For large numbers of EST sequences and full-length mRNA sequences, the present invention visualizes the differences among splice variant types, the expression information for each type, and the expression information for each transcription start position. It consists of the following processing steps. Hereinafter, EST sequences and full-length mRNA sequences are collectively referred to as cDNA sequences.

A step of mapping the cDNA sequences to the genome sequence by sequence similarity search and determining the exon-intron structure and transcription start position of each cDNA sequence.

A step of comparing the exon-intron structures of the cDNA sequences and classifying the cDNA sequences into types, each presumed to derive from the same splice variant.

A step of grouping the cDNA sequences that belong to the same type and visualizing the exon-intron structure of each type along the genome sequence.

A step of collecting the expression information of the cDNA sequences classified into the same type and displaying a breakdown of the expression information for each type.

A step of determining the transcription start position of each cDNA sequence that contains a 5' end and collecting, for each position on the genome sequence, the cDNA sequences whose transcription starts there.

A step of collecting, for each position on the genome sequence, the expression information of the cDNA sequences whose transcription starts there and displaying a breakdown of the expression information for each transcription start position.

Because the amount of drawing needed to visualize the mapping results is proportional to the number of types into which the cDNA sequences are classified rather than to the number of cDNA sequences, for several million cDNA sequences, for example, it is reduced to roughly one part in several tens of that required by the conventional display method. As a result, analysis results for a large amount of sequence data can be displayed compactly, and the amount of data a person has to look at is reduced, which makes the results easier to understand. In addition, because the differences in expression information among types and among transcription start positions are shown explicitly, biological interpretation becomes easier; that is, it becomes easier to examine how splicing or the transcription start position changes with the conditions of gene expression.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 shows an example display of the results of splice variant analysis performed according to the present invention on a large number of cDNA sequences.

In FIG. 1, 101 is a visualization, following a known method (Patent Document 1), of the exon-intron structure of each type obtained when the cDNA sequences derived from one gene region are classified as splice variants. 102 is the genome sequence coordinate axis, compressed by removing from the genome sequence the introns common to the splice variants. 103 is the axis along which the types classified as splice variants are arranged. 104 is a line segment representing one exon in one type. The exon-intron structure of a type is represented by arranging several line segments 104 along axis 102. End points are marked at both ends of each line segment so that adjoining line segments can be distinguished. A gap between line segments 104, such as 105, indicates that in other splice variant types a base sequence can be inserted there, either because another exon is inserted or because the splice site changes in one or both of the exons represented by the flanking line segments 104. These representations for all types are arranged along axis 103. The cDNA sequences with no splice site at all are grouped together, and the range they can occupy along axis 102 is shown by 106.
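The compressed coordinate axis 102 can be illustrated with a small sketch. The method of Patent Document 1 is not reproduced in this text, so the following is only one plausible reading: genomic stretches covered by no exon of any type are treated as common introns and removed entirely. The function names and the half-open interval convention are assumptions.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end); an assumed convention


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Union of all exon intervals: the genomic stretches kept on the compressed axis."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def compress_coordinate(pos: int, kept: List[Interval]) -> int:
    """Map a genome position to the axis left after the common introns are removed."""
    offset = 0
    for start, end in kept:
        if pos < start:
            return offset                   # inside a removed intron: snap to next block
        if pos < end:
            return offset + (pos - start)
        offset += end - start
    return offset                           # beyond the last kept block


kept = merge_intervals([(100, 200), (150, 250), (1000, 1100)])
print(kept, compress_coordinate(1000, kept))
```

A real display would probably leave a small visible gap for each removed intron rather than collapsing it to zero width; that refinement is omitted here.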

107 shows the expression information for each type along axis 103. 108 is the axis representing expression intensity. 109 represents the expression information for one type. Hatching or color shows the breakdown of expression intensity by expression information class, and the total shows the expression intensity of that type. Similarly, 110 represents the expression information for all the cDNA sequences that have no splice site.

111 shows the expression information for each transcription start position along axis 102. 112 is the axis representing expression intensity. 113 represents the expression information for one transcription start position. Hatching or color shows the breakdown of expression intensity by expression information class, and the total shows the expression intensity at that transcription start position.

114 is a list box for selecting the gene region to be displayed in 101. Each element of the list shows a gene region name (cluster name) 115 and gene region annotation information 116. Gene region annotation includes, for example, detailed information on the position of the gene region on the chromosome and a collection of the annotations of the full-length mRNA sequences belonging to that gene region (cluster). 117 is a marker showing the currently selected list box element. List box 114 displays the clusters that have been filtered, sorted, or selected according to the conditions entered in search window 118. Search conditions include the chromosome number and position on the chromosome, the name of a gene that the cluster should contain or part of its annotation information, and conditions on the expression information of the cluster or of a transcription start position. The selection 117 is determined when the user views the contents of list box 114 displayed as a result of the input to search window 118 and, if necessary, reselects a list box item. The gene region selected by 117 is displayed in 101.

FIG. 5 is a schematic configuration diagram of the splice variant analysis and display device according to the invention. The device comprises an operation and input unit 501 for operating and providing input to the processing unit 503, a display unit 502 that displays processing results, and a storage unit 504 that stores the data and information to be processed by the processing unit 503.

The storage unit 504 stores cDNA sequence data 202, a collection of many cDNA sequences; genome sequence data 203; cDNA sequence end point information 207, a collection of information on the sequence ends of each cDNA sequence; cDNA sequence expression information 209, a collection of information on the expression of each cDNA sequence; and sequence annotation information 212, a collection of information on the annotation of each cDNA sequence.

The processing unit 503 comprises: a mapping processing unit 511 that maps each cDNA sequence in the cDNA sequence data 202 to the genome sequence given by the genome sequence data 203; a clustering processing unit 512 that forms clusters of cDNA sequences based on the mapping results; a variant type classification processing unit 513 that groups cDNA sequences with the same combination of splice sites into one variant type; a transcription start position classification processing unit 514 that groups cDNA sequences with the same transcription start position; a per-variant-type expression information classification processing unit 515 that collects and classifies the expression information of the cDNA sequences belonging to each variant type; a per-transcription-start-position expression information classification processing unit 516 that collects and classifies the expression information of the cDNA sequences belonging to each transcription start position; and a display processing unit 517 that displays the analysis results of the processing unit 503 on the display screen of the display unit 502, for example as shown in FIG. 1.

The processing procedure for obtaining the display shown in FIG. 1 is described with reference to FIG. 2. In 201, each cDNA sequence is mapped to the genome sequence using a known genome mapping technique (Non-Patent Document 4), from the cDNA sequence data 202 and the genome sequence data 203 input from the storage unit 504 to the processing unit 503. This processing is performed by the mapping processing unit 511. As a result, the positions on the genome sequence of the bases from which each cDNA sequence was transcribed are determined. The clustering processing unit 512 then performs the cDNA sequence clustering process 204 by gathering all cDNA sequences that share one or several base positions on the genome into a cluster (which corresponds to a gene region). To perform this clustering efficiently, the genome sequence coordinates of both end points of each exon of every cDNA sequence are obtained and sorted, and the exons containing a given coordinate, together with the cDNA sequences containing those exons, are placed in the same cluster.
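The sort-based clustering described above can be sketched as a sweep over exon intervals combined with a union-find structure. This is a minimal sketch under stated assumptions rather than the patented implementation: it joins cDNAs that share at least one base (not a configurable number of bases), ignores chromosome and strand, and uses hypothetical names and a half-open interval convention.

```python
from typing import Dict, List, Optional, Tuple

Exon = Tuple[int, int]  # half-open (start, end) genome interval; an assumed convention


def cluster_cdnas(exons_by_cdna: Dict[str, List[Exon]]) -> List[List[str]]:
    """Group cDNAs that share at least one genomic base, directly or transitively."""
    parent = {name: name for name in exons_by_cdna}

    def find(name: str) -> str:
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    # Sort every exon by genomic start, remembering which cDNA it came from.
    exons = sorted((s, e, name)
                   for name, blocks in exons_by_cdna.items() for s, e in blocks)

    reach_end: Optional[int] = None   # right edge of the current run of overlapping exons
    reach_name: Optional[str] = None  # any cDNA already in that run
    for start, end, name in exons:
        if reach_end is not None and start < reach_end:
            parent[find(name)] = find(reach_name)  # shares a base with the run
            reach_end = max(reach_end, end)
        else:
            reach_end, reach_name = end, name      # a gap: start a new run

    clusters: Dict[str, List[str]] = {}
    for name in exons_by_cdna:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


print(cluster_cdnas({"a": [(100, 200), (500, 600)], "b": [(550, 700)],
                     "c": [(1000, 1100)], "d": [(190, 210)]}))
```

Because all exons of one cDNA carry the same name, exons that are far apart on the genome still pull their overlapping neighbours into a single cluster, which matches the idea that a cluster corresponds to a gene region.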

The mapping process 201 also determines the exon-intron structure of each cDNA sequence, and therefore the positions of the splice sites on the genome. The variant type classification processing unit 513 performs the cDNA sequence variant type classification process 205 by grouping cDNA sequences with the same combination of splice sites into one variant type. Each cDNA sequence generally has several splice sites; two sequences are regarded as having the same combination of splice sites when the genome sequence coordinates of all their splice sites either match or differ by no more than a pre-specified few bases. The cDNA sequences that have no splice site are grouped together and classified as a single type.
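One way to realize this classification is sketched below. The text does not say how tolerance-based matching should be resolved when it is not transitive, so this sketch makes a greedy assumption: each cDNA is compared against the first member (the representative) of every existing type. The function names and the `tolerance` parameter are hypothetical.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # half-open (start, end) genome interval; an assumed convention


def splice_sites(exons: List[Exon]) -> Tuple[int, ...]:
    """Internal exon boundaries: the donor and acceptor coordinates of each intron."""
    ordered = sorted(exons)
    sites: List[int] = []
    for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
        sites.extend([left_end, right_start])
    return tuple(sites)


def classify_variant_types(exons_by_cdna: Dict[str, List[Exon]],
                           tolerance: int = 0) -> List[Tuple[Tuple[int, ...], List[str]]]:
    """Greedy grouping: join the first type whose representative sites all lie within tolerance."""
    types: List[Tuple[Tuple[int, ...], List[str]]] = []
    for name in sorted(exons_by_cdna):
        sites = splice_sites(exons_by_cdna[name])
        for representative, members in types:
            if len(representative) == len(sites) and all(
                    abs(a - b) <= tolerance for a, b in zip(representative, sites)):
                members.append(name)
                break
        else:
            types.append((sites, [name]))  # no match: this cDNA founds a new type
    return types


example = {
    "est1": [(100, 200), (500, 600)],
    "est2": [(150, 201), (499, 580)],
    "est3": [(100, 200), (300, 350), (500, 600)],
    "est4": [(520, 590)],
    "est5": [(120, 180)],
}
for sites, members in classify_variant_types(example, tolerance=2):
    print(sites, members)
```

Sequences without any splice site have an empty site tuple, so they all fall into one type, as the description requires.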

Depending on the sequencing procedure used, cDNA sequences include (1) those sequenced over their full length, (2) those sequenced only from the 5' end, (3) those sequenced only from the 3' end, and (4) those in which an arbitrary fragment was sequenced. Of these, (1) and (2) have a complete 5' end, and (1) and (3) have a complete 3' end. The transcription start position classification processing unit 514 reads the cDNA sequence end point information 207, which collects this information on sequence ends. For a cDNA sequence that has a complete 5' end, the transcription start position is determined when that 5' end is mapped onto the genome sequence in the mapping process 201. In 206, the transcription start position classification processing unit 514 collects the cDNA sequences whose transcription start positions are determined and classifies them into groups with the same transcription start position. Here, transcription start positions are regarded as the same when they match exactly in genome sequence coordinates or differ by no more than a pre-specified number of bases.
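The grouping in 206 can be sketched as follows. The end-type labels (`"full"`, `"five_prime"`, and so on) and the greedy rule of comparing each position with the first position of the current group are assumptions made for illustration; the text specifies only that positions within a pre-specified number of bases count as the same.

```python
from typing import Dict, List, Tuple

# Hypothetical labels for the four sequencing categories (1) to (4) in the text.
HAS_COMPLETE_5_PRIME = {"full", "five_prime"}


def group_by_tss(tss_by_cdna: Dict[str, int],
                 end_type_by_cdna: Dict[str, str],
                 tolerance: int = 0) -> List[Tuple[int, List[str]]]:
    """Group cDNAs with a complete 5' end by mapped transcription start position."""
    eligible = sorted((pos, name) for name, pos in tss_by_cdna.items()
                      if end_type_by_cdna.get(name) in HAS_COMPLETE_5_PRIME)
    groups: List[Tuple[int, List[str]]] = []
    for pos, name in eligible:
        # Compare with the anchor (first) position of the most recent group.
        if groups and pos - groups[-1][0] <= tolerance:
            groups[-1][1].append(name)
        else:
            groups.append((pos, [name]))
    return groups


tss = {"a": 100, "b": 102, "c": 110, "d": 101, "e": 100}
ends = {"a": "full", "b": "five_prime", "c": "five_prime",
        "d": "three_prime", "e": "fragment"}
print(group_by_tss(tss, ends, tolerance=3))
```

Sequences `d` and `e` are skipped because a 3'-only read or an arbitrary fragment does not establish where transcription starts.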

Each cDNA sequence carries various information about where its source mRNA was collected: under what conditions, from what individual, from what organ or tissue, and from what site. This is called the cDNA sequence expression information 209. In the per-variant-type expression information classification process 208, performed by the per-variant-type expression information classification processing unit 515, the expression information of the cDNA sequences belonging to each variant type obtained in 205 is collected from 209 and classified according to the viewpoint under investigation. For example, it may be classified by tissue; by whether the source is normal cells, cancer cells, or cells affected by another disease; by developmental stage, such as fetus or adult; or by the presence or absence of an external stimulus such as drug administration and the time elapsed since then. In the per-transcription-start-position expression information classification process 210, performed by the per-transcription-start-position expression information classification processing unit 516, the expression information of the cDNA sequences belonging to each transcription start position obtained in 206 is collected from 209 and classified according to the viewpoint under investigation, in the same way as in 208.
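The breakdown computed in 208 and 210 amounts to counting class labels per group. A minimal sketch, assuming the expression information is held as one small dictionary per cDNA sequence; the keys `"tissue"` and `"state"` and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def expression_breakdown(groups: Dict[str, List[str]],
                         expression_by_cdna: Dict[str, Dict[str, str]],
                         viewpoint: str) -> Dict[str, Counter]:
    """For each group (variant type or start position), count members per class label."""
    breakdown: Dict[str, Counter] = {}
    for group_id, members in groups.items():
        breakdown[group_id] = Counter(
            expression_by_cdna[name][viewpoint]
            for name in members
            if name in expression_by_cdna  # skip sequences with no recorded information
        )
    return breakdown


groups = {"type1": ["est1", "est2", "est3"], "type2": ["est4"]}
expression = {
    "est1": {"tissue": "liver", "state": "normal"},
    "est2": {"tissue": "liver", "state": "cancer"},
    "est3": {"tissue": "brain", "state": "normal"},
    "est4": {"tissue": "brain", "state": "cancer"},
}
print(expression_breakdown(groups, expression, "tissue"))
```

The same function serves both processes: pass variant types as `groups` for 208 and transcription start position groups for 210, and change `viewpoint` to switch between tissue, disease state, and so on.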

In the display and GUI (Graphical User Interface) process 211, performed by the display processing unit 517, the display shown at 101 in FIG. 1 is produced for the one gene region (cluster) selected by the user, using a known splice variant comparison display method (Patent Document 1). However, because cDNA sequences belonging to one variant type share their splice sites but differ at both ends, each type is drawn using its most extended end points. The classification results of the expression information obtained in 208 for each type are displayed in 107 along axis 103, as shown by 110. To quantify a classification result along the expression intensity axis 108, the number of EST sequences belonging to that class is used, for example. If the expression information is biased across the full set of EST sequences, the bias is corrected by normalizing the number of EST sequences in one class of one cluster by the number of EST sequences in that class among all ESTs, and the result is used as the expression intensity. Similarly, the classification results of the expression information obtained in 210 for each transcription start position are displayed in 111 along axis 102, as shown by 113.
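The bias correction described above can be written as a ratio: the number of ESTs of a class within the group, divided by the number of ESTs of that class in the whole data set. A minimal sketch with hypothetical names and made-up counts:

```python
from typing import Dict


def normalized_intensity(counts_in_group: Dict[str, int],
                         totals_in_library: Dict[str, int]) -> Dict[str, float]:
    """Divide each class count by the number of ESTs of that class in the whole data set."""
    return {
        label: count / totals_in_library[label]
        for label, count in counts_in_group.items()
        if totals_in_library.get(label)  # skip classes with no (or zero) library total
    }


# Illustrative numbers only: liver ESTs are over-represented in this made-up library.
counts = {"liver": 30, "brain": 5}
totals = {"liver": 300_000, "brain": 10_000}
print(normalized_intensity(counts, totals))
```

With these invented numbers the raw counts suggest stronger liver expression, while the normalized values indicate the opposite, which is the kind of sampling bias the normalization is meant to remove.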

The display and GUI process 211 also reads the sequence annotation information 212, gathers the annotation information of the sequences belonging to each cluster, and displays it in list box 114 as supporting information that helps the user select a gene region (cluster).

A second embodiment of the present invention is described below with reference to the drawings.
FIG. 3 shows an example display of the results of splice variant analysis performed according to the present invention on a large number of cDNA sequences.

In FIG. 3, 101 to 113 are the same as in FIG. 1. 301 is a list box for selecting a cDNA sequence displayed in 101. Each element of the list shows a sequence name 302 and sequence annotation information 303. 304 is a marker pointing to the currently selected list box element. List box 301 displays the cDNA sequences that have been filtered, sorted, or selected according to the conditions entered in search window 305. The selection 304 is determined when the user views the contents of list box 301 displayed as a result of the input to search window 305 and, if necessary, reselects a list box item. 306 and 307 are markers that indicate the variant type to which the cDNA sequence selected by 304 belongs and, at the same time, its transcription start position and transcription termination position. 308 is a marker that points, within window 107 showing the expression information by type, to the expression information of the variant type to which the cDNA sequence selected by 304 belongs. 309 is a marker that points, within window 111 showing the expression information by transcription start position, to the expression information at the transcription start position of the cDNA sequence selected by 304.

A third embodiment of the present invention is described below with reference to the drawings.
FIG. 4 shows an example display of the results of splice variant analysis performed according to the present invention on a large number of cDNA sequences. 101, 103, and 106 to 113 are the same as in FIG. 1. 401 is a box representing one exon in one type. The exon-intron structure of a type is represented by arranging several boxes 401 along the genome sequence coordinate axis 402. The polyline 403 connecting boxes 401 represents an intron. However, the box for the exon at either end of a type represents the most extended of the exons at that end among the sequences belonging to the type. This is because cDNA sequences belonging to one type share their splice sites but differ in their end points.
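The rule that the terminal exons of a type are drawn at their most extended end points can be sketched as taking the minimum start and maximum end over all member sequences. The function name and the half-open interval convention are assumptions introduced for this illustration.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # half-open (start, end) genome interval; an assumed convention


def type_extent(member_exons: List[List[Exon]]) -> Tuple[int, int]:
    """Outer end points used to draw one variant type: the furthest-reaching member wins."""
    leftmost = min(start for exons in member_exons for start, _ in exons)
    rightmost = max(end for exons in member_exons for _, end in exons)
    return leftmost, rightmost


# Two members with the same splice sites (200 and 500) but different ends.
members = [
    [(100, 200), (500, 600)],
    [(150, 200), (500, 650)],
]
print(type_extent(members))
```

Only the outer boundaries of the first and last exons change; the internal splice site coordinates are shared by all members of the type and are drawn as they are.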

As described above, the present invention can be applied to the diagnosis of diseases that involve abnormal splice variants, to the evaluation of the efficacy and toxicity of drugs that cause abnormal splice variants, and to similar purposes.

FIG. 1 is a diagram showing a method of displaying, in an integrated manner, the classification of a large number of cDNA sequences into splice variants, the expression information for each class, and the expression information for each transcription start position.
FIG. 2 is a diagram showing the calculation procedure for producing the display shown in FIG. 1.
FIG. 3 is a diagram showing a second method of displaying, in an integrated manner, the classification of a large number of cDNA sequences into splice variants, the expression information for each class, and the expression information for each transcription start position.
FIG. 4 is a diagram showing a third method of displaying, in an integrated manner, the classification of a large number of cDNA sequences into splice variants, the expression information for each class, and the expression information for each transcription start position.
FIG. 5 is a schematic configuration diagram of the splice variant analysis and display device according to the invention.

Explanation of symbols

102: Genome sequence coordinates
104: Line segment representing an exon
107: Expression information for each type
111: Expression information for each transcription start position
114: List box for selecting a gene region
118: Search window
301: List box
305: Search window
401: Box representing an exon
403: Polyline representing an intron

Claims

A method for displaying splice variants by a processing unit that reads from a storage unit, and processes, a genome sequence, a plurality of cDNA sequences, and expression information for each cDNA sequence, characterized in that the processing unit performs:
a step of mapping each of the plurality of cDNA sequences read from the storage unit to the genome sequence read from the storage unit;
a clustering step of classifying cDNA sequences that share the same base on the genome sequence into one cluster, based on information, determined by the mapping, on the position on the genome sequence from which each cDNA sequence was transcribed;
a step of classifying, among the cDNA sequences classified into the cluster, cDNA sequences whose splice site coordinates on the genome sequence are identical or differ by no more than several bases into one variant type;
a step of classifying the cDNA sequences classified into the variant type based on the expression information; and
a step of displaying, on a display unit, the exon-intron structure of each variant type of the cluster and a breakdown of the expression information for the cDNA sequences classified into each variant type.

A method for displaying splice variants by a processing unit that reads from a storage unit, and processes, a genome sequence, a plurality of cDNA sequences, expression information for each cDNA sequence, and information on the sequence ends of each cDNA sequence, characterized in that the processing unit performs:
a step of mapping each of the plurality of cDNA sequences read from the storage unit to the genome sequence read from the storage unit;
a clustering step of classifying cDNA sequences that share the same base on the genome sequence into one cluster, based on information, determined by the mapping, on the position on the genome sequence from which each cDNA sequence was transcribed;
a step of classifying, among the cDNA sequences classified into the cluster, cDNA sequences whose splice site coordinates on the genome sequence are identical or differ by no more than several bases into one variant type;
a step of classifying the cDNA sequences classified into the variant type based on the expression information;
a step of classifying into one transcription start position the cDNA sequences whose transcription start position coordinates on the genome sequence, determined from the result of the mapping and the information on the sequence ends of the cDNA sequences, are identical or differ by no more than several bases;
a step of classifying the cDNA sequences classified into the transcription start position based on the expression information; and
a step of displaying, on a display unit, the exon-intron structure of each variant type of the cluster and a breakdown of the expression information for the cDNA sequences classified into each variant type in one-to-one correspondence, and of displaying, on the display unit, a breakdown of the expression information for the cDNA sequences classified into each transcription start position along the genome sequence coordinates used to display the exon-intron structure.