JP2825948B2

JP2825948B2 - Genome analysis processing equipment

Info

Publication number: JP2825948B2
Application number: JP2188364A
Authority: JP
Inventors: 洋文土居
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1990-07-17
Filing date: 1990-07-17
Publication date: 1998-11-18
Anticipated expiration: 2013-11-18
Also published as: JPH0475582A

Description

DETAILED DESCRIPTION OF THE INVENTION [Summary] The invention relates to a genome analysis processing device that, for a character sequence corresponding to the nucleic acid sequence defining a genome, finds the characteristics of character permutations extracted through a window. Its object is to find character permutations with a high appearance frequency in the character sequence defining a genome. The device is configured to compare the frequency with which a character permutation appears within a larger region of the genome (called the large-region appearance frequency in this specification) with the frequency with which that permutation (called a local character permutation in this specification) appears within a smaller (local) region, and thereby to extract the characteristics of the local character permutation.

[Industrial applications]

The present invention relates to a genome analysis processing device that, for a character sequence corresponding to the nucleic acids defining a genome, finds the characteristics of character permutations extracted through a window.

It is expected to be used, for example, to extract the characteristics of local regions on the genome of the AIDS virus.

[Conventional technology]

A genome is defined by the sequence of four nucleic acids: adenine (A), thymine (T), guanine (G), and cytosine (C). It has also been conventional practice to group adenine (A) and guanine (G) together as purine (u), and thymine (T) and cytosine (C) together as pyrimidine (y), and to study the genome using the sequence of these two characters.
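The two-character encoding described above can be sketched in Python as follows. The names `PURINE_PYRIMIDINE` and `to_purine_pyrimidine` are illustrative and do not come from the specification; the sketch assumes the input contains only the letters A, T, G, and C.

```python
# Two-letter encoding described in the text: purines -> 'u', pyrimidines -> 'y'.
# The table and function names are hypothetical, chosen for illustration.
PURINE_PYRIMIDINE = str.maketrans({"A": "u", "G": "u", "T": "y", "C": "y"})

def to_purine_pyrimidine(seq):
    """Rewrite an A/T/G/C string as a u/y string."""
    return seq.upper().translate(PURINE_PYRIMIDINE)

print(to_purine_pyrimidine("AGTCGA"))  # uuyyuu
```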

In studying such a sequence, a permutation of n characters (n being an integer of 1 or more) is extracted and the characteristics of the extracted character permutation are examined. Needless to say, when the characters corresponding to the four nucleic acids are used, there are 4^n possible character permutations of n characters, and when the two characters are used, there are 2^n.
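As a quick check on these counts, the following Python sketch enumerates every character permutation of length n over the four-letter and the two-letter alphabets. The function name `all_permutations` is an assumption made for illustration.

```python
from itertools import product

def all_permutations(alphabet, n):
    """Return every length-n string over the given alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=n)]

# Four nucleic-acid letters: 4**n strings of length n.
print(len(all_permutations("ATGC", 3)))  # 64
# Two letters (u = purine, y = pyrimidine): 2**n strings of length n.
print(len(all_permutations("uy", 3)))    # 8
```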

[Problems to be solved by the invention]

In processing these character permutations, it has now come to be considered worthwhile to compare the large-region appearance frequency, with which a character permutation appears in the whole genome or in a large part of it, with the frequency with which it appears in a local region. However, no device with an effective processing function for this purpose existed.

The object of the present invention is to find character permutations with a high appearance frequency in the character sequence defining a genome.

[Means for solving the problem]

FIG. 1 shows the principle of the present invention. In the figure, reference numeral 1 denotes a character permutation analyzer; 2, a given character sequence; 3, the window function unit of the invention, which in the illustrated case scans the character sequence 2 and takes in character permutations of n characters; 4, the grouping function unit of the invention; 5, the frequency processing function unit of the invention; 6, an output example; and 7, a character permutation.

The character permutation analyzer 1 includes at least the window function unit 3, the grouping function unit 4, and the frequency processing function unit 5.

The window function unit 3 can change the size of the window so that n takes any integer value of 1 or more.

The grouping function unit 4 gathers into one group those extracted character permutations that are identical under circular permutation. For example, when two characters (u and y) are used, a window of n = 3 yields eight character permutations: uuu, uuy, uyu, yuu, uyy, yuy, yyu, and yyy. Under circular permutation, however, uuy and uyu, for example, are the same, so the eight can be classified into four cyclic sets: [uuu], [uuy, uyu, yuu], [uyy, yyu, yuy], and [yyy].
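The grouping into cyclic sets can be illustrated with a short Python sketch. Picking the lexicographically smallest rotation as the key of each set is an implementation choice made here, not something the specification prescribes, and the function names are hypothetical.

```python
from itertools import product

def canonical_rotation(s):
    """Smallest rotation of s in lexicographic order; used as the key of its cyclic set."""
    return min(s[i:] + s[:i] for i in range(len(s)))

def cyclic_sets(alphabet, n):
    """Group all length-n strings over alphabet by rotation equivalence."""
    groups = {}
    for p in product(alphabet, repeat=n):
        s = "".join(p)
        groups.setdefault(canonical_rotation(s), []).append(s)
    return groups

for key, members in cyclic_sets("uy", 3).items():
    print(key, members)
# uuu ['uuu']
# uuy ['uuy', 'uyu', 'yuu']
# uyy ['uyy', 'yuy', 'yyu']
# yyy ['yyy']
```

For n = 3 over u and y this reproduces the four cyclic sets listed in the text.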

The frequency processing function unit 5 displays the appearance frequency of each character permutation 7 produced when the character sequence 2 is scanned with, for example, a window of n = 3, as a percentage of the whole.
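A minimal sketch of the window scan and percentage calculation is given below. It assumes the window advances one character at a time, so that windows overlap; the specification does not state the step size, and the function name `window_frequencies` is hypothetical.

```python
from collections import Counter

def window_frequencies(seq, n):
    """Percentage of each length-n substring among all window positions.

    Assumes the window slides by one character per step.
    """
    total = len(seq) - n + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + n] for i in range(total))
    return {perm: 100.0 * c / total for perm, c in counts.items()}

print(window_frequencies("uuuyuu", 3))
# {'uuu': 25.0, 'uuy': 25.0, 'uyu': 25.0, 'yuu': 25.0}
```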

[Operation]

The character permutation analyzer 1 scans the given character sequence 2 while changing the window size to n = 1, n = 2, n = 3, and so on, and takes in the character permutations 7. For a window of size n = 3, for example, it classifies the captured character permutations 7 into the cyclic sets [uuu], [uuy, uyu, yuu], [uyy, yyu, yuy], and [yyy] as described above, and displays the appearance frequency of each character permutation 7 as a percentage. The bar graph in output example 6 of FIG. 1 shows these percentages.
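The operation just described, scanning with several window sizes and arranging the percentages by cyclic set, might look like the following Python sketch. The input string is a toy example, the one-character step is an assumption, and the function names are hypothetical.

```python
from collections import Counter

def canonical_rotation(s):
    """Smallest rotation of s; used as the key of its cyclic set."""
    return min(s[i:] + s[:i] for i in range(len(s)))

def grouped_frequencies(seq, n):
    """Percent frequency of each length-n permutation, nested by cyclic set."""
    total = len(seq) - n + 1
    counts = Counter(seq[i:i + n] for i in range(max(total, 0)))
    grouped = {}
    for perm, c in sorted(counts.items()):
        grouped.setdefault(canonical_rotation(perm), {})[perm] = 100.0 * c / total
    return grouped

seq = "uuyuuyuu"  # toy input, not real genome data
for n in (1, 2, 3):
    print(n, grouped_frequencies(seq, n))
```

Each printed dictionary corresponds to one bar graph of the kind shown as output example 6, with one entry per cyclic set.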

The appearance frequency is the proportion of occurrences within the population formed by the given character sequence 2. In the case shown in FIG. 1, the appearance frequency of [uuu] is high and that of [yyy] is low.

[Example]

FIG. 2 shows the configuration of one embodiment of the present invention. In the figure, reference numerals 1-1 and 1-2 denote partial units of the character permutation analyzer. 2A denotes the character sequence of a local region, for which the characteristics of the local character permutations present in that region are to be extracted; 2B, the character sequence corresponding to the whole of the given genome or to a sufficiently large region within it; 3, the window function unit; 6A, the appearance frequency output for the local character permutations; and 6B, the large-region appearance frequency output. 6C denotes the feature display output for the local character permutations.

The character permutation analyzer 1 computes the large-region appearance frequencies, for example in advance, from the character sequence 2B corresponding to the given genome, and holds them. Output 6B in the figure represents the result. In this state, the appearance frequencies of the local character permutations are computed from the illustrated character sequence 2A of the local region. Output 6A in the figure represents the result.

Next, the character permutation analyzer 1 uses the partial unit 1-2 to compare outputs 6A and 6B. For the character permutation uuy, for example, the appearance frequency shown in output 6A is subtracted from the large-region appearance frequency shown in output 6B. The result of the subtraction is shown in output 6C.
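The comparison of outputs 6A and 6B can be sketched as a per-permutation difference of percentages. The sign convention used here (local minus large-region, so that a positive value means the permutation is more frequent locally) is an assumption chosen to match how output 6C is read in the description; swap the operands if the opposite convention is wanted. The frequency values are made up for illustration.

```python
def frequency_difference(local, large):
    """Local minus large-region percentage for every permutation seen in either input.

    The sign convention is an assumption of this sketch.
    """
    perms = set(local) | set(large)
    return {p: local.get(p, 0.0) - large.get(p, 0.0) for p in sorted(perms)}

large = {"uuu": 40.0, "uuy": 20.0, "uyu": 20.0, "yuu": 20.0}  # made-up values (output 6B)
local = {"uuu": 50.0, "uuy": 25.0, "uyu": 25.0}               # made-up values (output 6A)
print(frequency_difference(local, large))
# {'uuu': 10.0, 'uuy': 5.0, 'uyu': 5.0, 'yuu': -20.0}
```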

The same subtraction is performed for every character permutation, and the results are shown in output 6C. In the illustrated case, the character permutation uuu, for example, appears more frequently within the given local region than in the whole (that is, than within the character sequence 2B).

With regard to output 6C, the character permutation analyzer 1 regards a cyclic set as having a "positive feature" when all character permutations in the set have percentage values in the positive direction, and as having a "negative feature" when all have percentage values in the negative direction. If even one character permutation has a percentage value in a different direction from the others, or a percentage value of zero, the set is regarded as "not a feature".
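The three-way rule above can be written as a small Python function. The difference values below are invented for illustration, and treating a permutation that is missing from the table as zero is an assumption of this sketch.

```python
def classify_cyclic_set(members, diff):
    """Return 'positive', 'negative', or 'not a feature' for one cyclic set.

    diff maps each permutation to its percentage difference; a missing
    permutation is treated as 0.0 (an assumption of this sketch).
    """
    values = [diff.get(m, 0.0) for m in members]
    if not values:
        return "not a feature"
    if all(v > 0 for v in values):
        return "positive"
    if all(v < 0 for v in values):
        return "negative"
    return "not a feature"

diff = {"uuu": 3.0, "uuy": 1.0, "uyu": 0.5, "yuu": 2.0,
        "uyy": -1.0, "yuy": -0.5, "yyu": -2.0, "yyy": 0.0}  # made-up values
for members in (["uuu"], ["uuy", "uyu", "yuu"], ["uyy", "yuy", "yyu"], ["yyy"]):
    print(members, classify_cyclic_set(members, diff))
# ['uuu'] positive
# ['uuy', 'uyu', 'yuu'] positive
# ['uyy', 'yuy', 'yyu'] negative
# ['yyy'] not a feature
```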

In the case of output 6C, the cyclic sets [uuu] and [uuy, uyu, yuu] have positive features, the cyclic set [uyy, yyu, yuy] has a negative feature, and the cyclic set [yyy] is regarded as "not a feature".

FIG. 3 shows the configuration of another embodiment of the present invention. In the figure, reference numerals 1-1, 1-2, 2B, and 3 correspond to those in FIG. 2; 1-3 denotes a partial unit of the character permutation analyzer; 2A-1, 2A-2, and 2A-3 denote the character sequences of different local regions; 6C-1, 6C-2, and 6C-3 denote the feature display outputs for the local character permutations in those local regions; and 6D denotes the display output of the groups of character permutations having notable features in a larger local region.

In the case shown in FIG. 3, the feature display outputs extracted as in the embodiment of FIG. 2 are combined into the characteristic character permutations of a larger local region.

That is, the character sequences 2A-1, 2A-2, and 2A-3 of a plurality of mutually different local regions are each given. The character permutation analyzer 1 then operates the partial units 1-1 and 1-2 as described with reference to FIG. 2 and obtains the feature display outputs 6C-1, 6C-2, and 6C-3 of the local character permutations for each of the local regions.

Next, the character permutation analyzer 1 uses the partial unit 1-3 to extract the groups of character permutations having notable features. That is, from each of the feature display outputs 6C-1, 6C-2, and 6C-3 it extracts the local character permutations having the "positive feature" described above. The arrows in the figure mark those with a "positive feature". For the illustrated character sequence 2A-1, the cyclic sets [uuu] and [uuy, uyu, yuu] have positive features; for the character sequence 2A-2, the cyclic set [yyy] has a positive feature; and for the character sequence 2A-3, the cyclic set [uyy, yyu, yuy] has a positive feature. If the local regions giving the character sequences 2A-1, 2A-2, and 2A-3 are, for example, adjacent to one another, then for the "larger local region" that combines the three, the corresponding "groups of character permutations having notable features" are taken to be (i) [uuu] & [uuy, uyu, yuu], (ii) [yyy], and (iii) [uyy, yyu, yuy]. In this way, characteristic groups of character permutations can be extracted for a larger local region. The figure shows the groups as appearing clearly separated. If the individual local regions are too small, however, the groups may not separate as clearly as illustrated. In such a case, the individual local regions would be made larger.
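One way to picture the work of partial unit 1-3 is as a pass over the per-region labels that keeps only the cyclic sets marked positive. The sketch below is an interpretation under that assumption; the labels are made up to mirror the example of character sequences 2A-1, 2A-2, and 2A-3, and each cyclic set is keyed by one representative member.

```python
def notable_groups(region_labels):
    """Collect, for each local region in order, the cyclic sets labelled 'positive'."""
    return [
        [key for key, label in labels.items() if label == "positive"]
        for labels in region_labels
    ]

# Made-up labels for three adjacent local regions (2A-1, 2A-2, 2A-3).
regions = [
    {"uuu": "positive", "uuy": "positive", "uyy": "negative", "yyy": "not a feature"},
    {"uuu": "negative", "uuy": "not a feature", "uyy": "negative", "yyy": "positive"},
    {"uuu": "not a feature", "uuy": "negative", "uyy": "positive", "yyy": "negative"},
]
print(notable_groups(regions))
# [['uuu', 'uuy'], ['yyy'], ['uyy']]
```

The three inner lists correspond to groups (i), (ii), and (iii) in the text.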

Incidentally, the feature display outputs 6'C-1, 6'C-2, and 6'C-3 of the local character permutations shown in the middle row of FIG. 3 schematically illustrate the features of the cyclic sets [uu], [uy, yu], and [yy] corresponding to character permutations extracted with a window of size n = 2. They show that, for the cyclic sets obtained through the n = 2 window, the "groups of character permutations having notable features" described in connection with FIG. 3 could not be extracted.

FIGS. 4(a), (b), (c), (d), and (e) each show the appearance frequencies of character permutations in the AIDS virus genome.

FIG. 4(a) shows the appearance frequencies (which may be regarded as large-region appearance frequencies) of the character permutations extracted through a window of n = 3 from the character sequence of the AIDS virus genome written with the four nucleic acids adenine (A), thymine (T), guanine (G), and cytosine (C). As the figure shows, the character permutation AAA accounts for about 4.6% of the whole AIDS virus genome, while the appearance frequency of the character permutation CGT is very low.

FIG. 4(b) shows the appearance frequencies (large-region appearance frequencies) of the character permutations extracted through a window of n = 3 from the character sequence of the AIDS virus genome written with purine (u) and pyrimidine (y).

FIG. 4(c) likewise shows the appearance frequencies (large-region appearance frequencies) of the character permutations extracted through a window of n = 4.

FIG. 4(d) likewise shows the appearance frequencies (large-region appearance frequencies) of the character permutations extracted through a window of n = 5.

FIG. 4(e) likewise shows the appearance frequencies (large-region appearance frequencies) of the character permutations extracted through a window of n = 6.

Needless to say, each of FIGS. 4(a), (b), (c), (d), and (e) is a result obtained with the configuration shown in FIG. 1.

The AIDS virus genome basically has three genes: gag, pol, and env. Of these, gag and pol are conserved genes with a low mutation rate. In contrast, env contains a region with a high mutation rate (the high-mutation region) and a region with a low mutation rate. env encodes a protein called gp120 and a protein called gp41, and the high-mutation region lies within the region encoding gp120. The region encoding gp120 can therefore be divided into the high-mutation region and the remainder (the low-mutation region). The high-mutation region is a characteristic of the AIDS virus that makes tasks such as vaccine development difficult. The processing described in connection with FIG. 3 was applied to five regions (gag, pol, the high-mutation region of gp120, the low-mutation region of gp120, and gp41) to determine the cyclic sets that characterize the high-mutation region of gp120 and the length of the partial sequences characteristic of the AIDS virus genome.

FIGS. 5(a), (b), (c), and (d) show the processing results.

FIGS. 5(a), (b), (c), and (d) each show what features each cyclic set has in the five regions above. FIG. 5(a) shows the cyclic sets extracted through a window of n = 3. Similarly, FIG. 5(b) shows the cyclic sets extracted through a window of n = 4, FIG. 5(c) those extracted through a window of n = 5, and FIG. 5(d) those extracted through a window of n = 6.

In the high-mutation region of gp120, the cyclic sets [uuy] and [uuyuuy] have "positive features". In the low-mutation region of gp120, the cyclic sets [uyuy] and [uyuyuy] have "positive features". In gp41, the cyclic sets [uyyyy] and [uyyuyy] have "positive features". In gag, the cyclic sets [uuu], [uuuu], [uuuuu], [uuuyy], and [uuuuuu] have "positive features". In pol, the cyclic sets [uuu], [uuuu], [uuuuu], [uuuuy], and [uuuuuu] have "positive features". In particular, the cyclic set [uuyuuy, uyuuyu, yuuyuu] in the high-mutation region of gp120 is seen to be a characteristic cyclic set, and partial sequences (character permutations) of n = 6 are seen to be important for the AIDS virus genome.

The description above used u and y as the grouped characters, but in place of u and y the following may be used: (i) w, which represents adenine (A) and thymine (T) together, and (ii) x, which represents cytosine (C) and guanine (G) together. The symbols w and x carry no special meaning in themselves and may be regarded as mere labels.
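The alternative w/x grouping amounts to a different translation table applied before the same window, grouping, and frequency steps. A minimal Python sketch, with the hypothetical names `W_X` and `to_w_x` and assuming the input contains only A, T, G, and C:

```python
# Alternative two-letter encoding from the text: A and T -> 'w', C and G -> 'x'.
# The table and function names are hypothetical, chosen for illustration.
W_X = str.maketrans({"A": "w", "T": "w", "C": "x", "G": "x"})

def to_w_x(seq):
    """Rewrite an A/T/G/C string as a w/x string."""
    return seq.upper().translate(W_X)

print(to_w_x("AGTCGA"))  # wxwxxw
```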

[Effect of the invention]

As described above, the present invention makes it possible to find character permutations with a high appearance frequency in the character sequence defining a genome, or in a part of that sequence. It also makes it easy to find local character permutations that appear more frequently within a local region than in other regions, and furthermore allows groups of character permutations having notable features to be extracted for a larger local region.

In the description above, the character sequence of the AIDS virus genome, for example, was taken as the character sequence 2B of FIGS. 2 and 3, and a part of that sequence as the character sequence 2A of FIGS. 2 and 3. In general, however, it suffices that the character sequence 2A be contained in the character sequence 2B, and the character sequence 2B is not limited to one corresponding to the AIDS virus genome.

[Brief description of the drawings]

FIG. 1 is a diagram of the principle of the present invention; FIG. 2 shows the configuration of one embodiment of the present invention; FIG. 3 shows the configuration of another embodiment of the present invention; FIGS. 4(a), (b), (c), (d), and (e) each show the appearance frequencies of character permutations in the AIDS virus genome; and FIGS. 5(a), (b), (c), and (d) show processing results. In the figures, reference numeral 1 denotes the character permutation analyzer; 1-1, 1-2, and 1-3, partial units of the character permutation analyzer; 2, 2A, 2B, 2A-1, 2A-2, and 2A-3, character sequences; 3, the window function unit; 4, the grouping function unit; 5, the frequency processing function unit; 6, 6A, 6B, 6C, 6D, 6C-1, 6C-2, 6C-3, 6'C-1, 6'C-2, and 6'C-3, output examples; and 7, a character permutation.

Claims

(57) [Claims]

1. A genome analysis processing device having a character permutation analyzer (1) that receives information on a character sequence (2) defining a genome and extracts and processes information on character permutations (7) obtained by observing the character sequence (2) through a window of a predetermined size, wherein the character permutation analyzer (1) comprises: a window function unit (3) that sequentially scans and subdivides the character sequence (2) to extract the character permutations (7); a grouping function unit (4) that gathers into one group those extracted character permutations (7) that are identical under circular permutation; and a frequency processing function unit (5) that, for each extracted group, outputs in the form of a ratio the appearance frequency of the character permutations within the whole of the character sequence (2) and/or within a part of the character sequence (2); and wherein the character permutation analyzer (1) processes character permutations (7) of a plurality of sizes obtained by observation with the size of the window changed.

2. The genome analysis processing device according to claim 1, wherein the character permutation analyzer determines, for each character permutation (7), the large-region appearance frequency of that character permutation (7) within the whole of the given genome or within a large partial region of it, extracts the appearance frequency of each local character permutation within a smaller region of the given genome, and extracts the features of the local character permutations from the result of comparing the two appearance frequencies.

3. The genome analysis processing device according to claim 2, wherein the character permutation analyzer extracts the local character permutations having notable features at each of a plurality of local regions and extracts the characteristic character permutations of a larger local region.

4. The genome analysis processing device according to claim 1, wherein the character sequence (2) defining the genome is a character sequence corresponding to the four nucleic acids adenine (A), thymine (T), guanine (G), and cytosine (C), and the character permutation analyzer (1) extracts characteristic permutations of the sequence written with the four nucleic acids.

5. The genome analysis processing device according to claim 1, wherein the character sequence (2) defining the genome is a sequence of two characters, namely purine (u), which represents adenine (A) and guanine (G) together, and pyrimidine (y), which represents thymine (T) and cytosine (C) together, and the character permutation analyzer (1) extracts characteristic permutations of the nucleic acid sequence written with the two characters.

6. The genome analysis processing device according to claim 1, wherein the character sequence (2) defining the genome is a sequence of two characters, namely a character (here, w) that represents adenine (A) and thymine (T) together and a character (here, x) that represents cytosine (C) and guanine (G) together, and the character permutation analyzer (1) extracts characteristic permutations of the nucleic acid sequence written with the two characters.

7. The genome analysis processing device according to claim 2, wherein, for the character permutations gathered into one group under circular permutation, the device extracts the state in which all character permutations in the group show the same tendency relative to the large-region appearance frequency.