JP2007319016A

JP2007319016A - Method for specifying or classifying target bacterium or phage as specific genus, species or serum type

Info

Publication number: JP2007319016A
Application number: JP2006149797A
Authority: JP
Inventors: Arnold J Levine; ジェイ．リバインアーノルド; Harlan Robins; ロビンスハーラン
Original assignee: INST ADVANCED STUDY; INST FOR ADVANCED STUDY
Current assignee: INST ADVANCED STUDY; INST FOR ADVANCED STUDY
Priority date: 2006-05-30
Filing date: 2006-05-30
Publication date: 2007-12-13

Abstract

PROBLEM TO BE SOLVED: To provide a relative-entropy algorithm for genomic fingerprinting that captures host-phage similarity.

SOLUTION: The method classifies a target bacterium or phage by the following steps: (a) specifying a target genome sequence; (b) forming a randomized background genome derived from the target genome; (c) executing an iterative algorithm to identify the oligonucleotides that contribute most to the difference between the background genome and the target genome; and (d) comparing the oligonucleotide sequences selected in step (c) with known bacterial or phage sequences.

COPYRIGHT: (C)2008,JPO&INPIT

Description

A relative-entropy algorithm for genomic fingerprinting captures host-phage similarity.

Genomic analysis has uncovered many sequence differences among organisms. Both mononucleotide and dinucleotide content, as well as codon usage, vary widely among genomes (6). Even small bacterial genomes are large enough to provide the statistical power needed to determine a substantially richer set of sequence-based features that characterize each organism.

However, many of these features have remained obscure, particularly in coding regions, because of complex constraints. Each protein-coding gene encodes a specific protein, which constrains its possible nucleotide sequence. Because the genetic code is degenerate, this constraint still allows a vast number of possible DNA sequences for each gene. In addition, the overall codon usage in each gene is known to carry a strong biological signal, probably determined by the relative abundance of the corresponding tRNAs (5). To isolate new features within coding regions, these constraints must be factored out.

To solve this problem, the inventors create a background genome that shares exactly these constraints with the real genome but is otherwise random (4). The background genome encodes all of the same proteins and matches the codon usage exactly for each gene. The hidden features we are looking for are contained in the differences between the background genome and the real genome. The problem is thus reduced to extracting these differences.

The inventors incorporated information theory into an algorithm that systematically computes the strings of nucleotides (words) that are over- and underrepresented in the real genome relative to the background (see "Materials and Methods" for details). The main difficulty in finding these words is that they are not independent. For example, if the word ACGT is underrepresented, then ACGTA is also underrepresented, as are ACG and so on. The assumption is that only one of these words is biologically significant, while the others are merely along for the ride.

This problem extends to all words. Because the set of words of a given length is finite, as is the genome, the frequency of any one word affects the frequencies of all the others. The inventors adapted an iterative algorithm that uses an information-theoretic measure to select the word that contributes most to the difference between the real and background genomes. At each stage, this word is added to a list, and its effect is then factored out by rescaling the background genome. The result is a list of words, each of which contributes independently to the difference between the real and background genomes and is therefore likely to be biologically significant. The size of the genome determines the word length that can be resolved with statistical power. For a typical bacterium such as Escherichia coli, words of up to 7 nucleotides can conservatively be included. Because the amino acid order and codon usage of each gene are held fixed, the features uncovered by the algorithm are complementary to mononucleotide content and codon usage. For a typical bacterium, the algorithm finds 100 to 200 sequences between 2 and 7 nucleotides in length (Table 1). These previously hidden signals contain a wealth of biological information.

Materials and Methods
Relative entropy algorithm.

To find the most significantly over- and underrepresented words in the coding regions of a genome, we first created a randomized background genome for comparison with the real genome. This was done by randomly permuting, within every gene, the codons that correspond to each amino acid (4).
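The shuffling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the inventors' implementation: the names `shuffle_gene`, `split_codons` and `translate` are hypothetical, the gene is assumed to be an in-frame string over A, C, G, T, and the standard genetic code is used. Permuting codons only among positions that encode the same amino acid keeps both the protein and the per-gene codon usage fixed.

```python
import random
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption for this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(gene):
    """Split an in-frame coding sequence into its codons."""
    if len(gene) % 3 != 0:
        raise ValueError("gene length must be a multiple of 3")
    return [gene[i:i + 3] for i in range(0, len(gene), 3)]


def translate(gene):
    """Translate a coding sequence into a one-letter amino acid string."""
    return "".join(CODON_TO_AA[c] for c in split_codons(gene))


def shuffle_gene(gene, rng):
    """Permute synonymous codons among the positions of each amino acid."""
    codons = split_codons(gene)
    positions = defaultdict(list)  # amino acid -> codon positions
    for idx, codon in enumerate(codons):
        positions[CODON_TO_AA[codon]].append(idx)
    shuffled = list(codons)
    for idxs in positions.values():
        pool = [codons[i] for i in idxs]
        rng.shuffle(pool)  # same codons, new order
        for i, codon in zip(idxs, pool):
            shuffled[i] = codon
    return "".join(shuffled)


rng = random.Random(0)
gene = "ATGGCTGCAGCGGCCAAAAAGTAA"
background_gene = shuffle_gene(gene, rng)
```

Applying `shuffle_gene` to every gene, and repeating with different seeds, would give the set of background genomes described in the text.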

This produced new coding sequences that have the same amino acid content and codon usage per gene as the real genome but are otherwise random. We then counted the number of occurrences of each word w of length 2 to 7 in this randomized genome. (Our choice of 7 as the maximum word length was dictated by the total length of the coding sequences in the genome of interest; the average number of occurrences of each word should be > 0 for our algorithm to be robust.) The procedure for generating a random genome was repeated 30 times, at which point the standard deviation of the number of occurrences had converged for each word. We then computed the average background count N_B(w) for each word w. For reasons made clear below, we chose to determine N_B(w) by considering only words of length 7 and deriving the values for shorter words by counting substrings. Let L(w) be the length of the word w, and let C(W_7^i, w) be the number of times the string w is contained in the string W_7^i of length 7. For example, if w is AAC and W_7^257 is AACAAAC, then L(w) = 3 and C(W_7^257, w) = 2.

N_B(W_7^i) = (1/30) × (sum of the counts of W_7^i over all 30 background genomes)

Similarly, we let N_R(w) be the count of w in the real genome. In what follows, we work with frequencies (or, equivalently, probabilities) rather than counts, so we formed the frequency of each word using P_B(w) = N_B(w)/L and P_R(w) = N_R(w)/L, where L is the total length of the coding sequences.
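The counting conventions above can be illustrated with a short sketch. The function names are hypothetical, and two details are assumptions rather than statements from the text: frequencies are normalized by the number of 7-mers observed (close to L for long sequences), and a shorter word's frequency is obtained by weighting each occurrence inside a 7-mer by 1/(7 − L(w) + 1), the number of positions the word can occupy.

```python
from collections import Counter


def count_occurrences(long_word, w):
    """C(W, w): number of (possibly overlapping) occurrences of w in W."""
    return sum(long_word.startswith(w, i)
               for i in range(len(long_word) - len(w) + 1))


def kmer_frequencies(sequence, k=7):
    """Relative frequencies of the overlapping k-mers of a sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def word_frequency(p_long, w, k=7):
    """Frequency of a word w (len(w) <= k) derived from the k-mer distribution.

    Assumption: each occurrence is weighted by 1 / (k - len(w) + 1).
    """
    slots = k - len(w) + 1
    return sum(p * count_occurrences(W, w) for W, p in p_long.items()) / slots


# The worked example from the text: w = AAC inside the 7-mer AACAAAC.
example_count = count_occurrences("AACAAAC", "AAC")
```

Deriving every shorter word from the 7-mer distribution keeps all word frequencies mutually consistent, which matters once the 7-mer distribution is rescaled.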

The two frequency distributions P_B and P_R were the starting point for the word search algorithm. The algorithm consists of two steps repeated in sequence. In the first step, the word that most significantly separates the real distribution from the background distribution is selected, based on a measure of significance described below. In the second step, the background probability distribution is rescaled to factor out the difference attributable to the word found in step 1. These two steps are repeated a fixed number of times or until the background distribution is sufficiently close to the real one.

The Kullback-Leibler distance between the real and background probability distributions is given by

\[ D_{KL} = \sum_i P_R(W_7^i) \, \log \frac{P_R(W_7^i)}{P_B(W_7^i)}. \]

We then wanted a figure of merit that measures the extent to which any word w of length 2 to 7 contributes to D_KL. A natural measure is given by

\[ S(w) = P_R(w) \, \log \frac{P_R(w)}{P_B(w)} + \bigl(1 - P_R(w)\bigr) \, \log \frac{1 - P_R(w)}{1 - P_B(w)}. \]

This can also be regarded as the Kullback-Leibler distance between two probability distributions, namely the coarse-grained real and background distributions in which we know only whether a given word is w or not (12). In the first step of each iteration, we selected the word w of length 2 to 7 that maximizes the significance measure S(w).
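As a rough illustration of the selection step, the significance measure can be written as the Kullback-Leibler distance between two two-outcome distributions ("the word is w" versus "the word is not w"). The function name `significance` and the toy frequencies are assumptions made for this sketch; the background frequency is assumed to lie strictly between 0 and 1.

```python
import math


def _term(p, q):
    """p * log(p / q), with the usual convention that 0 * log(0) = 0."""
    return 0.0 if p == 0 else p * math.log(p / q)


def significance(p_real, p_back):
    """Two-outcome Kullback-Leibler distance for a single word."""
    return _term(p_real, p_back) + _term(1 - p_real, 1 - p_back)


# Hypothetical word frequencies (real, background) for three candidate words.
candidates = {"ACGT": (0.0010, 0.0030), "GGC": (0.020, 0.021), "TA": (0.05, 0.05)}
best_word = max(candidates, key=lambda w: significance(*candidates[w]))
```

The measure is zero when the two frequencies agree and grows with both over- and underrepresentation, so one maximization covers both directions of bias.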

The next step was to rescale the background distribution in a minimal way so that the contribution of w becomes the same in the real and background distributions. For the rescaling to be minimal, the ratios of the frequencies of length-7 words W_7^i that contain w the same number of times should not change; that is, we wanted to rescale all words W_7^i with the same C(W_7^i, w) by an equal factor. We therefore needed to work with a probability distribution of appropriate granularity. Our background distribution was defined on the set of words W_7^i of length 7, with probabilities P_B(W_7^i). We partitioned this set into disjoint subsets such that every element of a given subset contains the word w the same number of times. These sets are

\[ K_j(w) = \{\, W_7^i : C(W_7^i, w) = j \,\}, \]

with j ∈ J = {0, ..., 6} and

\[ P_B\bigl(K_j(w)\bigr) = \sum_{W_7^i \in K_j(w)} P_B(W_7^i), \qquad P_R\bigl(K_j(w)\bigr) = \sum_{W_7^i \in K_j(w)} P_R(W_7^i). \]

We wanted to rescale these disjoint subsets K_j(w) so that the probability of being in a given subset is equal in the real and background distributions.

Because these quantities are obtained by grouping elements of the old probability distributions (and adding their probabilities), they are well-defined probability distributions. The rescaling that factors out the contribution of w while preserving probability is given by

\[ P_B^{\mathrm{rescaled}}(W_7^i) = P_B(W_7^i) \, \frac{P_R\bigl(K_j(w)\bigr)}{P_B\bigl(K_j(w)\bigr)} \]

for all i such that W_7^i ∈ K_j(w). Note that with this rescaling the figure of merit for w is now S_rescaled(w) = 0, so the contribution of w to D_KL has been removed. We then repeated step 1 to obtain the next word w′, and so on.
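The rescaling step can be sketched on a toy distribution. This is an illustrative sketch, not the inventors' code: the function names are hypothetical, and words of length 3 stand in for the length-7 words of the text so that the numbers can be checked by hand. Every full-length word that contains w the same number of times is multiplied by the same factor, namely the ratio of the real to the background probability mass of its class.

```python
from collections import defaultdict


def count_occurrences(long_word, w):
    """Number of (possibly overlapping) occurrences of w in long_word."""
    return sum(long_word.startswith(w, i)
               for i in range(len(long_word) - len(w) + 1))


def rescale_background(p_back, p_real, w):
    """Rescale p_back so each occurrence class of w has the real class mass."""
    mass_real, mass_back = defaultdict(float), defaultdict(float)
    for W, p in p_real.items():
        mass_real[count_occurrences(W, w)] += p
    for W, p in p_back.items():
        mass_back[count_occurrences(W, w)] += p
    rescaled = {}
    for W, p in p_back.items():
        j = count_occurrences(W, w)
        # Words with zero background probability stay at zero.
        rescaled[W] = p * mass_real[j] / mass_back[j] if p > 0 else 0.0
    return rescaled


# Toy distributions over four words of length 3 (hypothetical numbers).
p_back = {"AAA": 0.25, "AAC": 0.25, "CAA": 0.25, "CCC": 0.25}
p_real = {"AAA": 0.40, "AAC": 0.30, "CAA": 0.20, "CCC": 0.10}
new_back = rescale_background(p_back, p_real, "AA")
```

In this toy case AAC and CAA both contain AA once, so they share one factor and keep their 1:1 ratio, which is the sense in which the rescaling is minimal.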

It is not difficult to see that in this iterative algorithm the background distribution P_B(W_7^i) converges to the real distribution P_R(W_7^i). This is because D_KL decreases monotonically (see below). It is well known that D_KL is nonnegative and is zero if and only if the two distributions are identical. Because the equations S(w) = 0 for all w also imply that the real and background distributions are identical, the algorithm does not stop before convergence is reached. Finally, D_KL cannot converge to a positive value, because we could then find a word that would decrease it below that value.
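The select-and-rescale iteration and its monotone decrease of D_KL can be tried on a toy alphabet. The sketch below is an assumption-laden illustration, not the patented implementation: all function names are hypothetical, full-length words have length k = 3 over the alphabet {A, C} instead of length 7 over four bases, shorter-word frequencies use the 1/(k − L(w) + 1) weighting, and all probabilities are assumed strictly positive.

```python
import math
from collections import defaultdict
from itertools import product


def occurrences(W, w):
    return sum(W.startswith(w, i) for i in range(len(W) - len(w) + 1))


def word_freq(dist, w, k):
    return sum(p * occurrences(W, w) for W, p in dist.items()) / (k - len(w) + 1)


def xlog(p, q):
    return 0.0 if p == 0 else p * math.log(p / q)


def kl(p_real, p_back):
    """Kullback-Leibler distance over the full-length words."""
    return sum(xlog(p, p_back[W]) for W, p in p_real.items())


def significance(p_real, p_back, w, k):
    r, b = word_freq(p_real, w, k), word_freq(p_back, w, k)
    return xlog(r, b) + xlog(1 - r, 1 - b)


def rescale(p_back, p_real, w):
    mass_r, mass_b = defaultdict(float), defaultdict(float)
    for W, p in p_real.items():
        mass_r[occurrences(W, w)] += p
    for W, p in p_back.items():
        mass_b[occurrences(W, w)] += p
    return {W: p * mass_r[occurrences(W, w)] / mass_b[occurrences(W, w)]
            for W, p in p_back.items()}


def find_words(p_real, p_back, k, alphabet, min_len=2, max_iter=100, tol=1e-12):
    """Return the selected words and the D_KL value after each iteration."""
    candidates = ["".join(t) for n in range(min_len, k + 1)
                  for t in product(alphabet, repeat=n)]
    words, trace = [], [kl(p_real, p_back)]
    for _ in range(max_iter):
        best = max(candidates, key=lambda w: significance(p_real, p_back, w, k))
        if significance(p_real, p_back, best, k) <= tol:
            break  # no remaining word separates the two distributions
        p_back = rescale(p_back, p_real, best)
        words.append(best)
        trace.append(kl(p_real, p_back))
    return words, trace


full_words = ["".join(t) for t in product("AC", repeat=3)]
weights = [5, 3, 2, 4, 1, 6, 2, 1]
p_real = {W: x / sum(weights) for W, x in zip(full_words, weights)}
p_back = {W: 1 / len(full_words) for W in full_words}
found, trace = find_words(p_real, p_back, k=3, alphabet="AC", max_iter=5)
```

The list `trace` holds D_KL before the first iteration and after each rescaling, so its entries should never increase.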

In applications, we had to conclude the iteration when it no longer added statistically significant words to the list. This cutoff is the point at which chance fluctuations, appropriately corrected for multiple hypotheses [the set of all words of length L(w)], would be likely to produce the most significant remaining word. The cutoff occurs when the selected word w satisfies

[equation not reproduced in the source text]

where Δ(w) is the standard deviation of the background count for w. For the applications in this document, we stopped after 100 iterations, which is substantially below the cutoff.

Proof that D_KL decreases monotonically under rescaling. Consider two probability distributions {p_j} and {q_j}, with j ∈ S and S the set of possible outcomes. The Kullback-Leibler distance is

\[ D = \sum_{j \in S} p_j \, \log \frac{p_j}{q_j}. \]

It is nonnegative and is zero only if the distributions are identical.

Consider a partition of S into r disjoint sets S_1, ..., S_r, i.e., S_k ∩ S_l = ∅ for k ≠ l and ∪_i S_i = S.

Next, define the coarse-grained probabilities

\[ P_i = \sum_{j \in S_i} p_j, \qquad Q_i = \sum_{j \in S_i} q_j. \]

Assuming that Q_i > 0 for all i, we note that both {P_i} and {Q_i} are themselves probability distributions.

Define the rescaled distribution

\[ q'_j = q_j \, \frac{P_i}{Q_i} \quad \text{for } j \in S_i. \]

The new Kullback-Leibler distance is then

\[ D' = \sum_{j \in S} p_j \, \log \frac{p_j}{q'_j} = D - \sum_{i=1}^{r} P_i \, \log \frac{P_i}{Q_i} \le D, \]

with equality only if P_i equals Q_i for all i.
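The monotonicity argument can be checked numerically on a small example. The numbers and function names below are hypothetical; the check relies on the identity that rescaling lowers the Kullback-Leibler distance by exactly the distance between the coarse-grained distributions, which is itself nonnegative.

```python
import math


def kl(p, q):
    """Kullback-Leibler distance between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def coarse_grain(dist, blocks):
    """Sum the probabilities inside each block of the partition."""
    return [sum(dist[j] for j in block) for block in blocks]


def rescale(q, p, blocks):
    """Multiply q_j by P_i / Q_i for every j in block S_i."""
    P, Q = coarse_grain(p, blocks), coarse_grain(q, blocks)
    out = list(q)
    for i, block in enumerate(blocks):
        for j in block:
            out[j] = q[j] * P[i] / Q[i]
    return out


p = [0.4, 0.3, 0.2, 0.1]           # "real" distribution
q = [0.25, 0.25, 0.25, 0.25]       # "background" distribution
blocks = [[0], [1, 2], [3]]        # partition S_1, S_2, S_3 of the outcomes

q_new = rescale(q, p, blocks)
d_old, d_new = kl(p, q), kl(p, q_new)
drop = kl(coarse_grain(p, blocks), coarse_grain(q, blocks))
```

Here `d_new` should equal `d_old - drop` up to rounding, and `q_new` still sums to 1.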

Score algorithm. To score a coding sequence S of length s against a genome G of length g, we first created a word list for G as described above, with the following modification. Words were added to the list only if they were significant for a sequence of length s. This significance was determined by rescaling the count and standard deviation of each word to the length s. We multiplied the count of each word in the background genome and in the real genome by s/g, giving the expected counts N_b and N_r for the sequence S. The standard deviation was rescaled by √(s/g) and denoted Δ^s. If a word satisfied |N_r − N_b| > 3 × Δ^s, it was included in the list; otherwise it was skipped. Because s is much smaller than g, this criterion was substantially more stringent than the multiple-hypothesis-corrected cutoff described above. The rest of the iterative procedure, including the rescaling of the background distribution, was the same as described above.
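The length-scaled significance test can be written in a few lines. The function name, argument names and example numbers are assumptions made for illustration; only the scaling by s/g, the scaling of the standard deviation by √(s/g), and the factor of 3 come from the text.

```python
import math


def significant_at_length(n_real, n_back, sigma_back, s, g, threshold=3.0):
    """Test |N_r - N_b| > threshold * delta_s after scaling from length g to s."""
    scale = s / g
    n_r = n_real * scale                         # expected real count in S
    n_b = n_back * scale                         # expected background count in S
    delta_s = sigma_back * math.sqrt(scale)      # scaled standard deviation
    return abs(n_r - n_b) > threshold * delta_s


# Hypothetical word: 1,200 real vs 1,000 background occurrences, sigma = 30,
# in a 4 Mb genome, tested for a 50 kb sequence and for the whole genome.
keep_for_slice = significant_at_length(1200, 1000, 30, 50_000, 4_000_000)
keep_for_genome = significant_at_length(1200, 1000, 30, 4_000_000, 4_000_000)
```

Because the count difference shrinks like s/g while the standard deviation shrinks only like √(s/g), a word that is significant for the whole genome can fail the test for a short slice.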

This new list L formed a scoring template with some number of words X. To obtain a score, we formed a background B for the sequence S by the same Monte Carlo shuffling procedure described above. We then carried out the following iterative algorithm. At each step, we took a word W from the ordered list L. We compared the counts of that word in the sequence S and in the background B and added 1 to the score only if the direction of the bias for W between S and B was the same as that for W between the genome G and its background, that is, only if W was overrepresented in both G and S, or underrepresented in both, relative to their respective backgrounds. We then rescaled B in the manner described above to factor out the effect of W and proceeded to the next step. After going through the entire list L, we had the number Y of the possible X words for which the genome and the sequence agreed. The final score was C × (X − Y/2)/√Y, where C is a constant. Each short sequence was scored against all 164 bacterial species in the NCBI database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome), which comprised 253 chromosomes.
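The direction-agreement count at the heart of the score can be sketched as below. This is a deliberately simplified illustration: the names and numbers are hypothetical, the template is assumed to store each word with the direction of its bias in genome G (+1 for overrepresented, −1 for underrepresented), and the rescaling of the background B between words is omitted, so the sketch covers only the comparison of directions. The final score formula is not computed here.

```python
def sign(x):
    """Return +1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def count_agreements(template, seq_counts, background_counts):
    """Return (X, Y): template size and number of words whose bias agrees.

    template: list of (word, direction) pairs, direction being +1 or -1.
    seq_counts / background_counts: word counts in S and in its background B.
    """
    agreements = 0
    for word, direction in template:
        bias = sign(seq_counts.get(word, 0) - background_counts.get(word, 0))
        if bias == direction:
            agreements += 1
    return len(template), agreements


template = [("ACG", +1), ("TTA", -1), ("GGC", +1)]
seq_counts = {"ACG": 12, "TTA": 9, "GGC": 4}
background_counts = {"ACG": 8.0, "TTA": 5.5, "GGC": 6.2}
x, y = count_agreements(template, seq_counts, background_counts)
```

In this toy case only ACG is biased in the same direction in the sequence and in the genome, so one of the three template words agrees.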

Metric for phylogenetic trees. The metric used 50-kb slices and the scoring method described above. The distance between two genomes A and B was calculated in three steps. First, all 50-kb slices of genome A were scored against the whole genome B, and the scores were averaged. The same process was repeated for the 50-kb slices of genome B scored against genome A. Second, the two averages were symmetrized. Finally, the symmetrized score was subtracted from the maximum possible score. Although it does not obey the triangle inequality, this distance has most of the properties of a metric: it is symmetric, positive definite, and zero only if A equals B. We used neighbor-joining clustering and the PHYLIP software package to output the tree (3).
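The three steps of the distance calculation can be sketched as follows. The function name and the example scores are hypothetical, and the text does not say how the two averages were symmetrized; the arithmetic mean used here is an assumption.

```python
def genome_distance(scores_a_vs_b, scores_b_vs_a, max_score):
    """Distance between genomes A and B from per-slice scores.

    scores_a_vs_b: scores of the 50-kb slices of A against genome B.
    scores_b_vs_a: scores of the 50-kb slices of B against genome A.
    """
    mean_ab = sum(scores_a_vs_b) / len(scores_a_vs_b)   # step 1, A against B
    mean_ba = sum(scores_b_vs_a) / len(scores_b_vs_a)   # step 1, B against A
    symmetric = (mean_ab + mean_ba) / 2                 # step 2 (assumed mean)
    return max_score - symmetric                        # step 3


d_ab = genome_distance([8.0, 6.0], [5.0, 7.0, 9.0], max_score=10.0)
d_ba = genome_distance([5.0, 7.0, 9.0], [8.0, 6.0], max_score=10.0)
```

A matrix of such pairwise distances is the kind of input a distance-based tree-building program expects.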

Results
The word lists are derived from whole bacterial genomes, but they correspond to features of the DNA sequence that are homogeneous throughout the genome. We confirmed this in two distinct ways. First, using E. coli as an example, we divided the genome in half and ran the algorithm independently on the two halves. The resulting lists were the same up to statistical fluctuations: for lists of 100 words, the top 80 words appeared in both lists. The process was repeated multiple times with different partitions, and the results were similar.

For our second check that the under- and overrepresentation of words is a local feature of the genome, we created a basic algorithm that scores a sequence of coding DNA on the basis of the word list derived from each genome. The algorithm takes as input a coding DNA sequence and a list of words and assigns the sequence a score based on the under- and overrepresentation of the words in the sequence (see "Materials and Methods"). The 253 bacterial chromosomes in the NCBI database that are longer than 100 kb were divided into 50-kb and 100-kb slices. These sequences were scored separately against all 164 species. Ninety-two percent of the 100-kb slices scored highest against their own species; with 50-kb sequences, 86% scored highest against their own species. This confirms that the words correspond to features that are homogeneous throughout each bacterial genome. Neither GC content nor codon usage has this property of homogeneity; both vary substantially within a single genome. Moreover, this result suggests that a scoring procedure based on these hidden words is a useful classifier of sequences. For example, sequences obtained from microorganisms in the Sargasso Sea, described by Venter et al. (9), could be compared with known bacteria without requiring homologous genes. (The best-known bacterial genome classifier is the oligonucleotide approach developed by Karlin and Cardon [6]. Even with our naive first version of the scoring algorithm, the results for 50 kb and 100 kb are slightly better than those of the most comprehensive oligonucleotide approach, which compares the frequencies of oligonucleotides of length up to 4. Our scoring was substantially better than that of the dinucleotide approach used by Venter et al. [9].)

Our approach is also well suited to studying the relationship between viruses and their hosts. Because viral DNA is copied and expressed inside the host, viruses and their hosts might be expected to share some evolutionary pressures. However, mononucleotide content and codon usage differ dramatically between host and phage. Some information can be obtained from oligonucleotide comparisons, but our scoring system described in "Materials and Methods" performs more than 60% better. Of the sequenced DNA phages on the NCBI website, 185 have a known primary host. Many of these phages are known or suspected to have multiple host species within the same genus. For this reason, we considered host targets at the genus level. The 164 species fall into 108 different genera. With our algorithm, the correct host genus received the highest score for 93 of the 185 phages, and 131 phages had the correct host among the top three scores (Table 2). For comparison, the best oligonucleotide scoring system identified the correct host genus for 58 of the 185 phages. Both codon usage and mononucleotide content are poor predictors of the phage host.

Restricting our analysis to double-stranded DNA (dsDNA) phages, which comprise most of the known phages, clearly improved our host predictions. Removing the 35 single-stranded DNA phages improved the scoring to 87/150, or 58%, for the top score and 123/150, or 82%, for the top three scores. Phages can be further stratified into temperate and lytic (1). For temperate dsDNA phages, which make up the majority of the phages sequenced so far, our host predictions were very good (93% in the top three and 70% for the top score). Lytic phages did not score as well, although still better than 50% in the top three, indicating that their DNA is not under the same evolutionary pressures as that of the host.

We adapted our scoring algorithm to define a distance between genomes (see "Materials and Methods"). Using hierarchical clustering on the distance matrix for the set of 164 bacterial species, we built a phylogenetic tree (FIG. 1a). This tree captures most of the standard bacterial taxonomy. For example, FIG. 1b shows that the enterobacteria are grouped into the same clade. This suggests that the properties encoded by the word lists are evolutionarily conserved. Because the distance is based on whole-genome properties, we avoid some of the common pitfalls in building phylogenetic trees, such as horizontal gene transfer. Moreover, this method allows new species to be added to the tree without requiring homologous genes or a large amount of sequenced genome.

Discussion
We have introduced an algorithm that finds more than 100 new signals in the coding regions of each bacterial genome. The applications presented include the use of these signals (words) as a classifier, the genomic connection between phages and their hosts, and the construction of phylogenetic trees. These are only a subset of the potential uses of the algorithm. Promising applications to eukaryotes include splice site detection, mRNA degradation or stabilization signals, tissue specificity, and host-virus relationships. Real exons carry overrepresented signals such as exonic splicing enhancers (2). Our algorithm can determine a comprehensive list of over- and underrepresented sequences in real exons, which could be used to separate real exons from confounding intronic sequences. As for mRNA stability, a few groups have measured decay constants for large numbers of mRNAs in a variety of organisms, including humans (8, 11). Half-lives range over two orders of magnitude, but the signals or structures that determine this difference in stability are unknown. If our algorithm were applied to the set of the 1,000 most rapidly decaying mRNAs and the set of the 1,000 most stable mRNAs, the differences between the two lists should provide a set of important signals. As for tissue specificity, it has been shown in the last few years that genes expressed primarily in different tissues have distinctive properties: their codon usage and GC content differ (7, 10). We should be able to find additional signals that distinguish tissues. These signals could provide information about the host tissue of a virus. Unlike codon usage and mononucleotide content, which are not shared by phages and their bacterial hosts (or by human viruses and their host tissues), our algorithm is a very good predictor of viral hosts.

The algorithm can also be used to help find transcription factor binding sites. From the DPInteract database (http://arep.med.harvard.edu/dpinteract/), we extracted the sets of known binding sites for the 13 transcription factors that have 15 or more binding sites listed for E. coli. For each set of binding sites, we determined a weight matrix that scores the binding motif. By running the weight matrices over the real E. coli genome and comparing the results with those for the background E. coli genome, we found that 12 of the 13 motifs are significantly (4 standard deviations) underrepresented in coding regions. This procedure could be used as a filter to determine whether a motif is real, which is of immediate utility because commonly used motif finders return an excess of signals that are not actual transcription factor binding motifs.

FIG. 1. (a) Phylogenetic tree for the 164 bacterial species found in the NCBI database. The triangle encloses the clade of enterobacteria. (b) Enlargement of the enterobacterial clade of the tree. Results are shown for Acinetobacter sp. strain ADP1, Nitrosomonas europaea, Erwinia carotovora, Escherichia coli, Salmonella enterica, Salmonella enterica serovar Typhi, Shigella flexneri, Photorhabdus luminescens, Yersinia pestis, Yersinia pseudotuberculosis, Idiomarina loihiensis, Shewanella oneidensis, Vibrio cholerae, Vibrio parahaemolyticus, and Vibrio vulnificus. The only enterobacterium missing from this group is Buchnera aphidicola.

For broader use of these signals, we anticipate the development of background models that explicitly represent real coding regions. Many bioinformatics problems involve searching for a long motif or sequence by comparing it against some reference. The difficulty in these problems is that there is no way to generate a background model that includes all of the underlying rules of a real genome. Our algorithm determines all of the short-range global rules. Building reference models that incorporate these rules should make a variety of difficult bioinformatics problems tractable.

Claims

A method for identifying or classifying a target bacterium or phage as a specific genus, species or serotype, the method comprising:
a) providing a genomic sequence of the target bacterium or phage;
b) providing a randomized background genome with the same amino acid content and codon usage per gene as the target genome;
c) running an iterative algorithm to identify oligonucleotide sequences of about 2 to about 7 nucleotides in length in the target genomic sequence that contribute most to the difference between the background and target genomes, wherein the iterative algorithm comprises:
i) selecting the oligonucleotide sequence that most significantly separates the target genome from the background genome;
ii) rescaling the background probability distribution to factor out the difference attributable to the oligonucleotide selected in step (i); and
iii) repeating step (i) until the background distribution approaches the target distribution; and
d) comparing the oligonucleotide sequences selected in step (c) with known bacterial or phage sequences.