201126365

VI. Description of the Invention:

[Technical Field of the Invention]

The present invention relates to a method for mining biclusters in data, and more particularly to a method for mining biclusters in DNA microarray data using a condition enumeration tree.

[Prior Art]

Clustering techniques that perform subspace clustering of genes and experimental conditions in gene microarrays have been shown to help in understanding gene function, gene regulation, cellular processes, and cell subtypes.
In most cases a disease involves multiple genes, so researchers continually try to find genes that show similar expression under certain conditions, as a basis for diagnosing the disease.

Among conventional techniques that support subspace clustering in gene microarrays, common methods such as pCluster and zCluster find subspace clusters in which certain genes show coherent expression under certain conditions. However, both of these conventional methods include very time-consuming steps, namely constructing the gene-pair maximum dimension sets and distributing the gene information to each node of a prefix tree. The conventional pCluster and zCluster subspace clustering methods for gene microarrays therefore have a complicated distribution process and require a large search space and a long search time. In addition, [...] needs to solve a problem called Maximal [...].

It is therefore necessary to provide a method for mining biclusters in DNA microarray data using a condition enumeration tree, so as to solve the above problems.

[Summary of the Invention]

The present invention provides a method for mining biclusters in DNA microarray data using a condition enumeration tree, wherein the DNA microarray is an array composed of a plurality of genes and a plurality of pieces of condition information for each gene, the genes have sequential gene numbers, and the condition information has sequential condition numbers. The method comprises the following steps: (a) computing a plurality of condition-pair maximum dimension sets and a condition count table, wherein each condition-pair maximum dimension set includes two pieces of condition information and a plurality of bit strings, and the condition count table includes the condition information, the number of condition-pair maximum dimension sets in which each piece of condition information appears, and the related condition set of each piece of condition information in the condition-pair maximum dimension sets; (b) deleting part of the condition information from the condition count table according to a minimum condition count; and (c) building a condition enumeration tree from the condition-pair maximum dimension sets and the pruned condition count table, using a local breadth-first and global depth-first method, so as to compute the condition set and the related gene bit string set of each node of the condition enumeration tree.

The method of the present invention does not need to construct the gene-pair maximum dimension sets of the conventional techniques, so it avoids the complicated distribution process, and it uses the concept of a hash join to find the biclustering results efficiently. In addition, the condition enumeration tree is built with a local breadth-first and global depth-first method. The method of the present invention can thereby avoid the complicated distribution process and greatly reduce the search space and the search time.
Furthermore, the method of the present invention can be applied to a biological database analysis website hosted on the World Wide Web (WWW). Users can perform cluster analysis on gene microarray data through the method of the present invention, and research organizations (for example, biomedical technology companies that study gene microarrays) can use the clustering results to help understand the relationships between genes.

[Embodiment]

The method of the present invention for mining biclusters in DNA microarray data using a condition enumeration tree is described below with reference to the drawings. Fig. 1 shows a schematic diagram of the DNA microarray data of an embodiment of the present invention. In this embodiment, the DNA microarray is an array composed of a plurality of genes and a plurality of pieces of condition information for each gene; the genes have sequential gene numbers, and the condition information has sequential condition numbers.

First, the method computes a plurality of condition-pair maximum dimension sets and a condition count table. Each condition-pair maximum dimension set includes two pieces of condition information and a plurality of bit strings. The condition count table includes the condition information, the number of condition-pair maximum dimension sets in which each piece of condition information appears, and the related condition set of each piece of condition information in the condition-pair maximum dimension sets.
The step of computing the condition-pair maximum dimension sets first computes, for each numbered gene, the difference between its condition information under each pair of condition numbers. Fig. 2 shows the differences computed for each numbered gene under the two condition numbers a and b. The differences are then sorted in ascending order, and the corresponding genes are arranged according to that order, as shown in Fig. 3.

Next, the sorted differences are grouped according to a tolerance error value, and each group is represented as a bit string according to the gene numbers of its members. Referring to Fig. 4, the genes in the group whose difference between conditions a and b is 1 are gene numbers 0, 2, 4 and 6, so bits 0, 2, 4 and 6 of the corresponding bit string are 1 and the other bits are 0. The genes in the group whose difference between conditions a and b is 10 are gene numbers 1, 3 and 5, so bits 1, 3 and 5 of the corresponding bit string are 1 and the other bits are 0.

Then, according to a minimum gene count (NR), some of the bit strings are deleted to obtain the condition-pair maximum dimension sets. Each condition-pair maximum dimension set includes a condition pair and a maximum dimension set; each condition pair consists of the two pieces of condition information whose difference was computed, and each maximum dimension set consists of the remaining bit strings. In this embodiment NR is 3: if the number of 1 bits in a bit string is less than 3, the bit string is deleted. Neither of the two bit strings above (1010101 and 0101010) has fewer than three 1 bits, so the condition-pair maximum dimension set for conditions a and b includes the two pieces of condition information (a and b) and the two bit strings (1010101 and 0101010).

The condition-pair maximum dimension sets of all pairs of condition information in this embodiment are computed by the above method, and the condition count table is computed from these condition-pair maximum dimension sets, as shown in Fig. 5.
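The computation of one condition-pair maximum dimension set described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `pair_mds`, the expression values in `data`, and the greedy grouping rule (a new group starts when a difference exceeds the first difference of the current group by more than the tolerance) are all assumptions, since the text only says that the sorted differences are grouped by a tolerance error value.

```python
def pair_mds(data, c1, c2, delta, nr):
    """Return the bit strings for the condition pair (c1, c2).

    data[g][c] is the value of gene g under condition c (hypothetical layout).
    delta is the tolerance error value, nr the minimum gene count (NR).
    """
    n_genes = len(data)
    # Difference of every gene under the two conditions, sorted ascending.
    diffs = sorted((row[c1] - row[c2], g) for g, row in enumerate(data))

    # Assumed grouping rule: greedy windows of width delta over the sorted list.
    groups, current, start = [], [], None
    for d, g in diffs:
        if current and d - start > delta:
            groups.append(current)
            current = []
        if not current:
            start = d
        current.append(g)
    if current:
        groups.append(current)

    # One bit string per group; bit i (counted from the left) marks gene i.
    bit_strings = []
    for group in groups:
        if len(group) < nr:          # fewer than NR genes: delete the bit string
            continue
        bits = ["0"] * n_genes
        for g in group:
            bits[g] = "1"
        bit_strings.append("".join(bits))
    return bit_strings


# Hypothetical values chosen so that a - b is 1 for genes 0, 2, 4, 6
# and 10 for genes 1, 3, 5, as in the Fig. 4 example.
data = [[11, 10], [25, 15], [31, 30], [40, 30], [6, 5], [20, 10], [8, 7]]
print(pair_mds(data, 0, 1, delta=0, nr=3))  # ['1010101', '0101010']
```

Running the same function over every pair of condition numbers would give the full collection of condition-pair maximum dimension sets used in the following steps.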
According to the condition pairs of the condition-pair maximum dimension sets, the number of condition-pair maximum dimension sets in which each piece of condition information appears is counted, and the other piece of condition information in each such pair is added to its related condition set. The condition count table therefore includes the condition information, the number of condition-pair maximum dimension sets in which each piece of condition information appears, and its related condition set. For example, condition information a appears in 4 condition-pair maximum dimension sets, and its related condition set is b, c, d, e.

Next, part of the condition information is deleted from the condition count table according to a minimum condition count (NC). If the number of condition-pair maximum dimension sets in which a piece of condition information appears is less than NC minus 1, that condition information is deleted, and it is also removed from the related condition sets of the other condition information. Fig. 6 shows the deletion of part of the condition information from the condition count table. In this embodiment NC is 3. Because the number of condition-pair maximum dimension sets in which condition information d appears is less than NC minus 1 (that is, 2), condition information d is deleted and is also removed from the related condition sets of the other condition information; for example, d is removed from the related condition set of condition information a.

Next, according to the condition-pair maximum dimension sets and the pruned condition count table, a condition enumeration tree is built using a local breadth-first and global depth-first method, so as to compute the condition set and the related gene bit string set of each node of the condition enumeration tree. Fig. 7 shows the condition enumeration tree built from the condition-pair maximum dimension sets and the pruned condition count table.
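The construction and pruning of the condition count table can be sketched as below. The names `build_count_table` and `prune_count_table`, the dictionary layout, and the set of pairs in the example are assumptions made for illustration. The sketch performs a single pruning pass and leaves the stored counts unchanged, because the text does not say whether counts are updated or re-checked after a deletion.

```python
def build_count_table(pair_sets):
    """pair_sets maps a condition pair such as ('a', 'b') to its bit strings."""
    table = {}
    for (x, y), bit_strings in pair_sets.items():
        if not bit_strings:          # pair has no maximum dimension set
            continue
        for cond, other in ((x, y), (y, x)):
            entry = table.setdefault(cond, {"count": 0, "related": set()})
            entry["count"] += 1      # one more pair in which cond appears
            entry["related"].add(other)
    return table


def prune_count_table(table, nc):
    """Delete conditions that appear in fewer than NC - 1 pairs."""
    removed = {c for c, e in table.items() if e["count"] < nc - 1}
    return {
        c: {"count": e["count"], "related": e["related"] - removed}
        for c, e in table.items()
        if c not in removed
    }


# Hypothetical pairs: d only pairs with a, so it appears once.
pairs = {p: ["1010101"] for p in
         [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"),
          ("b", "c"), ("b", "e"), ("c", "e")]}
table = build_count_table(pairs)
pruned = prune_count_table(table, nc=3)
print(sorted(pruned))                  # ['a', 'b', 'c', 'e']
print(sorted(pruned["a"]["related"]))  # ['b', 'c', 'e']
```

The conditions that survive pruning become the first-level nodes of the condition enumeration tree.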
Figs. 8 and 9 show the steps of building the condition enumeration tree with the local breadth-first and global depth-first method, and the levels of the tree. The construction proceeds as follows. First, the condition information of the condition count table forms the first-level nodes of the condition enumeration tree (step 1 in Figs. 8 and 9). Next, one node is selected as the join base (join-base), and the other nodes are placed in the others list (others). The join base is paired with each of the other nodes in the others list to produce at least one expansion node (expansion); each expansion node has at least two pieces of condition information and is placed one level below the selected node. As shown in step 2 of Figs. 8 and 9, for example, a is the join base and the other nodes (b, c, e) are placed in the others list. The join base is paired with the other nodes to produce three expansion nodes (ab, ac, ae); each expansion node has two pieces of condition information, and the expansion nodes (ab, ac, ae) are placed one level below the selected node (a), that is, at the second level (level 2).

One of the expansion nodes is then selected as the join base, and the other expansion nodes are placed in the others list. The join base is paired with each of the other expansion nodes to produce at least one expansion node, which is placed one level below the selected expansion node. As shown in step 3 of Figs. 8 and 9, for example, ab is selected as the join base and the other expansion nodes (ac, ae) are placed in the others list. The join base is paired with the other expansion nodes to produce two expansion nodes (abc, abe); each of these has three pieces of condition information, and the two expansion nodes (abc, abe) are placed one level below the selected expansion node (ab), that is, at the third level (level 3).

Referring to Figs. 8 and 9, the above steps are repeated to build every node of the condition enumeration tree, and the condition information of each node forms its condition set.

Next, the related gene bit string set of each node of the condition enumeration tree is computed from the condition information in the condition set of the node and the bit strings of the corresponding condition-pair maximum dimension sets. In this embodiment, the related gene bit string set of a node can be computed by applying a cross AND operation (⊗) to the bit strings of the corresponding condition-pair maximum dimension sets.
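The order in which nodes are created by the local breadth-first and global depth-first construction can be sketched as follows. The function name `enumerate_tree` and the representation of a node as a string of condition letters are assumptions. The sketch covers only the structural enumeration; it does not compute the bit string sets and does not skip any node.

```python
def enumerate_tree(level1):
    """Return the expansion nodes in creation order.

    level1 is the ordered list of first-level conditions, e.g. "abce".
    """
    created = []

    def expand(siblings):
        for i, base in enumerate(siblings):       # base is the join base
            others = siblings[i + 1:]             # the others list
            # Local breadth-first: create every expansion of this join base.
            children = [base + other[-1] for other in others]
            created.extend(children)
            # Global depth-first: descend before moving to the next join base.
            expand(children)

    expand(list(level1))
    return created


print(enumerate_tree("abce"))
# ['ab', 'ac', 'ae', 'abc', 'abe', 'abce', 'ace', 'bc', 'be', 'bce', 'ce']
```

The first five entries match steps 2 and 3 of the description: ab, ac and ae are created under a, then abc and abe under ab, before any node under b is considered.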
Referring to the condition-pair maximum dimension sets of Fig. 5 together with Fig. 10, and taking node abc as an example, the related gene bit string set of node abc can be expressed as $GS_{abc} = GS_{ab} \otimes GS_{ac} \otimes GS_{bc}$, where $GS_{ab}$ is the bit strings of the condition-pair maximum dimension set of ab (1010101, 0101010), $GS_{ac}$ is the bit strings of the condition-pair maximum dimension set of ac (1010100, 0101010), and $GS_{bc}$ is the bit strings of the condition-pair maximum dimension set of bc (1010100, 0101010). Applying the cross AND operation (⊗) to $GS_{ab}$, $GS_{ac}$ and $GS_{bc}$ gives the related gene bit string set of node abc, $GS_{abc}$ = (1010100, 0101010). The related gene bit string set $GS_{abe}$ of node abe can be computed in the same way.

Because the method of the present invention builds the condition enumeration tree with the local breadth-first and global depth-first method, another way of computing the related gene bit string set of a node is available. Suppose the condition set of a node contains n pieces of condition information. The related gene bit string set of the node is computed by applying the cross AND operation to three sets of bit strings: those for the first through (n-1)th pieces of condition information, those for the first through (n-2)th pieces together with the nth piece, and those of the condition-pair maximum dimension set of the (n-1)th and nth pieces.

As shown in Fig. 11, taking node abcd as an example, the related gene bit string set of node abcd can be expressed as $GS_{abcd} = GS_{abc} \otimes GS_{abd} \otimes GS_{cd}$. Because the condition enumeration tree is built with the local breadth-first and global depth-first method, nodes abc and abd have already been built before node abcd is built, so the above method can be used. However many pieces of condition information a node has, the number of cross AND operations is fixed at two, which reduces the computation time and complexity.

A conventional cross AND operation must combine every bit string of one set with every bit string of the other. Referring to Fig. 12, set A contains four bit strings and set B contains four bit strings, so a cross AND operation between set A and set B requires 16 AND operations. If the similar bit strings in set A are grouped, the similar bit strings in set B are grouped, and the cross AND operation is applied only between similar groups of the two sets, as shown in Fig. 13, only 8 AND operations are needed.
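The cross AND operation for the node abc example can be sketched as below, using the bit strings given in the text. The helper names `cross_and` and `to_bits` are illustrative. Dropping results that have fewer than NR one-bits and skipping duplicate results are assumptions; the text does not spell out how the results of the cross AND are filtered.

```python
def cross_and(set_a, set_b, nr):
    """AND every bit string of set_a with every bit string of set_b."""
    result = []
    for x in set_a:
        for y in set_b:
            z = x & y
            # Assumed filter: keep results with at least nr one-bits, once each.
            if bin(z).count("1") >= nr and z not in result:
                result.append(z)
    return result


def to_bits(strings):
    """Parse bit strings such as '1010101' into integers."""
    return [int(s, 2) for s in strings]


gs_ab = to_bits(["1010101", "0101010"])
gs_ac = to_bits(["1010100", "0101010"])
gs_bc = to_bits(["1010100", "0101010"])

# GS_abc = GS_ab (x) GS_ac (x) GS_bc, with NR = 3 as in the embodiment.
gs_abc = cross_and(cross_and(gs_ab, gs_ac, 3), gs_bc, 3)
print([format(z, "07b") for z in gs_abc])  # ['1010100', '0101010']
```

Each call to `cross_and` performs len(set_a) × len(set_b) AND operations, which is the cost that the grouping of similar bit strings is meant to reduce.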
Using the above idea of similar groups, the present invention can compute the related gene bit string set of a node of the condition enumeration tree in another way. The bit strings of the condition-pair maximum dimension sets for the corresponding condition information are grouped to build a signature table. The signature table has at least one signature bit string (Sig) and at least one bit string set (BSS). The signature bit strings of the signature tables of two corresponding pieces of condition information are ANDed; if the number of 1 bits in the result is greater than the minimum gene count, the cross AND operation is applied to the bit strings in the bit string sets of the two signature tables to compute the related gene bit string set of the node.

Referring to Fig. 14, suppose a node X has three bit strings (1010101, 1010010, 0101010) in the condition-pair maximum dimension sets. The three bit strings are grouped to build a signature table. The first bit string (1010101) is placed in the signature bit string (Sig) and the bit string set (BSS) of the first group. The second bit string (1010010) is ORed with the signature bit string (Sig, that is, the first bit string 1010101), and the number of 1 bits in the signature bit string (Sig) is subtracted from the number of 1 bits in the OR result. If the difference is less than or equal to a threshold T (for example, T is 1), the second bit string belongs to the same group: the OR result replaces the signature bit string (Sig), and the second bit string (1010010) is added to the bit string set (BSS).

Next, the third bit string (0101010) is ORed with the signature bit string (Sig, now 1010111), and the number of 1 bits in the signature bit string (Sig) is subtracted from the number of 1 bits in the OR result. The difference is greater than the threshold, so the third bit string does not belong to the same group as the first and second bit strings. The third bit string therefore becomes the signature bit string (Sig) of another group and is placed in the bit string set (BSS) of that group.

Fig. 15 shows two signature tables after grouping. When the two signature tables are combined, the first signature bit string (1010101) of the first signature table S1 is ANDed with the first signature bit string (1010…) of the second signature table. If the number of 1 bits in the result is greater than the minimum gene count (NR), the two groups are similar, and the cross AND operation is applied to the bit strings in the corresponding bit string sets of the first and second signature tables. This grouping method reduces the number of AND operations and therefore the computation time and complexity.

When building the condition enumeration tree, the method of the present invention further includes a first restriction step: if the maximum number of pieces of condition information of all nodes under a node is less than the minimum condition count, the nodes under that node need not be built. Referring to Fig. 16, the maximum number of pieces of condition information of all nodes under node c is 2, which is less than the minimum condition count (3) of this embodiment, so node ce under node c need not be built.
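The signature-table grouping illustrated with Fig. 14 can be sketched as follows. The function name `build_signature_table` and the list-based table layout are assumptions, and so is the first-fit rule for choosing among several existing groups: the example in the text only shows a bit string being compared against a single existing group.

```python
def popcount(x):
    """Number of 1 bits in x."""
    return bin(x).count("1")


def build_signature_table(bit_strings, t):
    """Group similar bit strings; each entry is [Sig, BSS]."""
    table = []
    for bs in bit_strings:
        for entry in table:
            merged = entry[0] | bs                 # OR with the signature
            # Same group if the OR adds at most t new 1 bits to the signature.
            if popcount(merged) - popcount(entry[0]) <= t:
                entry[0] = merged                  # OR result replaces Sig
                entry[1].append(bs)                # bit string joins the BSS
                break
        else:
            table.append([bs, [bs]])               # start a new group
    return table


table = build_signature_table([0b1010101, 0b1010010, 0b0101010], t=1)
for sig, bss in table:
    print(format(sig, "07b"), [format(b, "07b") for b in bss])
# 1010111 ['1010101', '1010010']
# 0101010 ['0101010']
```

With T = 1 the second bit string adds one new 1 bit to the signature and joins the first group, while the third adds two and starts its own group, as in the Fig. 14 walk-through.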
The method of the present invention further includes a second restriction step: if all the condition information of a first judgment node includes all the condition information of a second judgment node, and the related gene bit string set of the first judgment node is equal to the related gene bit string set of the second judgment node, the nodes under the second judgment node need not be built. Referring to Fig. 17, if all the condition information (abce) of the first judgment node Y includes all the condition information (ac) of the second judgment node X, and the related gene bit string set of the first judgment node Y (1010100, 0101010) is equal to the related gene bit string set of the second judgment node X (1010100, 0101010), node ace under the second judgment node X need not be built.

The method of the present invention further includes a third restriction step. Suppose all the condition information of a first judgment node includes all the condition information of all nodes under a second judgment node, and the related gene bit string set of the first judgment node is not equal to the related gene bit string set of the second judgment node. If the related gene bit string set of the first judgment node is nevertheless equal to the related gene bit string set of a third judgment node, and all the condition information of all nodes under the second judgment node includes all the condition information of the third judgment node, the nodes under the second judgment node need not be built.
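The check made by the second restriction step can be sketched as below, using the values of the Fig. 17 example. The function name `covered_by_existing` and the dictionary that maps an already built node to its bit strings are assumptions made for illustration; the third restriction step is not covered by this sketch.

```python
def covered_by_existing(existing, conds, gs):
    """Second restriction step.

    existing maps the condition string of an already built node to its
    related gene bit string set; conds and gs describe the candidate node.
    Returns True when the nodes under the candidate need not be built.
    """
    for y_conds, y_gs in existing.items():
        includes_all = set(conds) <= set(y_conds)   # Y contains all of X
        same_genes = set(gs) == set(y_gs)           # GS_Y equals GS_X
        if includes_all and same_genes:
            return True
    return False


# Fig. 17 example: Y = abce is already built, X = ac is the candidate.
existing = {"abce": ["1010100", "0101010"]}
print(covered_by_existing(existing, "ac", ["1010100", "0101010"]))  # True
print(covered_by_existing(existing, "ac", ["1111111"]))             # False
```

A `True` result means that expanding X (here, building node ace) would not yield a bicluster that Y does not already represent.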
Referring to Fig. 18, suppose all the condition information (abce) of the first judgment node Y includes all the condition information (ace) of all nodes under the second judgment node X, and the related gene bit string set of the first judgment node Y (1010100, 0101010) is not equal to the related gene bit string set of the second judgment node X (1111111). If the related gene bit string set of the first judgment node Y (1010100, 0101010) is equal to the related gene bit string set of a third judgment node Z (1010100, 0101010), and all the condition information (ace) of all nodes under the second judgment node X includes all the condition information (ae) of the third judgment node Z, node ace under the second judgment node X need not be built.

The method of the present invention for mining biclusters in DNA microarray data using a condition enumeration tree does not need to construct the gene-pair maximum dimension sets of the conventional techniques, so it avoids the complicated distribution process, and it uses the concept of a hash join to find the biclustering results efficiently. In addition, the condition enumeration tree is built with a local breadth-first and global depth-first method. The method of the present invention can thereby avoid the complicated distribution process and greatly reduce the search space and the search time.

Furthermore, the method of the present invention can be applied to a biological database analysis website hosted on the World Wide Web (WWW). Users can perform cluster analysis on gene microarray data through the method of the present invention, and research organizations (for example, biomedical technology companies that study gene microarrays) can use the clustering results to help understand the relationships between genes.

The above embodiments merely illustrate the principles and effects of the present invention and do not limit the present invention. Modifications and variations made to the above embodiments by persons skilled in the art therefore do not depart from the spirit of the present invention. The scope of rights of the present invention shall be as set out in the claims below.

[Brief Description of the Drawings]

Fig. 1 shows a schematic diagram of the DNA microarray data of an embodiment of the present invention;
Fig. 2 shows a schematic diagram of computing the difference of the condition information of each numbered gene under the two condition numbers a and b;
Fig. 3 shows a schematic diagram of sorting the differences in ascending order;
Fig. 4 shows a schematic diagram of generating the corresponding bit strings from the sorted differences;
Fig. 5 shows a schematic diagram of computing the condition count table from the condition-pair maximum dimension sets;
Fig. 6 shows a schematic diagram of deleting part of the condition information from the condition count table;
Fig. 7 shows a schematic diagram of building the condition enumeration tree from the condition-pair maximum dimension sets and the pruned condition count table;
Figs. 8 and 9 show schematic diagrams of the steps of building the condition enumeration tree with the local breadth-first and global depth-first method, and of its levels;
Fig. 10 shows a schematic diagram of computing a related gene bit string set with the cross AND operation;
Fig. 11 shows a schematic diagram of computing a related gene bit string set with two cross AND operations according to the present invention;
Fig. 12 shows a schematic diagram of the conventional cross AND operation over all bit strings in the sets;
Fig. 13 shows a schematic diagram of the cross AND operation over the bit strings of similar groups in the sets;
Fig. 14 shows a schematic diagram of building the signature table of a node X by grouping;