TW201126365A

TW201126365A - Method for mining biclusters from DNA microarray data using condition-enumeration tree

Info

Publication number: TW201126365A
Application number: TW99102695A
Authority: TW
Inventors: Ye-In Chang; Jiun-Rung Chen
Original assignee: Univ Nat Sun Yat Sen
Priority date: 2010-01-29
Filing date: 2010-01-29
Publication date: 2011-08-01
Also published as: TWI430128B

Abstract

The invention relates to a method for mining biclusters from DNA microarray data using a condition-enumeration tree. First, a plurality of condition-pair maximum dimension sets (MDSs) and a condition count table are calculated. Then, according to the condition-pair MDSs and the condition count table, the condition-enumeration tree is expanded in a local breadth-first within global depth-first manner to efficiently find all pClusters. The complicated distribution process is thereby simplified, and the search space and search time are reduced. Furthermore, users can apply the method of the invention to cluster DNA microarray data from biological databases on the World Wide Web (WWW), and can use the clustering results to understand the relationships between genes.

Description

VI. Description of the Invention:

[Technical Field of the Invention] The present invention relates to a method for mining biclusters from data, and more particularly to a method for mining biclusters from DNA microarray data using a condition-enumeration tree.

[Prior Art] Subspace clustering of genes and experimental conditions in gene microarrays has been shown to help in understanding gene function, gene regulation, cellular processes, and cell subtypes.
In most cases a disease involves multiple genes, so researchers keep looking for genes that show similar expression under certain conditions as a basis for diagnosing the disease. Among conventional methods that support subspace clustering in gene microarrays, common ones such as pCluster and zCluster find subspace clusters in which certain genes show coherent expression under certain conditions. However, both of these methods include very time-consuming steps, namely constructing the maximum dimension sets of gene pairs and distributing the gene information to each node of a prefix tree. The distribution process of the conventional pCluster and zCluster subspace clustering methods is therefore complicated, and they require a large search space and a long search time. [...] It is therefore necessary to provide a method for mining biclusters from DNA microarray data using a condition-enumeration tree to solve the above problems.

[Summary of the Invention] The present invention provides a method for mining biclusters from DNA microarray data using a condition-enumeration tree, wherein the DNA microarray is an array formed from a plurality of genes and a plurality of items of condition information for each gene.
The genes have sequential gene numbers, and the condition information has sequential condition numbers. The method comprises the following steps: (a) calculating a plurality of condition-pair maximum dimension sets (MDSs) and a condition count table, wherein each condition-pair MDS includes two items of condition information and a plurality of bit strings, and the condition count table includes the condition information, the number of condition-pair MDSs in which each item of condition information appears, and the related condition set of each item of condition information within the condition-pair MDSs; (b) deleting part of the condition information from the condition count table according to a minimum condition number; and (c) building a condition-enumeration tree from the condition-pair MDSs and the pruned condition count table using a local breadth-first within global depth-first method, in order to calculate the condition set and the related gene bit string set of each node of the condition-enumeration tree.

The method of the present invention does not need to construct the gene-pair maximum dimension sets of the prior art, so it avoids the complicated distribution process, and it uses the concept of a hash join to find the biclustering results efficiently. In addition, the condition-enumeration tree is built with the local breadth-first within global depth-first method. The method thereby avoids the complicated distribution process and greatly reduces the search space and search time.
Furthermore, the method of the present invention can be applied to biological database analysis websites hosted on the World Wide Web (WWW). Users can perform cluster analysis on gene microarray data with the method, and research organizations (for example, biomedical technology companies that study gene microarrays) can use the clustering results to help understand the relationships between genes.

[Embodiments] The method of the present invention for mining biclusters from DNA microarray data using a condition-enumeration tree is described below with reference to the drawings. FIG. 1 shows a schematic diagram of the DNA microarray data of an embodiment of the invention. In this embodiment, the DNA microarray is an array formed from a plurality of genes and a plurality of items of condition information for each gene; the genes have sequential gene numbers, and the condition information has sequential condition numbers.

First, the method calculates a plurality of condition-pair MDSs and a condition count table, wherein each condition-pair MDS includes two items of condition information and a plurality of bit strings, and the condition count table includes the condition information, the number of condition-pair MDSs in which each item of condition information appears, and the related condition set of each item of condition information within the condition-pair MDSs.
To calculate the condition-pair MDSs, the method first calculates, for each numbered gene, the difference between its condition information under every two condition numbers. FIG. 2 shows the differences calculated for each numbered gene under conditions a and b. Next, the differences are arranged from smallest to largest into a difference ordering, and the corresponding genes are arranged according to that ordering, as shown in FIG. 3.

Then, the difference ordering is divided into groups according to a tolerance error value, and each group is represented as a bit string according to the gene numbers of the genes in the group. Referring to FIG. 4, the genes in the group whose difference between conditions a and b is 1 are genes 0, 2, 4, and 6, so bits 0, 2, 4, and 6 of the corresponding bit string are 1 and the other bits are 0; the genes in the group whose difference between conditions a and b is 10 are genes 1, 3, and 5, so bits 1, 3, and 5 of the corresponding bit string are 1 and the other bits are 0.

Finally, a minimum gene number (NR) is used to decide whether to delete each bit string, which yields the condition-pair MDSs. Each condition-pair MDS includes a condition pair and a maximum dimension set; each condition pair is the two items of condition information whose difference was taken, and each maximum dimension set is the corresponding bit strings. In this embodiment the minimum gene number (NR) is 3, so a bit string with fewer than 3 bits set to 1 is deleted. Neither of the bit strings above (1010101 and 0101010) has fewer than 3 bits set to 1, so the ab condition-pair MDS includes the two items of condition information (a and b) and two bit strings (1010101 and 0101010).

The condition-pair MDSs for every two items of condition information in this embodiment are calculated in the same way, and the condition count table is then calculated from the condition-pair MDSs, as shown in FIG. 5.
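The calculation of one condition-pair MDS described above (difference, sort, group by tolerance, encode as bit strings, filter by minimum gene number) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `condition_pair_mds`, the sample expression values, and the choice to form groups as maximal runs of sorted differences whose spread does not exceed the tolerance are all assumptions made for the example. The sample data is invented so that the a/b differences match the FIG. 4 description (1 for genes 0, 2, 4, 6 and 10 for genes 1, 3, 5).

```python
def condition_pair_mds(expr, c1, c2, delta, min_genes):
    """Return the bit strings of the MDS for the condition pair (c1, c2).

    expr      -- list of dicts, one per gene, mapping condition name to value
    delta     -- tolerance error value (assumed: max spread inside one group)
    min_genes -- minimum gene number; groups with fewer genes are dropped
    """
    n = len(expr)
    # Per-gene difference between the two conditions, sorted ascending.
    diffs = sorted((row[c1] - row[c2], gene) for gene, row in enumerate(expr))
    bit_strings = []
    start = 0
    for end in range(n):
        # Shrink the window until its spread is within the tolerance.
        while diffs[end][0] - diffs[start][0] > delta:
            start += 1
        # Keep the window only if it cannot be extended to the right.
        at_right_edge = end == n - 1 or diffs[end + 1][0] - diffs[start][0] > delta
        if at_right_edge and end - start + 1 >= min_genes:
            genes = {gene for _, gene in diffs[start:end + 1]}
            # Bit g (counted from the left) is 1 when gene g is in the group.
            bit_strings.append("".join("1" if g in genes else "0" for g in range(n)))
    return bit_strings


# Hypothetical data: even genes differ by 1 between a and b, odd genes by 10.
expr = [{"a": 5, "b": 4} if g % 2 == 0 else {"a": 20, "b": 10} for g in range(7)]
print(condition_pair_mds(expr, "a", "b", delta=0, min_genes=3))
# prints ['1010101', '0101010']
```

Raising `min_genes` to 4 in this sketch would drop the second bit string, since only three genes share the difference 10.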
Specifically, the number of condition-pair MDSs in which each item of condition information appears is calculated from the condition pairs, and for each item of condition information, the other condition information in each of its condition pairs forms its related condition set. The condition count table thus includes the condition information, the number of condition-pair MDSs in which it appears, and its related condition set. For example, condition a appears in 4 condition-pair MDSs, and its related condition set is b, c, d, and e.

Next, part of the condition information is deleted from the condition count table according to a minimum condition number. If the number of condition-pair MDSs in which an item of condition information appears is less than the minimum condition number minus 1, that condition information is deleted, and it is also removed from the related condition sets of the other condition information. FIG. 6 shows the deletion of part of the condition information from the condition count table. In this embodiment the minimum condition number is 3; because the number of condition-pair MDSs in which condition d appears is less than the minimum condition number minus 1 (that is, less than 2), condition d is deleted, together with condition d in the related condition sets of the other conditions, for example in the related condition set of condition a.

Then, according to the condition-pair MDSs and the pruned condition count table, a condition-enumeration tree is built using a local breadth-first within global depth-first method, in order to calculate the condition set and the related gene bit string set of each node of the tree. FIG. 7 shows the building of the condition-enumeration tree from the condition-pair MDSs and the pruned condition count table.
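The condition count table and its pruning rule can be illustrated with a short sketch. The names `build_count_table` and `prune_count_table`, the dictionary layout, and the list of condition pairs are hypothetical; the pairs were chosen only so that condition a has count 4 and condition d falls below the threshold, as in the embodiment. Repeating the pruning pass until nothing changes is also an assumption, since the text describes a single deletion.

```python
def build_count_table(pair_mds):
    """Map each condition to its related condition set.

    pair_mds -- dict mapping a condition pair such as ("a", "b") to the bit
                strings of its MDS; pairs with no bit strings are ignored.
    The count of a condition is the size of its related condition set.
    """
    related = {}
    for (c1, c2), bit_strings in pair_mds.items():
        if bit_strings:
            related.setdefault(c1, set()).add(c2)
            related.setdefault(c2, set()).add(c1)
    return related


def prune_count_table(related, min_conditions):
    """Drop conditions whose count is below min_conditions - 1."""
    pruned = {c: set(r) for c, r in related.items()}
    changed = True
    while changed:  # assumption: repeat until the table is stable
        changed = False
        for cond in list(pruned):
            if len(pruned[cond]) < min_conditions - 1:
                del pruned[cond]
                for others in pruned.values():
                    others.discard(cond)  # also remove it from related sets
                changed = True
    return pruned


# Hypothetical condition pairs; the bit strings are placeholders.
pairs = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"),
         ("b", "c"), ("b", "e"), ("c", "e")]
table = build_count_table({pair: ["1010101"] for pair in pairs})
pruned = prune_count_table(table, min_conditions=3)
print(sorted(table["a"]), sorted(pruned["a"]))
# prints ['b', 'c', 'd', 'e'] ['b', 'c', 'e']
```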
FIGS. 8 and 9 show the steps of building the condition-enumeration tree with the local breadth-first within global depth-first method, and the levels of the tree. The building of the tree is described below.

First, the items of condition information in the condition count table become the first-level nodes of the condition-enumeration tree, as shown in step 1 of FIGS. 8 and 9. Next, one node is selected as the join base (join-base), and the other nodes are placed among the other join nodes (others). The join base is paired with each of the other nodes among the other join nodes to generate at least one expansion node (expansion); each expansion node has at least two items of condition information and is placed on the level below the selected node. As shown in step 2 of FIGS. 8 and 9, for example, a is the join base and the other nodes (b, c, e) are placed among the other join nodes; pairing the join base with them generates three expansion nodes (ab, ac, ae), each with two items of condition information, and the expansion nodes (ab, ac, ae) are placed on the level below the selected node a (the second level, level 2).

Then, one of the expansion nodes is selected as the join base and the other expansion nodes are placed among the other join nodes. The join base is paired with each of the other expansion nodes to generate at least one expansion node; each new expansion node is placed on the level below the selected expansion node. As shown in step 3 of FIGS. 8 and 9, for example, ab is selected as the join base and the other expansion nodes (ac, ae) are placed among the other join nodes; pairing generates two expansion nodes (abc, abe), each with three items of condition information, and the two expansion nodes are placed on the level below the selected expansion node ab (the third level, level 3). With reference to FIGS. 8 and 9, these steps are repeated to build every node of the condition-enumeration tree, and the condition information of each node forms its condition set.

Next, the related gene bit string set of each node of the condition-enumeration tree is calculated from the condition information in the node's condition set and the bit strings of the corresponding condition-pair MDSs. In this embodiment, the related gene bit string set of a node can be calculated by applying a cross logical AND operation (⊗) to the bit strings of the corresponding condition-pair MDSs.
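The expansion order described above can be sketched in a few lines. This is only an illustration of the "local breadth-first within global depth-first" order: the function name `enumerate_condition_sets` is hypothetical, and the sketch enumerates condition sets only, without the gene bit strings or any of the pruning that the method applies while building the tree.

```python
def enumerate_condition_sets(conditions):
    """List condition sets in local breadth-first within global depth-first order."""
    order = [(c,) for c in conditions]  # step 1: the first-level nodes

    def expand(siblings):
        for i, join_base in enumerate(siblings):
            others = siblings[i + 1:]
            # Local breadth-first: create every child of the join base first.
            children = [join_base + (other[-1],) for other in others]
            order.extend(children)
            # Global depth-first: descend into those children before moving
            # on to the next join base on this level.
            expand(children)

    expand(list(order))
    return ["".join(node) for node in order]


print(enumerate_condition_sets("abce"))
# prints ['a', 'b', 'c', 'e', 'ab', 'ac', 'ae', 'abc', 'abe', 'abce',
#         'ace', 'bc', 'be', 'bce', 'ce']
```

In this order the three children of a (ab, ac, ae) appear together before the children of ab (abc, abe), which matches steps 2 and 3 of the description.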

Referring to the condition-pair MDSs of FIG. 5 and to FIG. 10, and taking node abc as an example, the related gene bit string set of node abc can be expressed as $GS_{abc} = G_{ab} \otimes G_{ac} \otimes G_{bc}$, where $G_{ab}$ is the bit strings of the ab condition-pair MDS (1010101, 0101010), $G_{ac}$ is the bit strings of the ac condition-pair MDS (1010100, 0101010), and $G_{bc}$ is the bit strings of the bc condition-pair MDS (1010100, 0101010). Applying the cross logical AND operation (⊗) to $G_{ab}$, $G_{ac}$, and $G_{bc}$ gives the related gene bit string set of node abc, $GS_{abc}$ = (1010100, 0101010). The related gene bit string set $GS_{abe}$ of node abe can be calculated in the same way.

Because the method of the present invention builds the condition-enumeration tree with the local breadth-first within global depth-first method, the related gene bit string set of a node can also be calculated in another way. For a node whose condition set has n items of condition information, the cross logical AND operation is applied to three operands: the bit strings obtained for the first through (n-1)-th items of condition information, the bit strings obtained for the first through (n-2)-th items together with the n-th item, and the bit strings of the condition-pair MDS of the (n-1)-th and n-th items. As shown in FIG. 11, taking node abcd as an example, the related gene bit string set of node abcd can be expressed as $GS_{abcd} = GS_{abc} \otimes GS_{abd} \otimes G_{cd}$. Because the tree is built with the local breadth-first within global depth-first method, nodes abc and abd have already been built before node abcd is built, so this method can be used. However many items of condition information a node has, the number of cross logical AND operations is fixed at two, which reduces computation time and computational complexity.

A conventional cross logical AND operation must combine every bit string in one set with every bit string in the other. Referring to FIG. 12, set A has four bit strings and set B has four bit strings, so a cross logical AND of set A and set B requires 16 logical AND operations. If the similar bit strings in set A are grouped, the similar bit strings in set B are grouped, and the cross logical AND is applied only between similar groups of the two sets, as shown in FIG. 13, then only 8 logical AND operations are needed.
Using the similar-group idea above, the present invention can calculate the related gene bit string set of a node of the condition-enumeration tree in yet another way. The bit strings of the corresponding condition information in the condition-pair MDSs are divided into groups to make a signature table; the signature table has at least one signature bit string (Sig) and at least one bit string set (BSS). The signature bit strings of the signature tables of two corresponding items of condition information are combined with a logical AND; if the number of 1s in the result is greater than a minimum gene number, the bit strings in the corresponding bit string sets of the two signature tables are combined with the cross logical AND operation to calculate the related gene bit string set of the node.

Referring to FIG. 14, suppose a node X has three bit strings in the condition-pair MDSs (1010101, 1010010, 0101010), and the three bit strings are to be grouped into a signature table. The first bit string (1010101) is placed in the first group as both its signature bit string (Sig) and a member of its bit string set (BSS). The second bit string (1010010) is combined with the signature bit string (Sig, that is, the first bit string 1010101) by a logical OR; the number of 1s in the OR result, minus the number of 1s in the signature bit string, is then compared with a threshold T (for example, T is 1). If the value is less than or equal to T, the bit string belongs to the same group: the OR result replaces the signature bit string, and the second bit string (1010010) is added to the bit string set (BSS). Next, the third bit string (0101010) is ORed with the signature bit string (Sig, now 1010111); the number of 1s in the result minus the number of 1s in the signature bit string is greater than the threshold, so the third bit string is not in the same group as the first and second bit strings. The third bit string therefore becomes the signature bit string (Sig) of another group and a member of that group's bit string set (BSS).

FIG. 15 shows two signature tables produced by grouping. When the two signature tables are combined, the first signature bit string (1010101) of the first signature table S1 is combined with the first signature bit string of the second signature table by a logical AND. If the number of 1s in the result is greater than the minimum gene number (NR), the two groups are similar, and the bit strings in the bit string sets of the first and second signature tables are combined with the cross logical AND operation. This grouping method reduces the number of logical AND operations, which lowers computation time and computational complexity.

When building the condition-enumeration tree, the method of the present invention further includes a first restriction step: if the maximum number of items of condition information of all the nodes under a node is less than the minimum condition number, the nodes under that node need not be built. Referring to FIG. 16, the maximum number of items of condition information of all the nodes under node c is 2, which is less than the minimum condition number of this embodiment (3), so node ce under node c need not be built.
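The signature-table grouping for node X can be traced with the sketch below. The name `build_signature_table` and the returned layout are hypothetical. The text only describes comparing a new bit string with an existing signature, so trying the groups in order and joining the first one that fits is an assumption of this example.

```python
def build_signature_table(bit_strings, threshold):
    """Group similar bit strings; return (signature, members) per group.

    A bit string joins a group when OR-ing it into the group's signature
    adds at most `threshold` new 1 bits; the OR result then becomes the
    group's new signature.
    """
    table = []  # each entry: [signature as int, list of member bit strings]
    for bits in bit_strings:
        value = int(bits, 2)
        for entry in table:
            merged = entry[0] | value
            extra_ones = bin(merged).count("1") - bin(entry[0]).count("1")
            if extra_ones <= threshold:
                entry[0] = merged
                entry[1].append(bits)
                break
        else:
            # No existing group fits: the bit string starts its own group.
            table.append([value, [bits]])
    width = len(bit_strings[0]) if bit_strings else 0
    return [(format(sig, f"0{width}b"), members) for sig, members in table]


print(build_signature_table(["1010101", "1010010", "0101010"], threshold=1))
# prints [('1010111', ['1010101', '1010010']), ('0101010', ['0101010'])]
```

The first group's signature grows from 1010101 to 1010111 when the second bit string joins, which is the value the description compares the third bit string against.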
The method of the present invention further includes a second restriction step: if all the condition information of a first judgment node includes all the condition information of a second judgment node, and the related gene bit string set of the first judgment node is equal to the related gene bit string set of the second judgment node, the nodes under the second judgment node need not be built. Referring to FIG. 17, if all the condition information of the first judgment node Y (abce) includes all the condition information of the second judgment node X, and the related gene bit string set of the first judgment node Y (1010100, 0101010) is equal to the related gene bit string set of the second judgment node X (1010100, 0101010), then node ace under the second judgment node X need not be built.

The method of the present invention further includes a third restriction step, which applies when all the condition information of a first judgment node includes all the condition information of all the nodes under a second judgment node. Even though the related gene bit string set of the first judgment node is not equal to that of the second judgment node, if the related gene bit string set of the first judgment node is equal to the related gene bit string set of a third judgment node, and all the condition information of all the nodes under the second judgment node includes all the condition information of the third judgment node, the nodes under the second judgment node need not be built. Referring to FIG. 18, suppose all the condition information of the first judgment node Y includes all the condition information (ace) of all the nodes under the second judgment node X. The related gene bit string set of the first judgment node Y (1010100, 0101010) is not equal to the related gene bit string set of the second judgment node X (1111111), but it is equal to the related gene bit string set of the third judgment node Z (1010100, 0101010), and all the condition information (ace) of all the nodes under the second judgment node X includes all the condition information (ae) of the third judgment node Z, so node ace need not be built.

The method of the present invention does not need to construct the gene-pair maximum dimension sets of the prior art, so it avoids the complicated distribution process, and it uses the concept of a hash join to find the biclustering results efficiently. In addition, the condition-enumeration tree is built with the local breadth-first within global depth-first method.
The method of the present invention thereby avoids the complicated distribution process and greatly reduces the search space and search time. Furthermore, the method can be applied to biological database analysis websites hosted on the World Wide Web (WWW): users can perform cluster analysis on gene microarray data with the method, and research organizations (for example, biomedical technology companies that study gene microarrays) can use the clustering results to help understand the relationships between genes.

The above embodiments merely illustrate the principles and effects of the present invention and do not limit it. Modifications and variations made to the above embodiments by those skilled in the art therefore do not depart from the spirit of the invention. The scope of rights of the present invention is as set out in the claims below.

[Brief Description of the Drawings]
FIG. 1 shows a schematic diagram of the DNA microarray data of an embodiment of the present invention;

FIG. 2 shows a schematic diagram of calculating the differences of the condition information of each numbered gene under conditions a and b;
FIG. 3 shows a schematic diagram of arranging the differences from smallest to largest into a difference ordering;
FIG. 4 shows a schematic diagram of generating the corresponding bit strings from the difference ordering;
FIG. 5 shows a schematic diagram of calculating the condition count table from the condition-pair MDSs;
FIG. 6 shows a schematic diagram of deleting part of the condition information from the condition count table;
FIG. 7 shows a schematic diagram of building the condition-enumeration tree;
FIGS. 8 and 9 show schematic diagrams of the steps of building the condition-enumeration tree with the local breadth-first within global depth-first method, and of its levels;
FIG. 10 shows a schematic diagram of calculating a related gene bit string set with the cross logical AND operation;
FIG. 11 shows a schematic diagram of calculating a related gene bit string set with two cross logical AND operations according to the present invention;
FIG. 12 shows a schematic diagram of the conventional cross logical AND operation over all the bit strings in the sets;
FIG. 13 shows a schematic diagram of the cross logical AND operation over all the bit strings of similar groups in the sets;
FIG. 14 shows a schematic diagram of making the signature table of a node X by grouping;


Claims

VII. Scope of the Patent Application:

1. A method for mining biclusters from DNA microarray data using a condition-enumeration tree, wherein the DNA microarray is an array composed of a plurality of genes and a plurality of pieces of condition information for each gene, the method comprising the following steps:
(a) calculating a plurality of condition-pair maximum dimension sets and a condition count table, wherein each condition-pair maximum dimension set includes two pieces of condition information and at least one bit string, and the condition count table includes the condition information, the number of times each piece of condition information appears in the condition-pair maximum dimension sets, and the related condition set of each piece of condition information in the condition-pair maximum dimension sets;
(b) deleting part of the condition information from the condition count table according to a minimum condition information value; and
(c) according to the condition-pair maximum dimension sets and the condition count table after deletion, building a condition-enumeration tree using a local breadth-first within global depth-first method, so as to calculate the condition set and the related gene bit string set of the nodes of the condition-enumeration tree.

2. The method of claim 1, wherein calculating the condition-pair maximum dimension sets in step (a) comprises the following steps:
(a1) the genes having sequential gene numbers and the condition information having sequential condition numbers, calculating, for each numbered gene, the difference of its condition information under every two condition numbers;
(a2) arranging the differences from small to large as a difference ordering, and arranging the corresponding genes according to the difference ordering;
(a3) grouping the difference ordering according to a tolerance error value, and, according to the gene numbers corresponding to each grouped difference ordering, representing the grouped difference ordering as a bit string; and
(a4) determining, according to a minimum gene value, whether to delete the bit string, so as to calculate the condition-pair maximum dimension sets, each condition-pair maximum dimension set including a condition pair and a maximum dimension set, each condition pair being the two pieces of condition information of the corresponding difference, and each maximum dimension set being the bit string.

3. The method of claim 2, wherein calculating the condition count table in step (a) comprises the following steps:
(a5) calculating the number of times each piece of condition information appears in the condition pairs of the condition-pair maximum dimension sets; and
(a6) taking, for each piece of condition information, the other condition information of the condition pairs in which it appears as its related condition set.

4. The method of claim 1, wherein in step (b), if the number of times a piece of condition information appears in the condition-pair maximum dimension sets is less than the minimum condition information value minus one, that condition information is deleted, together with its occurrences in the related condition sets of the other condition information.

5. The method of claim 1, wherein building the condition-enumeration tree in step (c) comprises the following steps:
(c1) taking the condition information of the condition count table as the first-level nodes of the condition-enumeration tree;
(c2) selecting one of the nodes as a basic join node, the other nodes being placed among its other join nodes, and pairing the basic join node with the other join nodes to generate at least one extension node, each extension node having at least two pieces of condition information, the extension nodes being placed at the level below the selected node;
(c3) selecting one of the extension nodes as the basic join node, the other extension nodes being placed among the other join nodes, and pairing the basic join node with the other extension nodes among the other join nodes to generate at least one extension node, each extension node having at least two pieces of condition information, the extension nodes being placed at the level below the selected extension node; and
(c4) repeating step (c3) to build the nodes of the condition-enumeration tree, the condition information of each node being calculated as its condition set.

6. The method of claim [number illegible in source], wherein in step (c), the condition information in the condition set of a node of the condition-enumeration tree, together with the plurality of bit strings of the condition-pair maximum dimension sets of the corresponding condition information, is used to calculate the related gene bit string set of that node.

7. The method of claim 6, wherein in step (c), an interactive logical AND operation is applied to the plurality of bit strings of the condition-pair maximum dimension sets of the corresponding condition information to calculate the related gene bit string set of the node of the condition-enumeration tree.

8. The method of claim 6, wherein in step (c), the condition set of the node of the condition-enumeration tree has n pieces of condition information, and the interactive logical AND operation is applied to [operands partly illegible in source; the legible portion names the plurality of bit strings of the condition-pair maximum dimension set of the first and n-th condition information through the plurality of bit strings of the condition-pair maximum dimension set of the (n-1)-th and n-th condition information] to calculate the related gene bit string set of the node of the condition-enumeration tree.

9. The method of claim 6, wherein in step (c), the plurality of bit strings of the condition-pair maximum dimension sets of the corresponding condition information are grouped to create a signature table, the signature table having at least one signature bit string and at least one bit string subset; a logical AND operation is applied to the signature bit strings of the signature tables of the corresponding condition information; and if the number of [illegible in source] in the result of the logical AND operation is greater than [threshold illegible in source], the interactive logical AND operation is applied to the bit strings in the corresponding bit string subsets to calculate the related gene bit string set of the node of the condition-enumeration tree.

10. The method of claim [number illegible in source], wherein step (c) further includes a determination step: if [quantity illegible in source] of a node is less than the minimum condition information value, no node need be built under that node.

11. The method of claim 5, further including a determination step: if all the condition information of a second judgment node includes all the condition information of a first judgment node, and the related gene bit string set of the first judgment node is equal to the related gene bit string set of the second judgment node, no node need be built under the first judgment node.

12. The method of claim 5, further including a determination step: if all the condition information of a second judgment node includes all the condition information of a first judgment node and, although the related gene bit string set of the first judgment node is not equal to the related gene bit string set of the second judgment node, the related gene bit string set of the second judgment node is equal to the related gene bit string set of a third judgment node, and all the condition information of the third judgment node includes all the condition information of the second judgment node, no node need be built under the second judgment node.
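The pruning of the condition count table and the interactive logical AND operation recited in the claims above can be illustrated with a short Python sketch. It is a hedged example rather than the claimed implementation: the function names, the dictionary layout of the count table, the repetition of the pruning pass until nothing changes, and the filtering of AND results by a minimum number of set bits are all assumptions made for illustration. Bit strings are held as integers, with bit g standing for gene g.

```python
from functools import reduce


def build_related_sets(condition_pairs):
    """Map each condition to the conditions it is paired with.

    The occurrence count of a condition is len(related[condition]).
    """
    related = {}
    for c1, c2 in condition_pairs:
        related.setdefault(c1, set()).add(c2)
        related.setdefault(c2, set()).add(c1)
    return related


def prune_conditions(related, min_conditions):
    """Drop conditions whose count is below min_conditions - 1.

    Assumption: the pass is repeated until the table is stable.
    """
    related = {c: set(r) for c, r in related.items()}
    changed = True
    while changed:
        changed = False
        for c in list(related):
            if len(related[c]) < min_conditions - 1:
                # Remove c and erase it from the other related sets.
                for other in related.pop(c):
                    related[other].discard(c)
                changed = True
    return related


def and_bitstring_sets(sets, min_genes):
    """Cross-AND several sets of gene bit strings.

    Assumption: results with fewer than min_genes set bits are discarded.
    """
    def cross(left, right):
        out = set()
        for x in left:
            for y in right:
                z = x & y
                if bin(z).count("1") >= min_genes:
                    out.add(z)
        return out

    return sorted(reduce(cross, sets[1:], set(sets[0])))


table = build_related_sets([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])
pruned = prune_conditions(table, min_conditions=3)

# Hypothetical bit strings for the pairs (a, b), (a, c) and (b, c).
node_abc = and_bitstring_sets(
    [[0b1010101, 0b0101010], [0b0010111], [0b1010110, 0b0001011]],
    min_genes=2,
)
print(sorted(pruned), [format(bits, "07b") for bits in node_abc])
```

In this sample, condition d appears in only one pair and is dropped, and the node with conditions a, b and c retains the single bit string covering genes 2 and 4.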
