TWI430128B

TWI430128B - Method for mining biclusters from DNA microarray data using condition-enumeration tree

Info

Publication number: TWI430128B
Application number: TW99102695A
Authority: TW
Inventors: Ye In Chang; Jiun Rung Chen
Original assignee: Univ Nat Sun Yat Sen
Priority date: 2010-01-29
Filing date: 2010-01-29
Publication date: 2014-03-11
Also published as: TW201126365A

Description

Method for mining biclusters from DNA microarray data using a condition-enumeration tree

The present invention relates to a method for mining biclusters from data, and more particularly to a method for mining biclusters from DNA microarray data using a condition-enumeration tree.

Subspace clustering techniques that cluster both genes and experimental conditions in gene microarrays have been shown to help in understanding gene function, gene regulation, cellular processes, and cell subtypes. In most cases a disease involves multiple genes, so researchers keep looking for genes that show similar expression under certain conditions, as a basis for diagnosing disease.

Among prior-art methods that support subspace clustering in gene microarrays, pCluster and zCluster are common; both find subspace clusters in which certain genes behave coherently under certain conditions. However, both methods include very time-consuming steps, namely constructing the gene-pair maximum dimension sets (MDSs) and distributing gene information onto every node of a prefix tree. As a result, their distribution process is complicated and requires a large search space and a long search time. Another prior-art method, MicroCluster, converts the problem into a graph problem, but then has to solve an NP-complete problem called Maximal Clique.

Therefore, a novel and inventive method for mining biclusters from DNA microarray data using a condition-enumeration tree is needed to solve the above problems.

The present invention provides a method for mining biclusters from DNA microarray data using a condition-enumeration tree, wherein the DNA microarray is an array formed of a plurality of genes and a plurality of conditions for each gene, the genes having sequential gene numbers and the conditions having sequential condition numbers. The method comprises the following steps: (a) computing a plurality of condition-pair maximum dimension sets (MDSs) and a condition count table, wherein each condition-pair MDS comprises two conditions and a plurality of bit strings, and the condition count table comprises each condition, the number of condition-pair MDSs in which the condition appears, and the set of related conditions of the condition in the condition-pair MDSs; (b) deleting some conditions from the condition count table according to a minimum number of conditions; and (c) building a condition-enumeration tree from the condition-pair MDSs and the pruned condition count table, using a local breadth-first within global depth-first method, to compute the condition set and the related gene bit string set of each node of the condition-enumeration tree.

The method of the present invention does not need to construct the gene-pair MDSs of the prior art, which avoids the complicated distribution process, and it uses the idea of a hash join to find the biclusters efficiently. In addition, the condition-enumeration tree is built with a local breadth-first within global depth-first method. The method thereby avoids the complicated distribution process and greatly reduces the search space and search time.

Moreover, the method can be applied to a biological database analysis website hosted on the World Wide Web (WWW). Users can run cluster analysis on gene microarray data with the method, and research organizations (for example, biomedical technology companies that study gene microarrays) can use the clustering results to help understand the relationships among genes.

The method of the present invention for mining biclusters from DNA microarray data using a condition-enumeration tree is described below with reference to the drawings. FIG. 1 is a schematic diagram of the DNA microarray data of an embodiment of the present invention. In this embodiment, the DNA microarray is an array formed of a plurality of genes and a plurality of conditions for each gene; the genes have sequential gene numbers and the conditions have sequential condition numbers.

First, the method computes a plurality of condition-pair MDSs and a condition count table. Each condition-pair MDS comprises two conditions and a plurality of bit strings. The condition count table comprises each condition, the number of condition-pair MDSs in which the condition appears, and the set of related conditions of the condition in the condition-pair MDSs.

To compute the condition-pair MDSs, the method first computes, for every numbered gene, the difference between its values under each pair of condition numbers. FIG. 2 shows the differences computed for each gene under the two conditions a and b. The differences are then sorted in ascending order into a difference ordering, and the corresponding genes are arranged in the same order, as shown in FIG. 3.

Next, the difference ordering is divided into groups according to a tolerance value, and each group is represented by a bit string built from the gene numbers of its members. Referring to FIG. 4, the group whose difference between a and b is 1 contains genes 0, 2, 4, and 6, so bits 0, 2, 4, and 6 of its bit string are 1 and all other bits are 0. The group whose difference between a and b is 10 contains genes 1, 3, and 5, so bits 1, 3, and 5 of its bit string are 1 and all other bits are 0.

Then, according to a minimum number of genes (NR), the method decides whether to delete each bit string, which yields the condition-pair MDSs. Each condition-pair MDS comprises a condition pair and a maximum dimension set: the condition pair is the two conditions from which the differences were computed, and the maximum dimension set is the surviving bit strings. In this embodiment NR is 3, so a bit string with fewer than 3 bits set to 1 is deleted. The two bit strings above (1010101 and 0101010) have 4 and 3 bits set to 1, respectively, so neither is deleted. The condition-pair MDS of a and b therefore comprises the two conditions a and b and the two bit strings (1010101 and 0101010).
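The computation of one condition-pair MDS can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `pair_mds`, the expression values, and the grouping rule (a group holds differences within `tolerance` of its smallest member) are hypothetical, since the text does not spell out the exact grouping rule or the data of FIG. 1.

```python
def pair_mds(values_x, values_y, tolerance, min_genes):
    """Bit strings of the gene groups found for one condition pair."""
    n = len(values_x)
    # (a1) + (a2): per-gene difference, sorted ascending with its gene number
    diffs = sorted((x - y, gene)
                   for gene, (x, y) in enumerate(zip(values_x, values_y)))
    groups = []
    for diff, gene in diffs:
        # (a3) assumed rule: stay in the current group while the difference
        # is within `tolerance` of the group's smallest difference
        if groups and diff - groups[-1]["low"] <= tolerance:
            groups[-1]["genes"].add(gene)
        else:
            groups.append({"low": diff, "genes": {gene}})
    # Bit i (counted from the left) is 1 when gene i belongs to the group
    strings = ["".join("1" if g in grp["genes"] else "0" for g in range(n))
               for grp in groups]
    # (a4): delete bit strings with fewer than min_genes bits set to 1
    return [s for s in strings if s.count("1") >= min_genes]


# Made-up expression values whose a-b differences are 1 or 10
cond_a = [2, 12, 3, 13, 4, 14, 5]
cond_b = [1, 2, 2, 3, 3, 4, 4]
mds_ab = pair_mds(cond_a, cond_b, tolerance=0, min_genes=3)
print(mds_ab)  # ['1010101', '0101010']
```

With these values the two groups reproduce the bit strings of the a and b example; raising `min_genes` to 4 would delete the second bit string.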

The condition-pair MDSs of all pairs of conditions in this embodiment are computed in the same way, and the condition count table is then computed from them, as shown in FIG. 5. For each condition, the method counts the number of condition pairs of the condition-pair MDSs in which the condition appears, and collects the other condition of each such pair into the set of related conditions. The condition count table thus comprises each condition, its number of appearances in the condition-pair MDSs, and its set of related conditions. For example, condition a appears in 4 condition-pair MDSs, and its related condition set is b, c, d, e.
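A minimal sketch of building the condition count table is given below. The list of condition pairs is hypothetical (the contents of FIG. 5 are not reproduced in the text); it was chosen only so that condition a appears 4 times with related conditions b, c, d, e, as in the example above. The name `build_count_table` is likewise an assumption.

```python
from collections import defaultdict


def build_count_table(condition_pairs):
    """Map each condition to its appearance count and related conditions."""
    related = defaultdict(set)
    for x, y in condition_pairs:
        # Each pair contributes one appearance to both of its conditions
        related[x].add(y)
        related[y].add(x)
    return {cond: {"count": len(others), "related": sorted(others)}
            for cond, others in sorted(related.items())}


# Hypothetical condition pairs that own a non-empty MDS
pairs = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"),
         ("b", "c"), ("b", "e"), ("c", "e")]
table = build_count_table(pairs)
print(table["a"])  # {'count': 4, 'related': ['b', 'c', 'd', 'e']}
```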

Next, some conditions are deleted from the condition count table according to a minimum number of conditions (NC). If the number of condition-pair MDSs in which a condition appears is less than NC minus 1, the condition is deleted, together with its occurrences in the related condition sets of the other conditions. FIG. 6 illustrates this deletion. In this embodiment NC is 3. Condition d appears in only 1 condition-pair MDS, which is less than NC - 1 = 2, so condition d is deleted, along with its occurrences in the related condition sets of other conditions, for example the d in the related condition set of condition a.
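The deletion rule can be sketched as below. The related-condition sets are hypothetical values consistent with the example (d appears once, a appears four times), and the single-pass behavior, in which counts are not re-checked after a deletion, is an assumption, since the text describes only one round of deletion.

```python
def prune_count_table(related, min_conditions):
    """Drop conditions that appear in fewer than NC - 1 condition-pair MDSs."""
    # The count of a condition is the size of its related condition set
    doomed = {cond for cond, others in related.items()
              if len(others) < min_conditions - 1}
    # Remove the doomed conditions and their occurrences in other sets
    return {cond: others - doomed
            for cond, others in related.items() if cond not in doomed}


# Hypothetical related condition sets before pruning
related = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "e"},
    "c": {"a", "b", "e"},
    "d": {"a"},
    "e": {"a", "b", "c"},
}
pruned = prune_count_table(related, min_conditions=3)
print(sorted(pruned))       # ['a', 'b', 'c', 'e']
print(sorted(pruned["a"]))  # ['b', 'c', 'e']
```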

Next, a condition-enumeration tree is built from the condition-pair MDSs and the pruned condition count table, using a local breadth-first within global depth-first method, to compute the condition set and the related gene bit string set of each node of the tree. FIG. 7 shows the condition-enumeration tree built from the condition-pair MDSs and the pruned condition count table.

FIGS. 8 and 9 show the steps of building the condition-enumeration tree with the local breadth-first within global depth-first method, and the levels of the tree. The tree is built as follows. First, the conditions in the condition count table become the first-level nodes of the tree (step 1 in FIGS. 8 and 9). Next, one node is selected as the join base (join-base) and the remaining nodes are placed in the others list (others). The join base is paired with each node in others to generate at least one expansion node (expansion); each expansion node holds at least two conditions and is placed one level below the selected node. As shown in step 2 of FIGS. 8 and 9, for example, a is the join base and the other nodes (b, c, e) are placed in others. Pairing the join base with each of them generates three expansion nodes (ab, ac, ae), each holding two conditions, and these are placed one level below the selected node a (level 2).

One of the expansion nodes is then selected as the join base and the remaining expansion nodes are placed in others. The join base is paired with each expansion node in others to generate at least one new expansion node, which is placed one level below the selected expansion node. As shown in step 3 of FIGS. 8 and 9, for example, ab is selected as the join base and the other expansion nodes (ac, ae) are placed in others. Pairing the join base with each of them generates two expansion nodes (abc, abe), each holding three conditions, and these are placed one level below the selected expansion node ab (level 3).

With reference to FIGS. 8 and 9, the above steps are repeated to build every node of the condition-enumeration tree; the conditions held by each node form its condition set.
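The expansion order described above can be sketched as a short recursive function. It is a simplified illustration: it enumerates condition sets only, ignores the gene bit strings and any pruning, and assumes single-character condition names so that a node such as abc can be written as a string. The name `enumerate_tree` is hypothetical.

```python
def enumerate_tree(level_nodes, created=None):
    """Return expansion nodes in the order they are created."""
    if created is None:
        created = []
    for i, join_base in enumerate(level_nodes):
        others = level_nodes[i + 1:]
        # Local breadth-first: create every child of the join base at once
        expansions = [join_base + other[-1] for other in others]
        created.extend(expansions)
        # Global depth-first: finish this subtree before the next sibling
        enumerate_tree(expansions, created)
    return created


order = enumerate_tree(["a", "b", "c", "e"])
print(order)
# ['ab', 'ac', 'ae', 'abc', 'abe', 'abce', 'ace', 'bc', 'be', 'bce', 'ce']
```

The first three nodes match step 2 (ab, ac, ae) and the next two match step 3 (abc, abe).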

Next, the related gene bit string set of each node of the condition-enumeration tree is computed from the conditions in the node's condition set and the bit strings of the corresponding condition-pair MDSs. In this embodiment, the bit strings of the corresponding condition-pair MDSs are combined with a cross logical AND operation (⊗) to obtain the related gene bit string set of the node.

Referring to the condition-pair MDSs of FIG. 5 and to FIG. 10, and taking node abc as an example, the related gene bit string set of node abc can be expressed as $GS_{abc} = GS_{ab} \otimes GS_{ac} \otimes GS_{bc}$, where $GS_{ab}$ is the bit strings of the condition-pair MDS of ab (1010101, 0101010), $GS_{ac}$ is the bit strings of the condition-pair MDS of ac (1010100, 0101010), and $GS_{bc}$ is the bit strings of the condition-pair MDS of bc (1010100, 0101010). Cross-ANDing $GS_{ab}$, $GS_{ac}$, and $GS_{bc}$ gives the related gene bit string set of node abc, $GS_{abc}$ = (1010100, 0101010). The related gene bit string set $GS_{abe}$ of node abe can be computed in the same way.
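The cross logical AND of the abc example can be sketched as follows. Discarding results with fewer than NR bits set to 1 and removing duplicates are assumptions made for this sketch (they are needed to arrive at the two-string result of the example), and the name `cross_and` is hypothetical.

```python
def cross_and(set_a, set_b, min_genes):
    """AND every bit string of set_a with every bit string of set_b."""
    results = []
    for x in set_a:
        for y in set_b:
            anded = format(int(x, 2) & int(y, 2), f"0{len(x)}b")
            # Assumed filter: keep results with at least min_genes ones,
            # and keep each distinct result only once
            if anded.count("1") >= min_genes and anded not in results:
                results.append(anded)
    return results


gs_ab = ["1010101", "0101010"]
gs_ac = ["1010100", "0101010"]
gs_bc = ["1010100", "0101010"]
gs_abc = cross_and(cross_and(gs_ab, gs_ac, 3), gs_bc, 3)
print(gs_abc)  # ['1010100', '0101010']
```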

Because the method builds the condition-enumeration tree with the local breadth-first within global depth-first method, the related gene bit string set of a node can also be computed in another way. For a node whose condition set holds n conditions, the set is obtained by cross-ANDing three groups of bit strings: those for the first through (n-1)-th conditions, those for the first through (n-2)-th conditions together with the n-th condition, and those of the condition-pair MDS of the (n-1)-th and n-th conditions.

As shown in FIG. 11, and taking node abcd as an example, the related gene bit string set of node abcd can be expressed as $GS_{abcd} = GS_{abc} \otimes GS_{abd} \otimes GS_{cd}$. Because the tree is built with the local breadth-first within global depth-first method, nodes abc and abd have already been built before node abcd is built. With this method the number of cross logical AND operations is fixed at two, regardless of how many conditions the node holds, which reduces computation time and computational complexity.
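The fixed two-operation rule can be sketched with a cache of already computed nodes. The sketch assumes single-character condition names and a hypothetical cache that initially holds only condition-pair MDSs; the bit strings for the pairs involving e are invented for illustration, and filtering by NR inside `cross_and` is an assumption.

```python
def cross_and(set_a, set_b, min_genes):
    """AND every bit string of set_a with every bit string of set_b."""
    results = []
    for x in set_a:
        for y in set_b:
            anded = format(int(x, 2) & int(y, 2), f"0{len(x)}b")
            if anded.count("1") >= min_genes and anded not in results:
                results.append(anded)
    return results


def gene_sets(node, cache, min_genes):
    """Gene bit string set of `node`, using two cross ANDs per new node."""
    if node not in cache:
        left = gene_sets(node[:-1], cache, min_genes)              # conditions 1..n-1
        right = gene_sets(node[:-2] + node[-1], cache, min_genes)  # 1..n-2 and n
        pair = cache[node[-2:]]                                    # n-1 and n
        cache[node] = cross_and(cross_and(left, right, min_genes),
                                pair, min_genes)
    return cache[node]


# Pair MDSs; ab, ac, bc follow the text, the pairs with e are hypothetical
cache = {
    "ab": ["1010101", "0101010"], "ac": ["1010100", "0101010"],
    "bc": ["1010100", "0101010"], "ae": ["1010100", "0101010"],
    "be": ["1010100", "0101010"], "ce": ["1010100", "0101010"],
}
gs_abce = gene_sets("abce", cache, 3)
print(gs_abce)  # ['1010100', '0101010']
```

For node abc the recursion reduces to the three pair MDSs ab, ac, and bc, which agrees with the three-condition formula; for abce it reuses the cached sets of abc and abe.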

A conventional cross logical AND operation must AND every bit string of one set with every bit string of the other. Referring to FIG. 12, set A holds four bit strings and set B holds four bit strings, so cross-ANDing set A and set B takes 16 logical AND operations. If the similar bit strings in set A are grouped, the similar bit strings in set B are grouped, and only the bit strings of similar groups are cross-ANDed, as shown in FIG. 13, then only 8 logical AND operations are needed.

Using this idea of similar groups, the present invention can compute the related gene bit string set of a node of the condition-enumeration tree in yet another way. The bit strings of the corresponding condition-pair MDSs are grouped to build a signature table, which has at least one signature bit string (Sig) and at least one bit string set (BSS). The signature bit strings of the two corresponding signature tables are combined with a logical AND operation. If the number of 1s in the result is greater than a minimum number of genes, the bit strings in the bit string sets of the two signature tables are cross-ANDed to compute the related gene bit string set of the node.

Referring to FIG. 14, suppose a node x has three bit strings (1010101, 1010010, 0101010) in the condition-pair MDSs, and the three bit strings are grouped to build a signature table. The first bit string (1010101) is placed in the signature bit string (Sig) and the bit string set (BSS) of the first group. The second bit string (1010010) is ORed with the signature bit string (Sig, that is, the first bit string 1010101), and the number of 1s in the signature bit string is subtracted from the number of 1s in the OR result. If the difference is less than or equal to a threshold T (for example, T is 1), the bit string belongs to the same group: the OR result replaces the signature bit string (Sig), and the second bit string (1010010) is added to the bit string set (BSS).

Next, the third bit string (0101010) is ORed with the signature bit string (Sig, now 1010111), and the number of 1s in the signature bit string is subtracted from the number of 1s in the OR result. This time the difference is greater than the threshold, which means the third bit string does not belong to the same group as the first and second bit strings. The third bit string therefore becomes the signature bit string (Sig) of another group and is placed in that group's bit string set (BSS).
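The grouping procedure can be sketched as below. The text only walks through a case with one existing group, so trying the existing groups in order and joining the first one that fits is an assumption of this sketch, as are the names `build_signature_table` and `ones`.

```python
def ones(bits):
    """Number of bits set to 1 in an integer."""
    return bin(bits).count("1")


def build_signature_table(bit_strings, threshold):
    """Group bit strings; return (signature, members) for every group."""
    width = len(bit_strings[0])
    table = []  # each entry: [signature as int, list of member bit strings]
    for s in bit_strings:
        value = int(s, 2)
        for entry in table:
            merged = entry[0] | value
            # Join the group if the OR adds at most `threshold` new ones
            if ones(merged) - ones(entry[0]) <= threshold:
                entry[0] = merged
                entry[1].append(s)
                break
        else:
            # No group fits: the bit string starts a group of its own
            table.append([value, [s]])
    return [(format(sig, f"0{width}b"), members) for sig, members in table]


sig_table = build_signature_table(["1010101", "1010010", "0101010"], threshold=1)
print(sig_table)
# [('1010111', ['1010101', '1010010']), ('0101010', ['0101010'])]
```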

FIG. 15 shows two grouped signature tables. When the two signature tables are combined, the first signature bit string (1010101) of the first signature table S1 is ANDed with the first signature bit string (1010100) of the second signature table S2. If the number of 1s in the AND result is greater than the minimum number of genes (NR), the two groups are similar, and the bit strings in the corresponding bit string sets of the first and second signature tables are cross-ANDed. This grouping method reduces the number of logical AND operations and thus the computation time and computational complexity.
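A sketch of combining two signature tables follows. The member bit strings of both tables are hypothetical; only the two first signatures (1010101 and 1010100) come from the text. The text words the similarity test as "greater than" NR; this sketch uses `>=` on the assumption that a group pair whose signature AND has exactly NR ones can still produce a valid result, as it does in this example with NR = 3.

```python
def signature_cross_and(table_1, table_2, min_genes):
    """Cross-AND only the members of groups whose signatures are similar."""
    results, member_ands = [], 0
    for sig_1, members_1 in table_1:
        for sig_2, members_2 in table_2:
            # Cheap test on the signatures before touching the members
            if bin(int(sig_1, 2) & int(sig_2, 2)).count("1") < min_genes:
                continue
            for x in members_1:
                for y in members_2:
                    member_ands += 1
                    anded = format(int(x, 2) & int(y, 2), f"0{len(x)}b")
                    if anded.count("1") >= min_genes and anded not in results:
                        results.append(anded)
    return results, member_ands


# Hypothetical signature tables: (signature, bit string set) per group
s1 = [("1010101", ["1010101", "1010100"]), ("0101010", ["0101010"])]
s2 = [("1010100", ["1010100"]), ("0101010", ["0101010"])]
gene_strings, and_count = signature_cross_and(s1, s2, min_genes=3)
print(gene_strings, and_count)  # ['1010100', '0101010'] 3
```

Here only 3 member-level AND operations are performed, where a plain cross AND of 3 by 2 bit strings would take 6.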

When building the condition-enumeration tree, the method further comprises a first restriction step: if the maximum number of conditions of all nodes under a node is less than the minimum number of conditions, the nodes under that node need not be built. Referring to FIG. 16, the maximum number of conditions of all nodes under node c is 2, which is less than the minimum number of conditions of this embodiment, 3, so node ce under node c need not be built.
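The first restriction step can be sketched by adding one check to a simplified enumeration of condition sets. The bound used here, the size of the join base plus the number of remaining siblings, is an assumption about how the maximum number of conditions under a node is obtained; the sketch also assumes single-character condition names and ignores gene bit strings.

```python
def enumerate_with_limit(level_nodes, min_conditions, created=None):
    """Enumerate expansion nodes, skipping subtrees that cannot reach NC."""
    if created is None:
        created = []
    for i, join_base in enumerate(level_nodes):
        others = level_nodes[i + 1:]
        # Largest condition set reachable under join_base (assumed bound)
        if len(join_base) + len(others) < min_conditions:
            continue
        expansions = [join_base + other[-1] for other in others]
        created.extend(expansions)
        enumerate_with_limit(expansions, min_conditions, created)
    return created


limited = enumerate_with_limit(["a", "b", "c", "e"], min_conditions=3)
print(limited)
# ['ab', 'ac', 'ae', 'abc', 'abe', 'abce', 'ace', 'bc', 'be', 'bce']
```

Under node c only ce could be created, with 2 conditions, so it is skipped, which matches the example of FIG. 16.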

The method further comprises a second restriction step: if all the conditions of a first judged node include all the conditions of a second judged node, and the related gene bit string set of the first judged node equals that of the second judged node, the nodes under the second judged node need not be built. Referring to FIG. 17, all the conditions of the first judged node Y (abce) include all the conditions of the second judged node X (ac), and the related gene bit string set of Y (1010100, 0101010) equals that of X (1010100, 0101010), so node ace under the second judged node X need not be built.
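The test of the second restriction step can be sketched as a small predicate. The function name `can_skip_expansion` and the list of previously built nodes are hypothetical, and comparing the gene bit string sets as unordered sets is an assumption.

```python
def can_skip_expansion(node, node_genes, built_nodes):
    """True if an already built node covers `node` with the same gene sets."""
    return any(
        set(node) <= set(conds) and set(node_genes) == set(genes)
        for conds, genes in built_nodes
    )


# Node Y = abce was built earlier in the depth-first order
built = [("abce", ["1010100", "0101010"])]
skip_ac = can_skip_expansion("ac", ["1010100", "0101010"], built)
skip_ab = can_skip_expansion("ab", ["1010101", "0101010"], built)
print(skip_ac, skip_ab)  # True False
```

Node ac has the same gene bit strings as abce, so ace need not be built; node ab has a different set, so the test does not apply to it.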

The method further comprises a third restriction step, which applies when all the conditions of a first judged node include all the conditions of all nodes under a second judged node. Even though the related gene bit string set of the first judged node is not equal to that of the second judged node, the nodes under the second judged node need not be built if the related gene bit string set of the first judged node equals that of a third judged node and the conditions of all nodes under the second judged node include all the conditions of the third judged node.

Referring to FIG. 18, all the conditions of the first judged node Y (abce) include all the conditions of all nodes under the second judged node X (ace). Although the related gene bit string set of Y (1010100, 0101010) is not equal to that of X (1111111), the related gene bit string set of Y (1010100, 0101010) equals that of the third judged node Z (1010100, 0101010), and the conditions of all nodes under X (ace) include all the conditions of Z (ae). Node ace under the second judged node X therefore need not be built.

The above embodiments merely illustrate the principles and effects of the present invention and do not limit it. Modifications and variations made to the above embodiments by those skilled in the art therefore do not depart from the spirit of the invention. The scope of the invention is as set forth in the claims below.

FIG. 1 is a schematic diagram of the DNA microarray data of an embodiment of the present invention;

FIG. 2 is a schematic diagram of computing, for each numbered gene, the difference between its values under the two conditions a and b;

FIG. 3 is a schematic diagram of sorting the differences in ascending order into a difference ordering;

FIG. 4 is a schematic diagram of generating the corresponding bit strings from the difference ordering;

FIG. 5 is a schematic diagram of computing the condition count table from the condition-pair MDSs;

FIG. 6 is a schematic diagram of deleting some conditions from the condition count table;

FIG. 7 is a schematic diagram of building the condition-enumeration tree from the pruned condition count table;

FIGS. 8 and 9 are schematic diagrams of the steps of building the condition-enumeration tree with the local breadth-first within global depth-first method, and of the levels of the tree;

FIG. 10 is a schematic diagram of computing a related gene bit string set with the cross logical AND operation;

FIG. 11 is a schematic diagram of computing a related gene bit string set with two cross logical AND operations according to the present invention;

FIG. 12 is a schematic diagram of the conventional cross logical AND operation over all bit strings in the sets;

FIG. 13 is a schematic diagram of the cross logical AND operation over the bit strings of similar groups in the sets;

FIG. 14 is a schematic diagram of building the signature table of a node x by grouping;

FIG. 15 is a schematic diagram of combining two grouped signature tables;

FIG. 16 is a schematic diagram of the first restriction step of the present invention;

FIG. 17 is a schematic diagram of the second restriction step of the present invention; and

FIG. 18 is a schematic diagram of the third restriction step of the present invention.

(無元件符號說明)(no component symbol description)

Claims

A method for mining biclusters from DNA microarray data using a condition-enumeration tree, wherein the DNA microarray data is an array of a plurality of genes and a plurality of items of condition information for each gene, the method comprising the steps of: (a) calculating a plurality of condition-pair maximum dimension sets and a condition count table, wherein each condition-pair maximum dimension set includes two items of condition information and at least one bit string, and the condition count table includes each item of condition information, the number of times that condition information appears in the condition-pair maximum dimension sets, and the related condition set of that condition information in the condition-pair maximum dimension sets; (b) deleting part of the condition information from the condition count table according to a minimum condition information value; and (c) according to the condition-pair maximum dimension sets and the condition count table after deletion, building a condition-enumeration tree using a local breadth-first and global depth-first method, so as to calculate, for each node of the condition-enumeration tree, a condition set and a set of related gene bit strings.

The method of claim 1, wherein calculating the condition-pair maximum dimension sets in step (a) comprises the steps of: (a1) the genes having sequential gene numbers and the condition information having sequential condition numbers, calculating, for each numbered gene, the difference between its condition information under every two condition numbers; (a2) sorting the differences in ascending order to form a difference ordering, and arranging the corresponding gene numbers according to the difference ordering; (a3) dividing the difference ordering into groups according to a tolerance error value, arranging the corresponding gene numbers according to the grouped difference ordering, and representing the grouped difference ordering as bit strings; and (a4) determining, according to a minimum gene value, whether to delete each bit string, so as to obtain the condition-pair maximum dimension sets, wherein each condition-pair maximum dimension set includes a condition pair and a maximum dimension set, each condition pair being the two items of condition information of the corresponding difference, and each maximum dimension set being the corresponding bit strings.
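Steps (a1) to (a4) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `condition_pair_mds`, the parameters `delta` (tolerance error value) and `min_genes` (minimum gene value), the sliding-window grouping, and the convention that bit g of an integer stands for gene number g are all assumptions made for illustration.

```python
from itertools import combinations


def condition_pair_mds(data, delta, min_genes):
    """Return {(a, b): [bit strings]} for every condition pair (a, b).

    data[g][c] is the expression value of gene g under condition c.
    Bit g of each returned integer is 1 when gene g belongs to the group
    (an assumed encoding of the bit strings in the claims).
    """
    n_genes = len(data)
    n_conds = len(data[0])
    result = {}
    for a, b in combinations(range(n_conds), 2):
        # (a1)-(a2): per-gene differences, sorted ascending with gene numbers.
        diffs = sorted((data[g][a] - data[g][b], g) for g in range(n_genes))
        masks = []
        start = 0
        for end in range(n_genes):
            # (a3): shrink the window until its spread is within the tolerance.
            while diffs[end][0] - diffs[start][0] > delta:
                start += 1
            # Emit the window only when it cannot be extended to the right.
            is_last = end == n_genes - 1
            if is_last or diffs[end + 1][0] - diffs[start][0] > delta:
                # (a4): drop groups with fewer than min_genes genes.
                if end - start + 1 >= min_genes:
                    mask = 0
                    for _, g in diffs[start:end + 1]:
                        mask |= 1 << g
                    masks.append(mask)
        if masks:
            result[(a, b)] = masks
    return result


# Three genes under three conditions; genes 0 and 1 move together.
example = [[1, 2, 4], [2, 3, 5], [5, 1, 9]]
mds = condition_pair_mds(example, delta=0, min_genes=2)
# mds == {(0, 1): [3], (0, 2): [3], (1, 2): [3]}  (3 == 0b011: genes 0 and 1)
```

Integers are used as bit strings here so that the later AND operations reduce to the `&` operator; any fixed-width bit-vector type would serve the same purpose.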

The method of claim 2, wherein calculating the condition count table in step (a) comprises the steps of: (a5) calculating the number of times each item of condition information appears in the condition pairs of the condition-pair maximum dimension sets; and (a6) for each item of condition information, taking the other condition information with which it is paired in the condition pairs of the condition-pair maximum dimension sets as its related condition set.

The method of claim 1, wherein in step (b), if the number of times an item of condition information appears in the condition-pair maximum dimension sets is less than the minimum condition information value minus 1, that condition information is deleted, and it is also deleted from the related condition sets of the other condition information.
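The counting and deletion steps (a5), (a6) and (b) can be sketched as follows. This is a hedged illustration only: the names `build_count_table`, `prune_count_table` and `min_conds` (minimum condition information value) are hypothetical, and the claims do not state whether counts are recomputed or the deletion repeated afterwards, so this sketch performs a single pass and leaves the remaining counts unchanged.

```python
def build_count_table(condition_pairs):
    """(a5)-(a6): count occurrences and collect related conditions.

    condition_pairs is an iterable of (a, b) pairs that have a
    condition-pair maximum dimension set.
    """
    table = {}
    for a, b in condition_pairs:
        for cond, other in ((a, b), (b, a)):
            entry = table.setdefault(cond, {"count": 0, "related": set()})
            entry["count"] += 1
            entry["related"].add(other)
    return table


def prune_count_table(table, min_conds):
    """(b): delete conditions whose count is below min_conds - 1.

    Assumption: one pass only; surviving counts are not recomputed.
    """
    doomed = {c for c, e in table.items() if e["count"] < min_conds - 1}
    pruned = {}
    for cond, entry in table.items():
        if cond in doomed:
            continue
        pruned[cond] = {
            "count": entry["count"],
            # Also remove deleted conditions from the related sets.
            "related": entry["related"] - doomed,
        }
    return pruned


pairs = [(0, 1), (0, 2), (1, 2), (2, 3)]
table = build_count_table(pairs)
pruned = prune_count_table(table, min_conds=3)
# Condition 3 appears once (1 < 3 - 1), so it is deleted and removed
# from the related set of condition 2.
```

The threshold is the minimum condition information value minus 1 because a condition that belongs to a bicluster of k conditions must be paired with the k - 1 other conditions.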

The method of claim 1, wherein building the condition-enumeration tree in step (c) comprises the steps of: (c1) taking the condition information of the condition count table as the first-layer nodes of the condition-enumeration tree; (c2) selecting one of the nodes as a base join node and placing the other nodes among the other join nodes, and pairing the base join node with the other nodes among the other join nodes to generate at least one extended node, wherein the extended node has at least two items of condition information and is placed in the layer below the selected node; (c3) selecting one of the extended nodes as a base join node and placing the other extended nodes among the other join nodes, and pairing the base join node with the other extended nodes among the other join nodes to generate at least one extended node, wherein the extended node has at least two items of condition information and is placed in the layer below the selected extended node; and (c4) repeating steps (c3) and (c2) to build each node of the condition-enumeration tree, and taking the condition information of each node as its condition set.
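One possible reading of steps (c1) to (c4) is sketched below in Python. It is an interpretation, not the patent's own procedure: the function `build_tree`, the representation of a node as a tuple of condition numbers, and the rule that a base join node is paired only with later siblings whose last condition is in its related condition set are assumptions.

```python
def build_tree(related):
    """Enumerate tree nodes in creation order.

    related maps each condition to its related condition set (taken from
    the condition count table after deletion). A node is a tuple of
    condition numbers in ascending order.
    """
    nodes = []

    def expand(siblings):
        for i, base in enumerate(siblings):
            # Local breadth-first: create every extended node of this
            # base join node before going any deeper.
            children = []
            for other in siblings[i + 1:]:
                if other[-1] in related[base[-1]]:
                    children.append(base + (other[-1],))
            nodes.extend(children)
            # Global depth-first: descend into the new layer before
            # moving on to the next base join node.
            expand(children)

    first_layer = [(c,) for c in sorted(related)]  # (c1)
    nodes.extend(first_layer)
    expand(first_layer)                            # (c2)-(c4)
    return nodes


related = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
order = build_tree(related)
# order == [(0,), (1,), (2,), (3,),
#           (0, 1), (0, 2), (0, 1, 2), (1, 2), (2, 3)]
```

Because siblings share every condition except the last, checking only the pair formed by the two last conditions is enough to confirm that all pairs in the extended node have a maximum dimension set.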

The method of claim 1, wherein in step (c), the set of related gene bit strings of a node of the condition-enumeration tree is calculated from the condition information in the condition set of the node, using the bit strings of the condition-pair maximum dimension sets that correspond to that condition information.

The method of claim 6, wherein in step (c), the set of related gene bit strings of a node of the condition-enumeration tree is calculated by applying pairwise logical AND operations to the bit strings of the condition-pair maximum dimension sets that correspond to the condition information.
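The pairwise AND computation can be sketched in Python as follows. The helper names `cross_and` and `node_gene_sets`, the integer encoding of bit strings, and the filter that discards results with fewer than `min_genes` genes are assumptions added for illustration; non-maximal results are not removed in this sketch.

```python
from itertools import combinations


def cross_and(set_a, set_b, min_genes):
    """AND every bit string in set_a with every bit string in set_b."""
    out = set()
    for x in set_a:
        for y in set_b:
            z = x & y
            # Assumed filter: keep results covering at least min_genes genes.
            if bin(z).count("1") >= min_genes:
                out.add(z)
    return out


def node_gene_sets(conditions, mds, min_genes):
    """Related gene bit strings of a node with the given condition set.

    mds maps an ordered condition pair (a, b), a < b, to its bit strings.
    """
    current = None
    for pair in combinations(sorted(conditions), 2):
        masks = mds.get(pair, [])
        if current is None:
            current = set(masks)
        else:
            current = cross_and(current, masks, min_genes)
        if not current:
            return set()
    return current or set()


mds = {(0, 1): [0b0111, 0b1100], (0, 2): [0b0110], (1, 2): [0b1110]}
genes = node_gene_sets({0, 1, 2}, mds, min_genes=2)
# genes == {0b0110}: genes 1 and 2 are coherent under conditions 0, 1 and 2.
```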

The method of claim 6, wherein in step (c), the condition set of a node of the condition-enumeration tree has n items of condition information, and the set of related gene bit strings of the node is calculated by applying pairwise logical AND operations to: the bit strings, in the condition-pair maximum dimension sets, of the first through (n-1)th condition information; the bit strings, in the condition-pair maximum dimension sets, of the first through (n-2)th condition information together with the nth condition information; and the bit strings, in the condition-pair maximum dimension sets, of the (n-1)th condition information and the nth condition information.

The method of claim 6, wherein in step (c), the bit strings of the condition-pair maximum dimension sets that correspond to the condition information are grouped to create a signature table, the signature table having at least one signature bit string and at least one bit string set; the signature bit strings of the signature tables corresponding to the condition information are logically ANDed; and if the number of 1s in the logical AND result is greater than a minimum gene value, pairwise logical AND operations are applied to the bit strings in the corresponding bit string sets of the signature tables, so as to calculate the set of related gene bit strings of the node of the condition-enumeration tree.
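The signature-table filter can be sketched in Python as below. Several details are assumptions, since the claim does not give them: the grouping itself is taken as given input, the signature of a group is assumed to be the bitwise OR of its member bit strings, and all names (`make_signature_table`, `and_with_signatures`, `min_genes`) are hypothetical. The comparison follows the translated claim wording, "greater than"; a "greater than or equal to" reading would change only the two comparisons marked in the code.

```python
def popcount(mask):
    """Number of 1 bits, i.e. the number of genes in a bit string."""
    return bin(mask).count("1")


def make_signature_table(groups):
    """Build [(signature, members)] from pre-grouped bit strings.

    Assumption: the signature is the bitwise OR of the group's members,
    so signature_a & signature_b covers every member-level AND result.
    """
    table = []
    for members in groups:
        signature = 0
        for m in members:
            signature |= m
        table.append((signature, list(members)))
    return table


def and_with_signatures(table_a, table_b, min_genes):
    """AND two signature tables, skipping group pairs that cannot qualify."""
    results = set()
    skipped = 0
    for sig_a, members_a in table_a:
        for sig_b, members_b in table_b:
            # Cheap test on signatures first ("greater than", per the claim).
            if popcount(sig_a & sig_b) <= min_genes:
                skipped += 1
                continue
            for x in members_a:
                for y in members_b:
                    z = x & y
                    if popcount(z) > min_genes:  # same comparison as above
                        results.add(z)
    return results, skipped


table_a = make_signature_table([[0b00001111, 0b00000111], [0b11110000]])
table_b = make_signature_table([[0b00001110], [0b11000000, 0b00110000]])
results, skipped = and_with_signatures(table_a, table_b, min_genes=2)
# results == {0b00001110}; two of the four group pairs are skipped
# without touching their member bit strings.
```

The benefit of the signature test is that a single AND per group pair can rule out every member-level AND in that pair.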

The method of claim 5, wherein step (c) further includes a first limiting step which determines that, if the maximum number of items of condition information of all nodes under a node is less than the minimum condition information value, the nodes under that node need not be established.

The method of claim 5, wherein step (c) further includes a second limiting step which determines that, if all condition information of a first determining node includes all condition information of a second determining node, and the set of related gene bit strings of the first determining node is equal to the set of related gene bit strings of the second determining node, the nodes under the second determining node need not be established.

The method of claim 5, wherein step (c) further includes a third limiting step which determines that, if all condition information of a first determining node includes all condition information of all nodes under a second determining node, and the set of related gene bit strings of the first determining node is not equal to the set of related gene bit strings of the second determining node, but the set of related gene bit strings of the first determining node is equal to the set of related gene bit strings of a third determining node, and all condition information of all nodes under the second determining node includes all condition information of the third determining node, the nodes under the second determining node need not be established.