TWI413913B

TWI413913B - Method for mining subspace clusters from DNA microarray data

Info

Publication number: TWI413913B
Application number: TW98136225A
Authority: TW
Inventors: Ye In Chang; Jiun Rung Chen; Yueh Chi Tsai
Original assignee: Univ Nat Sun Yat Sen
Priority date: 2009-10-26
Filing date: 2009-10-26
Publication date: 2013-11-01
Also published as: TW201115379A

Abstract

A method for mining subspace clusters from DNA microarray data is provided. A plurality of maximum dimension sets (MDSs) are calculated according to a minimum gene value, a minimum condition value and a clustering threshold value; a frequent pattern tree (FP-Tree) is then used to calculate a plurality of conditional pattern bases (CPBs) from each gene and its corresponding MDS(s), and qualified subspace clusters are obtained from the CPBs. The complicated distribution process is thereby avoided, and the search space and search time are reduced. Furthermore, users can apply the method to cluster DNA microarray data held in biological databases on the World Wide Web (WWW) and can use the clustering results to understand the relationships between genes.

Description

Method for mining subspace clusters from DNA microarray data

The present invention relates to a method for mining subspace clusters from data and, more particularly, to a method for mining subspace clusters from DNA microarray data.

Subspace clustering of genes and experimental conditions in gene microarrays has been shown to help in understanding gene function, gene regulation, cellular processes, and cell subtypes. In most cases a disease involves multiple genes, so researchers continually try to find genes that exhibit similar expression under certain conditions, as a basis for diagnosing disease.

In the prior art, common methods that support subspace clustering in gene microarrays include pCluster and zCluster, which find subspace clusters in which certain genes behave coherently under certain conditions. However, both methods include very time-consuming steps, namely constructing the maximum dimension sets of gene pairs and distributing gene information to each node of a prefix tree. Consequently, the conventional pCluster and zCluster methods have a complicated distribution process and require a large search space and a long search time.

Therefore, it is necessary to provide an innovative and inventive method for mining subspace clusters from DNA microarray data to solve the above problems.

The present invention provides a method for mining subspace clusters from DNA microarray data, wherein the DNA microarray is an M×N array composed of M genes and N conditions for each gene, the genes have sequential gene numbers, and the conditions have sequential condition numbers. The method comprises the following steps: (a) setting a minimum gene value, a minimum condition value, and a clustering threshold; (b) calculating a plurality of condition-pair maximum dimension sets according to the minimum gene value, the minimum condition value, and the clustering threshold, and using a frequent pattern tree (FP-Tree) to calculate a plurality of conditional pattern bases from the genes and their corresponding condition-pair maximum dimension sets; and (c) calculating a plurality of qualified subspace clusters from the conditional pattern bases.

The method of the invention addresses the problem of subspace clustering in gene microarrays by using a frequent pattern tree to compute the large itemsets from the condition-pair maximum dimension sets of the genes, thereby obtaining the gene sets associated with the condition pairs. The method thus avoids the complicated distribution process, and the FP-Tree approach substantially reduces the search space and search time.

Moreover, the method can be applied to biological database analysis websites hosted on the World Wide Web (WWW). Users can perform cluster analysis on gene microarray data with the method, and research organizations (for example, biomedical technology companies that study gene microarrays) can use the clustering results to help understand the relationships between genes.

The present invention provides a method for mining subspace clusters from DNA microarray data. Referring to FIG. 1, the DNA microarray is an M×N array composed of M genes (Gene) and N conditions (Condition) for each gene. The genes have sequential gene numbers Gene 1 to Gene M, the conditions have sequential condition numbers c1 to cN, and each condition of each gene has a value. Preferably, M is between 10³ and 10⁶, and N is less than 100.

FIG. 2 is a flowchart of the method of the invention for mining subspace clusters from DNA microarray data. First, in step S21, a minimum gene value, a minimum condition value, and a clustering threshold are set. The minimum gene value is set according to the number of genes the user wants each mined subspace cluster to contain. For example, if the user wants subspace clusters containing at least 3 genes, the minimum gene value is set to 3.

In step S22, a plurality of condition-pair maximum dimension sets (MDSs) are calculated according to the minimum gene value, the minimum condition value, and the clustering threshold, and a frequent pattern tree (FP-Tree) is used to calculate a plurality of conditional pattern bases from the genes and their corresponding condition-pair maximum dimension sets.

In this embodiment, calculating the condition-pair maximum dimension sets includes the following steps: calculating, for each numbered gene, the difference between its values under every two condition numbers, where each difference is the value under the smaller condition number minus the value under the larger condition number; sorting the differences in ascending order into a difference ranking and arranging the corresponding genes in the same order; and calculating the condition-pair maximum dimension sets from the minimum gene value, the minimum condition value, the clustering threshold, the difference ranking, and the correspondingly numbered genes.

FIG. 3 shows a DNA microarray of the invention and its related parameters. FIG. 3 uses only 7 genes g₀ to g₆ and 5 conditions c₁ to c₅ as an example. In this embodiment, the minimum gene value (minG) is set to 3, the minimum condition value (minC) is set to 3, and the clustering threshold (δ) is set to 1.

First, the differences between the values of g₀ to g₆ under every two condition numbers among c₁ to c₅ are calculated. Because the differences and the condition-pair maximum dimension sets are calculated in the same way for every pair of condition numbers, only conditions c₃ and c₅ are used as an example below.

Referring to FIG. 4, calculating the conditional pattern bases includes the following steps. First, the difference (c₃ - c₅) between the values under c₃ and c₅ is calculated for each numbered gene g₀ to g₆. Next, the differences are sorted in ascending order and stored as a difference ranking (-3, -2, -2, -1, -1, 4, 5), and the corresponding genes are arranged in the same order (g₄, g₀, g₁, g₂, g₃, g₆, g₅).
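The difference-and-sort step can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `sorted_differences` is an assumption, and the expression values are hypothetical, chosen only so that the differences match those quoted above (the FIG. 3 table is not reproduced in this text).

```python
def sorted_differences(expr, cond_a, cond_b):
    """Return (gene, diff) pairs sorted ascending by
    diff = expr[gene][cond_a] - expr[gene][cond_b]."""
    diffs = [(gene, row[cond_a] - row[cond_b]) for gene, row in expr.items()]
    # sorted() is stable, so genes with equal differences keep their input order.
    return sorted(diffs, key=lambda gene_diff: gene_diff[1])

# Hypothetical values for conditions c3 and c5 only (not taken from FIG. 3).
expr = {
    "g0": {"c3": 1, "c5": 3},
    "g1": {"c3": 2, "c5": 4},
    "g2": {"c3": 3, "c5": 4},
    "g3": {"c3": 4, "c5": 5},
    "g4": {"c3": 2, "c5": 5},
    "g5": {"c3": 6, "c5": 1},
    "g6": {"c3": 5, "c5": 1},
}

ranking = sorted_differences(expr, "c3", "c5")
print([gene for gene, _ in ranking])  # gene order of the difference ranking
print([diff for _, diff in ranking])  # the sorted differences
```

In a full run the same function would be called once for every pair of conditions, always subtracting the larger-numbered condition from the smaller-numbered one.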

Next, with the clustering threshold of 1, each difference in the difference ranking is taken in turn as a base point to delimit the portions in which the difference between the maximum and minimum values is less than or equal to the clustering threshold, namely (-3, -2, -2) and (-2, -2, -1, -1), as marked by the rectangular frames, together with the corresponding genes ((g₄, g₀, g₁) and (g₀, g₁, g₂, g₃)) and the corresponding conditions (c₃, c₅). From these, the condition-pair maximum dimension sets (g₀, g₁, g₄)×(c₃, c₅) and (g₀, g₁, g₂, g₃)×(c₃, c₅) are calculated. The condition-pair maximum dimension set (g₀, g₁, g₄)×(c₃, c₅) includes a condition pair (c₃, c₅) and a maximum dimension set (g₀, g₁, g₄); the condition-pair maximum dimension set (g₀, g₁, g₂, g₃)×(c₃, c₅) includes a condition pair (c₃, c₅) and a maximum dimension set (g₀, g₁, g₂, g₃). The condition pair (c₃, c₅) consists of the two conditions c₃ and c₅ from which each difference was taken, and the maximum dimension sets (g₀, g₁, g₄) and (g₀, g₁, g₂, g₃) each contain at least 3 genes, the minimum gene value.
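The windowing step above can be sketched as a sliding window over the sorted differences. The sketch below is an assumption about one reasonable way to implement it, not the patent's code; the function name `maximal_windows` is hypothetical. It uses the difference ranking and gene order quoted in the text, with δ = 1 and minG = 3.

```python
def maximal_windows(sorted_pairs, delta, min_genes):
    """sorted_pairs: (gene, diff) pairs sorted ascending by diff.
    Return the gene sets of all maximal runs with max - min <= delta
    that contain at least min_genes genes."""
    result = []
    n = len(sorted_pairs)
    end = 0        # exclusive end of the current window
    prev_end = 0   # exclusive end of the previous window
    for start in range(n):
        end = max(end, start)
        # Extend the window as far right as the threshold allows.
        while end < n and sorted_pairs[end][1] - sorted_pairs[start][1] <= delta:
            end += 1
        # The window is maximal only if it reaches past the previous one.
        if end > prev_end and end - start >= min_genes:
            result.append(sorted(gene for gene, _ in sorted_pairs[start:end]))
        prev_end = end
    return result

genes = ["g4", "g0", "g1", "g2", "g3", "g6", "g5"]
diffs = [-3, -2, -2, -1, -1, 4, 5]
mds_c3_c5 = maximal_windows(list(zip(genes, diffs)), delta=1, min_genes=3)
print(mds_c3_c5)  # [['g0', 'g1', 'g4'], ['g0', 'g1', 'g2', 'g3']]
```

The run (4, 5) for g₆ and g₅ also satisfies the threshold but holds only two genes, so it is discarded by the minimum gene value.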

FIG. 5 illustrates the method of computing a transaction database. The left table of FIG. 5 lists all condition pairs (Condition-pair) and maximum dimension sets (MDSs) calculated from the data of FIG. 3. First, the number of occurrences of each gene in the maximum dimension sets is counted, a gene ordering is derived from the counts in descending order, and the result is recorded in the middle table of FIG. 5, where "Large 1-itemset" denotes an item consisting of a single gene and "Support" denotes the number of occurrences of each gene. Gene g₆ occurs fewer times than the minimum condition value of 3 and is therefore removed from consideration. This can be expressed as: if [expression not reproduced in the source text], the gene is removed from consideration, where the minimum condition value minC is 3 and minC equals the minimum gene value minG.

Next, a transaction database (Transaction DB, see the right table of FIG. 5) is computed from the condition pairs and the occurrence counts of the genes. The transaction database includes a plurality of transaction entries (TID: T₁ to T₁₂) with their corresponding condition pairs and a plurality of transaction maximum dimension sets (transMDSs); the genes in each transaction maximum dimension set are sorted by occurrence count in descending order. If a condition pair corresponds to several maximum dimension sets, each of those maximum dimension sets is paired with the condition pair to form several entries of condition pair and transaction maximum dimension set. For example, condition pair (c₀, c₃) corresponds to two maximum dimension sets, (g₀, g₂, g₃, g₄) and (g₁, g₃, g₄), which become the entries T₂ and T₃.

Because gene g₆ occurs fewer times than the minimum condition value of 3 and is removed from consideration, the maximum dimension set for condition pair (c₀, c₄) is taken to be (g₃, g₄, g₅) only, so transaction T₄ has condition pair (c₀, c₄) and transaction maximum dimension set (g₃, g₄, g₅). Likewise, the maximum dimension set (g₀, g₅, g₆) for (c₄, c₅) is taken to be (g₀, g₅) only, so transaction T₁₃ has condition pair (c₄, c₅) and transaction maximum dimension set (g₀, g₅).
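The support counting and transaction construction described above can be sketched as follows. This is an illustration under stated assumptions: the function name `build_transactions` is hypothetical, ties in support are broken by gene name (the text does not say how ties are resolved), and the input uses only the five maximum dimension sets quoted explicitly in this description rather than the full left table of FIG. 5. The supports, ordering, and TIDs therefore differ from those in FIG. 5.

```python
from collections import Counter

def build_transactions(mdss, min_support):
    """mdss: list of (condition_pair, gene_list).
    Count gene occurrences, drop genes below min_support, and sort the
    remaining genes of each MDS by descending support."""
    support = Counter(gene for _, genes in mdss for gene in genes)
    frequent = {gene for gene, count in support.items() if count >= min_support}

    def rank(gene):
        # Higher support first; gene name breaks ties (an assumption).
        return (-support[gene], gene)

    transactions = []
    for tid, (cond_pair, genes) in enumerate(mdss, start=1):
        kept = sorted((g for g in genes if g in frequent), key=rank)
        transactions.append((f"T{tid}", cond_pair, kept))
    return support, transactions

# Partial input: only the MDSs quoted in the text, not the whole FIG. 5 table.
mdss = [
    (("c0", "c3"), ["g0", "g2", "g3", "g4"]),
    (("c0", "c3"), ["g1", "g3", "g4"]),
    (("c3", "c5"), ["g0", "g1", "g4"]),
    (("c3", "c5"), ["g0", "g1", "g2", "g3"]),
    (("c4", "c5"), ["g0", "g5", "g6"]),
]

support, transactions = build_transactions(mdss, min_support=3)
for entry in transactions:
    print(entry)
```

Note that a condition pair with two maximum dimension sets, such as (c₀, c₃), yields two separate transactions, as in the T₂ and T₃ example.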

FIG. 6 shows the frequent pattern tree built from the data of FIG. 3. Each node records a gene and the number of times that node is traversed. For example, the first child node directly below the root node is g₃:10, which means that gene g₃ passes through that node 10 times. Referring to FIGS. 4 to 6, the frequent pattern tree (FP-Tree) is built from the gene ordering (g₄, g₀, g₁, g₂, g₃, g₅) and the transaction database. Each lowest-level node of the tree has at least one corresponding transaction condition pair. For example, the lowest-level node at the lower left of the tree is recorded as (g₂:3), which means that gene g₂ appears 3 times on the branch (g₃, g₀, g₄, g₂), and its corresponding transaction condition pairs are T₁:(c₀, c₁), T₂:(c₀, c₃), and T₇:(c₁, c₃). For a detailed description of the frequent pattern tree, see Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. Proc. ACM SIGMOD Int. Conf. Manage. Data, 1-12; it is not repeated here.
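A tree of this kind, where the last node of each inserted transaction also stores the transaction's condition pair, can be sketched as below. The class and function names (`FPNode`, `build_fp_tree`) are assumptions, and the header table of a standard FP-Tree is omitted for brevity. The input holds only T₁, T₂, and T₇ from the text plus T₄, whose gene order (g₃, g₄, g₅) is assumed, so the counts are smaller than those in FIG. 6.

```python
class FPNode:
    def __init__(self, gene=None, parent=None):
        self.gene = gene
        self.count = 0          # number of transactions passing through this node
        self.parent = parent
        self.children = {}      # gene -> FPNode
        self.cond_pairs = []    # (tid, condition pair) of transactions ending here

def build_fp_tree(transactions):
    """transactions: list of (tid, condition_pair, ordered_gene_list)."""
    root = FPNode()
    for tid, cond_pair, genes in transactions:
        node = root
        for gene in genes:
            child = node.children.get(gene)
            if child is None:
                child = FPNode(gene, node)
                node.children[gene] = child
            child.count += 1
            node = child
        # Attach the condition pair to the last node of the path.
        node.cond_pairs.append((tid, cond_pair))
    return root

transactions = [
    ("T1", ("c0", "c1"), ["g3", "g0", "g4", "g2"]),
    ("T2", ("c0", "c3"), ["g3", "g0", "g4", "g2"]),
    ("T7", ("c1", "c3"), ["g3", "g0", "g4", "g2"]),
    ("T4", ("c0", "c4"), ["g3", "g4", "g5"]),   # gene order assumed
]

root = build_fp_tree(transactions)
leaf = root.children["g3"].children["g0"].children["g4"].children["g2"]
print(leaf.gene, leaf.count, leaf.cond_pairs)
```

Here the node for g₂ at the end of the branch (g₃, g₀, g₄, g₂) has count 3 and carries the condition pairs of T₁, T₂, and T₇, matching the (g₂:3) node described above.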

Referring to FIGS. 6 and 7, the conditional pattern base (CPB) of each gene is calculated from the frequent pattern tree, the minimum gene value, and the transaction condition pairs. In this embodiment, the conditional pattern bases of the genes are calculated in the order of the gene ordering (FIG. 4). Every conditional pattern base is calculated in the same way, so only the gene item (gItem) g₄ on the leftmost branch of the tree is described below. Because the minimum gene value minG is 3, the two gene items on the path above it in the tree (g₃ and g₀) are considered. From the gene items g₃ and g₀ and the transaction condition pairs on the path below gene item g₄, namely T₁:(c₀, c₁), T₂:(c₀, c₃), and T₇:(c₁, c₃), the conditional pattern base (g₃, g₀)×(c₀, c₁)&(c₀, c₃)&(c₁, c₃) is calculated.
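The conditional pattern base of a gene item can be sketched without materialising the tree, by grouping transactions on the prefix that precedes the gene item; each distinct prefix corresponds to one tree path ending at that item. This is a simplification for illustration, not the patent's traversal. The function name is hypothetical, the rule that prefixes shorter than minG - 1 are skipped is my reading of the text, and the transactions are illustrative ones chosen to be consistent with the examples quoted in this description.

```python
from collections import defaultdict

def conditional_pattern_bases(transactions, g_item, min_genes):
    """transactions: list of (condition_pair, ordered_gene_list).
    Return {prefix_genes: [condition pairs]} for the given gene item."""
    bases = defaultdict(list)
    for cond_pair, genes in transactions:
        if g_item not in genes:
            continue
        prefix = tuple(genes[:genes.index(g_item)])
        # With g_item itself, a cluster needs at least min_genes genes,
        # so the prefix must hold at least min_genes - 1 (assumption).
        if len(prefix) >= min_genes - 1:
            bases[prefix].append(cond_pair)
    return dict(bases)

transactions = [
    (("c0", "c1"), ["g3", "g0", "g4", "g2"]),
    (("c0", "c3"), ["g3", "g0", "g4", "g2"]),
    (("c1", "c3"), ["g3", "g0", "g4", "g2"]),
    (("c0", "c5"), ["g0", "g4", "g2"]),
    (("c3", "c5"), ["g3", "g0", "g2"]),
    (("c1", "c5"), ["g3", "g0", "g2"]),
]

cpb_g4 = conditional_pattern_bases(transactions, "g4", min_genes=3)
print(cpb_g4)  # {('g3', 'g0'): [('c0', 'c1'), ('c0', 'c3'), ('c1', 'c3')]}
```

The transaction (g₀, g₄, g₂) contributes nothing for g₄ because its prefix (g₀) holds only one gene.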

FIG. 8 shows all gene items and their conditional pattern bases built from the data of FIG. 3. For example, the conditional pattern bases of gene item g₂ include (g₃, g₀, g₄)×(c₀, c₁)&(c₀, c₃)&(c₁, c₃), (g₀, g₄)×(c₀, c₅), and (g₃, g₀)×(c₃, c₅)&(c₁, c₅).

In step S23, a plurality of qualified subspace clusters are calculated from the conditional pattern bases. In this embodiment, step S23 includes: step S231, calculating a plurality of candidate sets from the genes and the conditional pattern bases, each candidate set including a candidate gene pair and a candidate condition pair; step S232, calculating a plurality of corresponding first gene groups and first condition groups from the genes and the candidate sets, then calculating a plurality of first large itemsets from the genes, the first gene groups, and the first condition groups and outputting them as first subspace clusters; step S233, calculating a plurality of corresponding second gene groups and second condition groups from the first gene groups and first condition groups, then calculating a plurality of second large itemsets from the genes, the second gene groups, and the second condition groups and outputting them as second subspace clusters, where the number of genes in each second gene group is the number of genes in the first gene group plus 1; and step S234, defining the second gene groups and second condition groups as the candidate sets of the next cycle and repeating steps S232 and S233 to calculate the first and second subspace clusters of the next cycle. In this embodiment, the number of genes in the candidate gene pair of the first gene group is the minimum gene value minus 1.

In step S232, the first gene groups and first condition groups are calculated from the genes and the candidate sets that have the same gene item. In step S233, the second condition groups are calculated from the sets that have the same first condition group, and the second gene groups are calculated from the first gene groups corresponding to those first condition groups. Step S234 further includes a screening step that deletes any first subspace cluster identical to one in the second large itemsets.

The subspace clusters of every gene item are calculated in the same way, so only the condition-pair maximum dimension sets of gene item g₂ are used as an example below. Referring to FIGS. 8, 9, and 10 and the description of step S231 above, a plurality of candidate sets (CandidateSet[2], where the value [2] is the minimum gene value minG minus 1) are calculated from the conditional pattern bases. In this embodiment, (g₃, g₀, g₄) in the conditional pattern base is split into the candidate gene pairs (g₃, g₀), (g₃, g₄), and (g₀, g₄), and each of these candidate gene pairs is paired with the candidate condition pairs (c₀, c₁), (c₀, c₃), and (c₁, c₃) to form a plurality of candidate sets. (g₀, g₄) in the conditional pattern base is itself a candidate gene pair and is paired with the candidate condition pair (c₀, c₅) to form one candidate set. (g₃, g₀) in the conditional pattern base is itself a candidate gene pair and is paired with the candidate condition pairs (c₃, c₅) and (c₁, c₅) to form two candidate sets. The candidate sets of gene items g₁, g₂, g₄, and g₅ are recorded in full in the middle table of FIG. 10.
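Candidate set generation for one gene item can be sketched with `itertools.combinations`. The function name `candidate_sets` and the dictionary layout of the conditional pattern bases are assumptions made for this illustration; the data are the conditional pattern bases of g₂ quoted in the text.

```python
from itertools import combinations

def candidate_sets(cpbs, min_genes):
    """cpbs: {prefix_genes: [condition pairs]} for one gene item.
    Split every prefix into gene groups of size min_genes - 1 and pair
    each group with every condition pair of that prefix."""
    k = min_genes - 1
    candidates = []
    for prefix, cond_pairs in cpbs.items():
        for gene_group in combinations(prefix, k):
            for cond_pair in cond_pairs:
                candidate = (gene_group, cond_pair)
                if candidate not in candidates:   # avoid duplicates
                    candidates.append(candidate)
    return candidates

# Conditional pattern bases of gene item g2, as quoted in the text.
cpb_g2 = {
    ("g3", "g0", "g4"): [("c0", "c1"), ("c0", "c3"), ("c1", "c3")],
    ("g0", "g4"): [("c0", "c5")],
    ("g3", "g0"): [("c3", "c5"), ("c1", "c5")],
}

cands_g2 = candidate_sets(cpb_g2, min_genes=3)
for gene_group, cond_pair in cands_g2:
    print(gene_group, cond_pair)
```

For g₂ this yields nine candidate sets from (g₃, g₀, g₄), one from (g₀, g₄), and two from (g₃, g₀).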

Referring again to gene item g₂ and its conditional pattern base (g₃, g₀, g₄), a plurality of first and second subspace clusters are calculated from gene g₂ and its candidate sets. Referring to FIG. 11 and the description of step S232 above, the candidate sets in rows 1 to 3 of the candidate set (CandidateSet[2]) column share the same candidate gene pair (g₃, g₀). This candidate gene pair (g₃, g₀) is defined as a first gene group, and a plurality of first large itemsets are calculated from gene item g₂, the first gene group, and the first condition group and output as first subspace clusters. The candidate condition pairs (c₀, c₁), (c₀, c₃), and (c₁, c₃) of these candidate sets are combined into a first condition group (c₀, c₁, c₃); that is, the pairwise combinations within the first condition group (c₀, c₁, c₃) are (c₀, c₁), (c₀, c₃), and (c₁, c₃). Gene item g₂, the first gene group (g₃, g₀), and the first condition group (c₀, c₁, c₃) form a first large itemset L[2]: (g₂, g₃, g₀)×(c₀, c₁, c₃). Similarly, the candidate sets in row 3 and the last two rows of the CandidateSet[2] column share the same candidate gene pair (g₃, g₀), which is defined as a first gene group. Their candidate condition pairs (c₁, c₃), (c₃, c₅), and (c₁, c₅) are combined into a first condition group (c₁, c₃, c₅); that is, the pairwise combinations within the first condition group (c₁, c₃, c₅) are (c₁, c₃), (c₃, c₅), and (c₁, c₅). Gene item g₂, the first gene group (g₃, g₀), and the first condition group (c₁, c₃, c₅) form another first large itemset L[2]: (g₂, g₃, g₀)×(c₁, c₃, c₅). After the first large itemsets L[2] are calculated, they are output as first subspace clusters.
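One way to read the combination rule above is that a first condition group is a set of minC conditions all of whose pairwise combinations appear as candidate condition pairs for the same candidate gene pair. The sketch below implements that reading; it is an interpretation, not the patent's code, the function name is hypothetical, and it only looks for condition groups of exactly `min_conds` conditions.

```python
from collections import defaultdict
from itertools import combinations

def first_large_itemsets(candidates, g_item, min_conds):
    """candidates: list of (gene_group, condition_pair) for one gene item.
    Return ((g_item, *gene_group), condition_group) for every group of
    min_conds conditions whose pairwise combinations are all present."""
    pairs_by_group = defaultdict(set)
    for gene_group, cond_pair in candidates:
        pairs_by_group[gene_group].add(frozenset(cond_pair))

    itemsets = []
    for gene_group, pairs in pairs_by_group.items():
        conditions = sorted(set().union(*pairs))
        for cond_group in combinations(conditions, min_conds):
            if all(frozenset(p) in pairs for p in combinations(cond_group, 2)):
                itemsets.append(((g_item,) + gene_group, cond_group))
    return itemsets

# Candidate sets of gene item g2, as described in the text.
triple = [("c0", "c1"), ("c0", "c3"), ("c1", "c3")]
candidates = [(group, pair)
              for group in [("g3", "g0"), ("g3", "g4"), ("g0", "g4")]
              for pair in triple]
candidates += [(("g0", "g4"), ("c0", "c5")),
               (("g3", "g0"), ("c3", "c5")),
               (("g3", "g0"), ("c1", "c5"))]

l2 = first_large_itemsets(candidates, "g2", min_conds=3)
for genes, conds in l2:
    print(genes, conds)
```

For the gene pair (g₃, g₀) this finds the two condition groups (c₀, c₁, c₃) and (c₁, c₃, c₅) named in the text; (c₀, c₁, c₅) is rejected because (c₀, c₅) is not a candidate condition pair for (g₃, g₀).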

Referring to FIG. 12 and the description of step S233 above, for the subset (g₃, g₀) of (g₃, g₀, g₄) in the conditional pattern base, the first gene group and first condition groups are (g₃, g₀)×(c₀, c₁, c₃) and (g₃, g₀)×(c₁, c₃, c₅); for the subset (g₃, g₄), they are (g₃, g₄)×(c₀, c₁, c₃); and for the subset (g₀, g₄), they are (g₀, g₄)×(c₀, c₁, c₃). Here, (g₃, g₀)×(c₀, c₁, c₃), (g₃, g₄)×(c₀, c₁, c₃), and (g₀, g₄)×(c₀, c₁, c₃) all have the first condition group (c₀, c₁, c₃); that is, these three first condition groups together form a second condition group (c₀, c₁, c₃). Next, the first gene groups (g₃, g₀), (g₃, g₄), and (g₀, g₄) corresponding to these three first condition groups are combined into a gene group set (g₃, g₀, g₄), the second gene group. The combination of the second gene group (g₃, g₀, g₄) and the second condition group (c₀, c₁, c₃) is denoted C[3]: (g₃, g₀, g₄)×(c₀, c₁, c₃). The second gene group (g₃, g₀, g₄), the second condition group (c₀, c₁, c₃), and gene item g₂ are combined into a second large itemset L[3], which is then output as the second subspace cluster (g₀, g₂, g₃, g₄)×(c₀, c₁, c₃).
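The join from first gene groups to a second gene group can be sketched as follows. The requirement that every k-gene subset of the larger group be present is an Apriori-style assumption on my part; the text only states that first gene groups sharing a first condition group are combined into a group with one more gene. The function name `second_level` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def second_level(first_groups, g_item):
    """first_groups: list of (gene_group, condition_group), all gene groups
    of the same size k. Join groups that share a condition group into
    (k + 1)-gene groups and add g_item to form the output cluster."""
    by_conds = defaultdict(set)
    for gene_group, cond_group in first_groups:
        by_conds[cond_group].add(frozenset(gene_group))

    clusters = []
    for cond_group, groups in by_conds.items():
        k = len(next(iter(groups)))
        genes = sorted(set().union(*groups))
        for bigger in combinations(genes, k + 1):
            # Keep the larger group only if all its k-subsets were found.
            if all(frozenset(sub) in groups for sub in combinations(bigger, k)):
                clusters.append((tuple(sorted(bigger + (g_item,))), cond_group))
    return clusters

# First gene groups and condition groups for gene item g2, from the text.
first_groups = [
    (("g3", "g0"), ("c0", "c1", "c3")),
    (("g3", "g0"), ("c1", "c3", "c5")),
    (("g3", "g4"), ("c0", "c1", "c3")),
    (("g0", "g4"), ("c0", "c1", "c3")),
]

l3 = second_level(first_groups, "g2")
print(l3)  # [(('g0', 'g2', 'g3', 'g4'), ('c0', 'c1', 'c3'))]
```

The condition group (c₁, c₃, c₅) has only one gene pair, (g₃, g₀), so it cannot be extended and produces no second subspace cluster.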

The candidate sets, first subspace clusters, and second subspace clusters of the other gene items (for example, g₁, g₄, g₅) are calculated in the same way as those of gene item g₂ and are not described again here.

The calculated second gene groups and second condition groups are defined as the candidate gene pairs and candidate condition pairs of the next stage, and the above steps of calculating first and second subspace clusters are repeated until no further first large itemset (L[2+1], L[2+2], L[2+3], ...) or combination of second gene group and second condition group (C[3+1], C[3+2], C[3+3], ...) can be formed, which completes the method of the invention. The method addresses the problem of subspace clustering in gene microarrays with a clustering algorithm based on large itemsets: a frequent pattern tree (FP-Tree) is used to compute the large itemsets from the condition-pair maximum dimension sets of the genes, thereby obtaining the gene sets associated with the condition pairs. The method thus avoids the complicated distribution process, and the FP-Tree approach substantially reduces the search space and search time.

The above embodiments merely illustrate the principles and effects of the invention and do not limit it. Modifications and variations made to the above embodiments by those skilled in the art therefore do not depart from the spirit of the invention. The scope of the invention is as set forth in the claims below.

FIG. 1 is a schematic diagram of a DNA microarray;

FIG. 2 is a flowchart of the method of the invention for mining subspace clusters from DNA microarray data;

FIG. 3 is a schematic diagram of a DNA microarray of the invention and its related parameters;

FIG. 4 is a schematic diagram of the process of calculating the conditional pattern bases;

FIG. 5 is a schematic diagram of the method of computing a transaction database;

FIGS. 6 and 7 are schematic diagrams of calculating the conditional pattern base of each gene;

FIG. 8 is a schematic diagram of all gene items and their conditional pattern bases built from the data of FIG. 3;

FIGS. 9 and 10 are schematic diagrams of calculating the candidate sets;

FIG. 11 is a schematic diagram of calculating the first subspace clusters; and

FIG. 12 is a schematic diagram of calculating the second subspace clusters.

(無元件符號說明)(no component symbol description)

Claims

A method for mining subspace clusters from DNA microarray data, wherein the DNA microarray is an M×N array composed of M genes and N items of condition information for each gene, the genes have sequential gene numbers, and the condition information has sequential condition numbers, the method comprising the steps of: (a) setting a minimum gene value, a minimum condition information value, and a clustering threshold; (b) calculating a plurality of condition-pair maximum dimension sets according to the minimum gene value, the minimum condition information value, and the clustering threshold, and using a frequent pattern tree (FP-Tree) to calculate a plurality of conditional pattern bases according to the genes and the corresponding condition-pair maximum dimension sets; and (c) calculating a plurality of qualified subspace clusters according to the conditional pattern bases.

The method of claim 1, wherein in step (a), M is between 10³ and 10⁶ and N is less than 100.

The method of claim 1, wherein calculating the condition-pair maximum dimension sets in step (b) comprises the steps of: (b1) calculating, for each numbered gene, the difference in condition information under every two condition numbers, wherein each difference is the difference between the condition information of the smaller condition number and the condition information of the larger condition number; (b2) sorting the differences from smallest to largest to form a difference ranking, and arranging the corresponding genes according to the difference ranking; and (b3) calculating the condition-pair maximum dimension sets according to the minimum gene value, the minimum condition information value, the clustering threshold, the difference ranking, and the genes of the corresponding gene numbers.

The method of claim 3, wherein in step (b3), the genes and corresponding condition information for which the difference between the maximum value and the minimum value in the difference ranking is less than or equal to the clustering threshold are identified according to the clustering threshold, so as to calculate the plurality of condition-pair maximum dimension sets, wherein each condition-pair maximum dimension set includes a condition pair and a maximum dimension set, each condition pair consists of the two items of condition information corresponding to each difference, and each maximum dimension set contains at least as many genes as the minimum gene value.
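
Steps (b1) to (b3) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `condition_pair_mds`, the 0-based condition and gene numbers, and the reading of each difference as "value at the smaller condition number minus value at the larger condition number" are assumptions made for illustration, and pruning by the minimum condition information value is omitted.

```python
from itertools import combinations


def condition_pair_mds(data, min_genes, delta):
    """Return {(a, b): [gene lists]} for every condition pair (a, b), a < b.

    data      -- M x N list of lists (rows are genes, columns are conditions)
    min_genes -- minimum gene value (smallest allowed gene set size)
    delta     -- clustering threshold (non-negative)
    """
    n_conditions = len(data[0])
    result = {}
    for a, b in combinations(range(n_conditions), 2):
        # (b1) difference of every gene between the two conditions
        # (b2) sort the differences, keeping the gene number alongside
        diffs = sorted((row[a] - row[b], g) for g, row in enumerate(data))
        windows = []
        start = 0
        for end in range(len(diffs)):
            # shrink from the left until max - min <= delta
            while diffs[end][0] - diffs[start][0] > delta:
                start += 1
            # (b3) keep the window only if it cannot be extended to the right
            is_last = end == len(diffs) - 1
            if is_last or diffs[end + 1][0] - diffs[start][0] > delta:
                if end - start + 1 >= min_genes:
                    windows.append(sorted(g for _, g in diffs[start:end + 1]))
        if windows:
            result[(a, b)] = windows
    return result


# Hypothetical 4-gene x 3-condition array, used only for illustration.
example = [
    [10, 20, 30],
    [12, 22, 31],
    [15, 24, 90],
    [50, 10, 32],
]
print(condition_pair_mds(example, min_genes=2, delta=2))
```

Sorting the differences first means each maximal gene set is a contiguous window of the ranking, so one left-to-right pass per condition pair is enough.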

The method of claim 4, wherein calculating the conditional pattern bases after step (b3) comprises the steps of: (b31) calculating a gene ranking according to the number of occurrences of each gene in the maximum dimension sets; (b32) calculating a transaction database according to the condition pairs and the number of occurrences of each gene, the transaction database including a plurality of transaction ranks, corresponding transaction condition pairs, and a plurality of transaction maximum dimension sets; (b33) building the frequent pattern tree according to the gene ranking and the transaction database, each lowest-level node of the frequent pattern tree having at least one corresponding transaction condition pair; and (b34) calculating the conditional pattern base of each gene according to the frequent pattern tree, the minimum gene value, and the transaction condition pairs.

The method of claim 5, wherein in step (b32), the genes in each transaction maximum dimension set are sorted by number of occurrences, from most to fewest.

The method of claim 5, wherein in step (b34), the conditional pattern bases of the respective genes are calculated in the order of the gene ranking.
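
The tree construction of steps (b31) to (b34) can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed procedure: the names `Node`, `build_fp_tree`, `subtree_pairs` and `conditional_pattern_base` are hypothetical, ties in the gene ranking are broken by gene identifier, the condition pair of each transaction is stored on the last node of its path, and filtering by the minimum gene value is omitted.

```python
from collections import Counter


class Node:
    """One FP-tree node: a gene, its parent, its children, and the
    condition pairs of the transactions that end at this node."""

    def __init__(self, gene, parent):
        self.gene = gene
        self.parent = parent
        self.children = {}
        self.pairs = []


def build_fp_tree(transactions):
    """transactions: {condition_pair: iterable of genes}."""
    # (b31) rank genes by number of occurrences, most frequent first
    counts = Counter(g for genes in transactions.values() for g in genes)
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    rank = {g: i for i, g in enumerate(ordered)}
    root = Node(None, None)
    nodes_by_gene = {}
    for pair, genes in transactions.items():
        node = root
        # (b32) genes of each transaction follow the gene ranking
        for g in sorted(genes, key=rank.get):
            if g not in node.children:
                child = Node(g, node)
                node.children[g] = child
                nodes_by_gene.setdefault(g, []).append(child)
            node = node.children[g]
        # (b33) the last node of the path carries the condition pair
        node.pairs.append(pair)
    return root, nodes_by_gene


def subtree_pairs(node):
    """Collect the condition pairs stored at or below a node."""
    found = list(node.pairs)
    for child in node.children.values():
        found.extend(subtree_pairs(child))
    return found


def conditional_pattern_base(gene, nodes_by_gene):
    """(b34) Return [(prefix path, condition pairs)] for one gene."""
    base = []
    for node in nodes_by_gene.get(gene, []):
        prefix = []
        parent = node.parent
        while parent is not None and parent.gene is not None:
            prefix.append(parent.gene)
            parent = parent.parent
        prefix.reverse()
        base.append((prefix, sorted(subtree_pairs(node))))
    return base


# Hypothetical transaction database: condition pair -> genes of its
# maximum dimension set.
transactions = {
    "c1c2": ["g1", "g2", "g3"],
    "c1c3": ["g1", "g3"],
    "c2c3": ["g2", "g3"],
}
root, nodes_by_gene = build_fp_tree(transactions)
print(conditional_pattern_base("g2", nodes_by_gene))
```

Because every path is ordered by the same gene ranking, transactions that share frequent genes share a prefix in the tree, which is what keeps the search space small.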

The method of claim 5, wherein step (c) comprises the steps of: (c1) calculating a plurality of candidate sets according to the genes and the conditional pattern bases, each candidate set comprising a candidate gene pair and a candidate condition pair; (c2) calculating a plurality of corresponding first gene groups and first condition groups according to the genes and the candidate sets, calculating a plurality of first large itemsets according to the genes, the corresponding first gene groups, and the first condition groups, and outputting first subspace clusters; (c3) calculating a plurality of corresponding second gene groups and second condition groups according to the corresponding first gene groups and first condition groups, calculating a plurality of second large itemsets according to the genes, the corresponding second gene groups, and the second condition groups, and outputting second subspace clusters, wherein the number of genes in each second gene group is the number of genes in the first gene group plus one; and (c4) defining the corresponding second gene groups and second condition groups as the candidate sets of the next cycle, repeating steps (c2) and (c3), and calculating the first subspace clusters and second subspace clusters of the next cycle.

The method of claim 8, wherein in the step (c1), the number of genes of the candidate gene pair is the minimum gene value minus one.

The method of claim 8, wherein in step (c2), the corresponding first gene groups and first condition groups are calculated according to the genes and the candidate sets having the same gene items.

The method of claim 8, wherein in step (c3), the second condition groups are calculated from the sets having the same first condition group, and the second gene groups are calculated from the first gene groups corresponding to those first condition groups.

The method of claim 8, wherein step (c4) further includes a screening step of deleting any first subspace cluster that is the same as a second large itemset.
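
One plausible reading of a "qualified" subspace cluster, based only on the parameters of step (a) and the threshold test of step (b3), can be written as a small check. The function name `is_qualified_cluster` and the 0-based indices are assumptions for illustration; the claims do not spell out this test in this form.

```python
from itertools import combinations


def is_qualified_cluster(data, genes, conditions,
                         min_genes, min_conditions, delta):
    """Check a (genes, conditions) submatrix against the three parameters.

    The cluster passes if it has enough genes, enough conditions, and,
    for every pair of its conditions, the spread (max - min) of the
    per-gene differences does not exceed the clustering threshold.
    """
    if len(genes) < min_genes or len(conditions) < min_conditions:
        return False
    for a, b in combinations(sorted(conditions), 2):
        diffs = [data[g][a] - data[g][b] for g in genes]
        if max(diffs) - min(diffs) > delta:
            return False
    return True


# Hypothetical 4-gene x 3-condition array, used only for illustration.
example = [
    [10, 20, 30],
    [12, 22, 31],
    [15, 24, 90],
    [50, 10, 32],
]
print(is_qualified_cluster(example, [0, 1], [0, 1, 2], 2, 3, 2))
```

A check of this kind is useful for validating the output of the level-wise steps (c1) to (c4) on small inputs, since it tests each reported cluster directly against the original array.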