CN106980775A

CN106980775A - Method for mining time-series gene chip data based on all contiguous column coherent evolution patterns

Info

Publication number: CN106980775A
Application number: CN201710190092.8A
Authority: CN
Inventors: 邝秋华; 陈鑫; 刘志煌; 薛云; 蔡倩华
Original assignee: South China Normal University
Current assignee: Tianjin Jinyu Medical Laboratory Co.,Ltd.
Priority date: 2017-03-27
Filing date: 2017-03-27
Publication date: 2017-07-25
Anticipated expiration: 2037-03-27
Also published as: CN106980775B

Abstract

The invention discloses a method for mining time-series gene chip data based on all contiguous column coherent evolution patterns. The method comprises: (1) inputting time-series gene chip data and preprocessing it, converting the original matrix into a difference matrix; (2) initializing a bicluster using a measure based on all contiguous column coherent evolution patterns; (3) iteratively updating the bicluster by adding and deleting rows and columns, evaluating it with the same measure; and (4) outputting the bicluster subject to row and column thresholds. The method takes adjacent time points into account and captures the information in all contiguous column coherent evolution patterns. It can thereby learn how gene expression changes over time, and from that the associations and regulatory relationships between genes, and it is faster and more efficient than existing methods.

Description

Method for mining time-series gene chip data based on all contiguous column coherent evolution patterns

Technical field

The invention belongs to the field of computer algorithms and data mining, and in particular relates to a method, based on all contiguous column coherent evolution patterns, for mining time-series gene chip expression data.

Background technology

In bioinformatics, the rapid development of science and technology and the continuous progress of the life sciences have made the use of data mining to analyze biological information a current and future trend. Bioinformatic analysis centers on two fields, genomics and proteomics, which take nucleotide sequences and protein sequences respectively as their starting point and high-throughput analysis as their technique, and study the biological meaning contained in the sequences (Campbell & Heyer, 2003). The launch of the Human Genome Project advanced bioinformatics, and with the completion of genome sequencing the post-genome era has fully begun (Masters & Lakhani, 2000).

One kind of gene expression data matrix is the time-series gene expression data matrix. Its data are time-series data, so time is a factor in the data. The values in the matrix are obtained by measuring gene expression at different times, and they show how expression values differ over time (Bar-Joseph, 2004). Measuring the degree of gene expression over consecutive time points makes it possible to learn how genes change over time, and from that the associations and regulatory relationships between genes (Korenberg, 2007). Co-regulated genes may share the same expression pattern over some continuous time period, and this period may be closely related to a stage of some cellular process (Zhang, Zha, 2005). Time-series gene expression data tend to provide knowledge about the co-regulation of physiological processes and can reflect the relationship between biological processes and time, so they play an important role in analyzing gene regulatory networks and dynamic biological processes (Bar-Joseph, 2004; F. Liu & Wang, 2010).

In a time-series gene expression data matrix, rows represent genes and columns represent experimental time points in chronological order. For such data, regularities in gene expression levels across adjacent time points are more meaningful. However, the biclusters mined by many biclustering models do not consist of consecutive time points, so these models handle time-series data poorly. As research has progressed, some biclustering methods have been designed around the characteristics of time-series data; these can analyze time-series gene expression matrices and discover biological knowledge over continuous time periods.

Clustering is a classical method in data mining, and an important part of current research on gene expression data uses clustering methods such as k-means (J. A. Hartigan & Wong, 1979), hierarchical clustering (Cameron, Middleton, Chenn, & Olson, 2012), and self-organizing neural networks (Kolehmainen, Wong, & Castrén, 1999). These methods have solved many practical biological problems. Clustering considers similarity across all dimensions of the data matrix and partitions the genes into several disjoint sets. After clustering, genes in the same set have similar expression profiles, while genes in different sets have dissimilar ones.

Applying traditional clustering to gene expression data also has weaknesses. First, clustering considers all values along one dimension: it either considers only rows and forms row clusters, or only columns and forms column clusters, so the resulting clusters reflect global information. In studies of cellular processes, however, some genes may respond strongly only under certain environmental conditions rather than under all of them. Second, traditional clustering generally partitions samples into disjoint sets, so a sample or gene belongs to at most one cluster and cannot belong to several clusters at once (Tanay, Sharan, & Shamir, 2002). In actual genetic research, however, a gene is quite likely to participate in several biological processes, that is, to be a member of several different clusters at the same time.

Because of these limitations, traditional clustering has difficulty mining local expression patterns in a data matrix and uncovering hidden complex relationships between genes (Cheung, Cheung, Kao, Yip, & Ng, 2006). Biclustering can solve this problem.

To overcome the shortcomings of clustering, biclustering emerged and has been widely adopted in gene data research (Eren, Deveci, & 2013; Flores, Inza, & Calvo, 2013; Sara C Madeira & Arlindo L Oliveira, 2004; Nepomuceno, Troncoso, & Aguilar-Ruiz, 2011). Unlike traditional clustering, biclustering considers the gene dimension and the experimental-condition dimension of gene data simultaneously. By analyzing both dimensions at once, it can mine local information in the data, namely groups of genes that show similar expression only under certain conditions. Biclustering is also more flexible: a gene set or condition set can be contained in several clusters at once, so different clusters may overlap.

Cheng and Church (Cheng & Church, 2000) were the first to apply biclustering to gene expression data. They proposed the mean squared residue as a constraint on biclusters, together with the heuristic CC method for mining them. The CC method computes the mean squared residue of a submatrix and repeatedly deletes rows and columns until it arrives at a bicluster with a small mean squared residue. Their method can find some biological information in gene expression data, but it has difficulty analyzing time-series gene expression data, because such data carry intrinsic temporal information and the method cannot discover relationships within the time sequence.

Zhang et al. (Zhang et al., 2005) improved the CC method and proposed CC-TSB, a method that can analyze time-series data. The main idea of CC-TSB biclustering is similar to CC; the main difference is that CC-TSB restricts column operations so that the columns of a bicluster are always contiguous: columns are added or removed only at the head or tail of the column range. The method can therefore find biclusters that carry some temporal information. However, the measure CC-TSB uses to constrain submatrices is still the mean squared residue proposed by Cheng and Church. The mean squared residue depends on the absolute magnitude of gene expression values, which are affected by noise and scale, so CC-TSB is not sufficiently robust to noisy data. In addition, the mean squared residue ignores the internal order of the sequence, so the method cannot measure the characteristics of time-series gene expression data well.

When continuous time is taken into account, focusing on the coherent evolution trend between adjacent time points is more meaningful than focusing on the magnitude of the actual values (Sara C Madeira & Arlindo L Oliveira, 2004). Sara C. Madeira et al. (Madeira & Oliveira, 2005) proposed contiguous coherent columns (CCC). CCC restricts patterns to contiguous columns and is used to find all maximal contiguous column coherent biclusters. The CCC method first converts the original data into a character sequence representing rises and falls, then uses a suffix tree data structure to improve efficiency, and finally mines the maximal contiguous column coherent biclusters.

Because the CCC method does not consider the influence of noise, Sara C. Madeira et al. later improved it to take noise into account and proposed the e-CCC biclustering method (Madeira & Oliveira, 2007). e-CCC is similar to CCC, except that it tolerates a certain amount of error: patterns whose error is below a predetermined threshold are regarded as the same pattern. Experimental results show that e-CCC is more robust to noise and that the biclusters it finds are more enriched in biological information.

Both CCC and e-CCC take into account an important feature of time-series gene data, its temporal order. They focus on the trend between two adjacent time points and on the relative rather than absolute magnitude of gene expression values, so the models are more robust to noise. However, CCC considers only the locally longest pattern and loses the information in contiguous column coherent evolution patterns of other lengths, such as the second and third longest. In addition, the existing methods for finding contiguous column coherent evolution patterns, CCC and e-CCC, are both based on suffix-tree string processing, which has high space complexity and makes large-scale data difficult to handle.

Existing methods have the following deficiencies:

(1) Existing methods do not take information about adjacent time points into account. Time-series data have a chronological order, with a sequential relationship across continuous time, and these methods cannot measure the similarity of such data well. The problem of mining biclusters that account for adjacent-time information therefore needs to be solved.

(2) Existing methods consider only the locally longest pattern and lose the information in contiguous column coherent evolution patterns of other lengths, such as the second and third longest. The problem of capturing the information in all contiguous column coherent evolution patterns between two sequences of time-series gene expression data therefore needs to be solved.

(3) Existing methods are inefficient and consume considerable time and resources when mining biclusters. This is also one of the technical problems to be solved by the invention.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by proposing a biclustering method for mining gene chip expression data based on contiguous column coherent evolution. The original matrix is first converted into a difference matrix; a new measure of bicluster quality is then proposed; and the bicluster is obtained iteratively by adding and deleting rows and columns. The method takes adjacent time points into account, captures the information in all contiguous column coherent evolution patterns, and is faster and more efficient than existing methods. The specific technical solution is as follows.

A method for mining time-series gene chip data based on all contiguous column coherent evolution patterns, comprising the following steps:

(1) inputting time-series gene chip data and preprocessing the data;

(2) initializing a bicluster from the data obtained in step (1), using a bicluster measure based on all contiguous column coherent evolution patterns;

(3) iteratively updating the bicluster by adding and deleting rows and columns, using the bicluster measure based on all contiguous column coherent evolution patterns;

(4) outputting the bicluster subject to row and column thresholds, thereby completing the mining of the time-series gene chip data.

Further, step (1) is specifically: in the data initialization stage, the original matrix is first converted into a matrix that records the trend between every two adjacent time points, reflecting how the expression value of each gene changes between two adjacent time points.

Further, in steps (2) and (3), the bicluster measure based on all contiguous column coherent evolution patterns is computed as follows: for each row in the bicluster, count the contiguous column coherent evolution patterns it shares with the bicluster core; sum these counts; then divide the sum by the number of columns of the bicluster. The resulting value, which measures the bicluster, is denoted perACCC.

Further, in steps (2) and (3), the number of contiguous column coherent evolution patterns between a row of the bicluster and the bicluster core is computed as follows: for the two rows being compared, every run of contiguous columns over which the two rows' values follow the same trend counts as one pattern; the total number of such patterns is denoted ACCC.

Further, step (2) is specifically: first, select the bicluster core, that is, randomly select one row of the data matrix and then randomly select a run of contiguous columns as the column set of the bicluster; the values of the row on these contiguous columns form the bicluster core. Second, obtain the row set of the bicluster, that is, randomly select some rows of the data matrix as the initial row set, compute the ACCC between each row of the initial row set and the bicluster core obtained in the first step, and save the rows whose ACCC is greater than the set value to the row set of the bicluster. Third, update the bicluster core, that is, for each column of the bicluster, compute the mode of the column over the row set of the bicluster and use it as the value of the bicluster core in that column, giving the updated bicluster core.

Further, step (3) is specifically: first, compute the perACCC value of the given bicluster; second, update the columns by addition and deletion; third, update the rows by scanning the data matrix and adding rows that satisfy the condition; fourth, update the bicluster core by taking the mode, in the same way as the third step of step (2); fifth, compute the perACCC value of the updated bicluster from the updated row set, contiguous columns and bicluster core; sixth, compute the rate of change of the perACCC value before and after the update and compare it with a preset threshold to decide whether to continue iterating.

Further, in step (3), the columns are updated by addition and deletion. The conditions for adding and deleting contiguous columns are as follows:

Column addition condition: if the perACCC value increases after a column is added, the addition of that column is confirmed.

Contiguous columns are first added by expanding to the right; when rightward expansion ends, columns are added by expanding to the left until that expansion also ends.

Column deletion condition: if the perACCC value increases after a column is deleted, the deletion of that column is confirmed.

Deletion first proceeds rightward, starting from the leftmost column, until it stops; it then proceeds leftward, starting from the rightmost column, until it stops.

If the contiguous columns were expanded to the right by adding columns, the columns on the right are not considered for deletion; if they were expanded to the left by adding columns, the columns on the left are not considered for deletion.

Further, in step (3), the rows are updated by scanning the data matrix and adding rows that satisfy the condition. The row addition condition is as follows:

Row update condition: if, after a row is added, the perACCC value is greater than or equal to the perACCC value of the bicluster before the row was added, the addition of that row is confirmed; otherwise the row is not added.

The perACCC value of the bicluster is computed first. The data matrix is then scanned once: for each row, the ACCC value between the row and the bicluster core is computed and divided by the number of contiguous columns of the bicluster core, and every row for which this value is greater than or equal to the perACCC value is added to the row set, thereby updating the row set.
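The row scan described above can be sketched as follows. This is only an illustration under stated assumptions: the names `accc` and `add_rows` are hypothetical, the matrix is an invented 0/1 difference matrix rather than data from the patent, and the current perACCC value is passed in as `threshold` instead of being recomputed inside the function.

```python
from typing import Dict, List, Sequence


def accc(row_a: Sequence[int], row_b: Sequence[int]) -> int:
    """Count the contiguous column segments on which the two rows agree."""
    total = run = 0
    for a, b in zip(row_a, row_b):
        run = run + 1 if a == b else 0   # length of the agreeing run ending here
        total += run                     # that many agreeing segments end at this column
    return total


def add_rows(matrix: Dict[str, Sequence[int]], row_set: Sequence[str],
             cols: Sequence[int], core: Sequence[int],
             threshold: float) -> List[str]:
    """Scan the matrix once and add every row whose ACCC / len(cols) reaches threshold."""
    core_vals = [core[c] for c in cols]
    updated = list(row_set)
    for name, row in matrix.items():
        if name in updated:
            continue
        value = accc([row[c] for c in cols], core_vals) / len(cols)
        if value >= threshold:
            updated.append(name)
    return updated


# Invented toy data (hypothetical gene names g1..g4), zero-based column indices.
toy = {
    "g1": [0, 0, 1, 1, 0],
    "g2": [1, 1, 1, 0, 0],
    "g3": [1, 0, 0, 1, 1],
    "g4": [1, 0, 1, 1, 1],
}
result = add_rows(toy, ["g1", "g2"], cols=[1, 2, 3], core=[1, 0, 1, 1, 0], threshold=1.0)
print(result)  # ['g1', 'g2', 'g4']
```

In the toy run, g3 agrees with the core on two isolated columns (ACCC 2, ratio about 0.67) and is skipped, while g4 agrees on all three columns (ACCC 6, ratio 2.0) and is added.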

Further, whether to continue iterating in step (3) is decided as follows: a rate-of-change threshold is set; the perACCC value computed before the rows, columns and core are updated is denoted pA0, and the perACCC value computed after the update is denoted pA1; if the rate of change from pA0 to pA1 is less than the preset threshold, the iterative update of the bicluster stops and the method proceeds to step (4).
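A stopping test of this kind might look like the sketch below. The text does not define how the rate of change is calculated, so the relative change |pA1 - pA0| / |pA0| used here is an assumption, as are the function name `should_stop` and the threshold value 0.05.

```python
def should_stop(pa0: float, pa1: float, rate_threshold: float) -> bool:
    """Return True when the change from pa0 to pa1 is below rate_threshold.

    Assumption: the rate of change is the relative change |pa1 - pa0| / |pa0|.
    """
    if pa0 == 0:
        # Avoid dividing by zero; treat "no change from zero" as converged.
        return pa1 == 0
    return abs(pa1 - pa0) / abs(pa0) < rate_threshold


print(should_stop(1.42, 1.44, 0.05))  # True: the value changed by about 1.4%
print(should_stop(0.83, 1.42, 0.05))  # False: the value changed by about 71%
```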

Further, step (4) is specifically: a pair of row and column thresholds is set; if the bicluster does not satisfy the preset thresholds, the bicluster is re-initialized by returning to step (2); if it satisfies them, the biclustering result is obtained and the mining of the time-series gene chip data is complete.

Compared with the prior art, the invention has the following substantive features and notable advantages:

The invention analyzes time-series gene chip expression data with a contiguous column coherent evolution model and finds submatrices that satisfy the bicluster measure proposed by the invention. It takes the time factor in the data into account and can learn how gene expression changes over time, and from that the associations and regulatory relationships between genes. Co-regulated genes share the same expression pattern over some continuous time period, and this period is closely related to a stage of some cellular process; the method can thus provide knowledge about the co-regulation of physiological processes, which is important for analyzing gene regulatory networks and dynamic biological processes. In addition, the invention is highly efficient and completes bicluster mining in a relatively short time with relatively few resources.

Brief description of the drawings

Fig. 1 is a flow chart of the time-series gene chip data mining method based on all contiguous column coherent evolution patterns.

Fig. 2 is a flow chart of an example of initializing a bicluster.

Fig. 3 is a flow chart of an example of updating the column set of a bicluster.

Fig. 4 is a flow chart of an example of updating the row set of a bicluster.

Fig. 5 is a flow chart of an example of iterative updating and the stopping test.

Fig. 6 is a line chart of the mined biclusters.

Fig. 7 is a bar chart of the enrichment analysis on yeast chip data.

Fig. 8 is a line chart of running time against the number of rows in the data matrix.

Fig. 9 is a line chart of running time against the number of columns in the data matrix.

Embodiment

Embodiments of the invention are further described below with reference to the accompanying drawings, but the implementation of the invention is not limited to them. It should be emphasized that any symbol or operation not described in detail below can be realized by a person skilled in the art with reference to the prior art.

As shown in Fig. 1, the biclustering method of this example for mining gene chip expression data based on contiguous column coherent evolution comprises the following:

1. Input the time-series gene chip data and preprocess the data.

In the data initialization stage, the original matrix is first converted into a matrix that records the trend between every two adjacent time points, reflecting how the expression value of each gene changes between two adjacent time points.

In this example, the trend is reduced to two cases: increase or no change, encoded as 1, and decrease, encoded as 0. The original matrix is thereby converted into a matrix with one column fewer than the original, called the difference matrix.

The original matrix is shown in Table 1, where the row labels a, b, c, d denote different genes and the column labels t1, t2, t3, t4, t5, t6 denote time points. The difference matrix obtained from Table 1 by preprocessing is shown in Table 2, where the row labels a, b, c, d denote different genes and the column labels 1, 2, 3, 4, 5 denote the trends from t1 to t2, t2 to t3, t3 to t4, t4 to t5, and t5 to t6 respectively. For example, the expression values of row a in Table 1 at t1 and t2 are 0.55 and 0.19; because 0.19 < 0.55, the first column of row a in Table 2 is 0. Similarly, the next value of row a in Table 1 is 0.83; because 0.83 > 0.19, the second column of row a is 1.

Table 1. Original time-series data before conversion.

Table 2. Difference matrix data after conversion.
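The 0/1 encoding of adjacent time points can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function name `to_difference_matrix` is hypothetical, and the only input used is the three expression values of row a that the text quotes (0.55, 0.19, 0.83).

```python
from typing import List, Sequence


def to_difference_matrix(expr: Sequence[Sequence[float]]) -> List[List[int]]:
    """Encode each adjacent pair of time points as 1 (increase or unchanged) or 0 (decrease)."""
    diff = []
    for row in expr:
        # Pair every value with its successor; the result has one column fewer than the input.
        diff.append([1 if later >= earlier else 0
                     for earlier, later in zip(row, row[1:])])
    return diff


# First three values of row a as quoted in the text; the remaining time points are omitted.
print(to_difference_matrix([[0.55, 0.19, 0.83]]))  # [[0, 1]]
```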

2. Initialize a bicluster from the data obtained in step (1), using the bicluster measure based on all contiguous column coherent evolution patterns.

The bicluster measure based on all contiguous column coherent evolution patterns is computed as follows: for each row in the bicluster, count the contiguous column coherent evolution patterns it shares with the bicluster core; sum these counts; then divide the sum by the number of columns of the bicluster. The resulting value, which measures the bicluster, is denoted perACCC.

The number of contiguous column coherent evolution patterns between a row of the bicluster and the bicluster core is computed as follows: for the two rows being compared, every run of contiguous columns over which the two rows' values follow the same trend counts as one pattern; the total number of such patterns is denoted ACCC.

In the example where the trend is reduced to two cases, the number of contiguous column coherent evolution patterns between two rows is computed as follows:

First, create a row of the same length as the two rows, called the record row. Then examine the columns of the two rows in order: where the two rows have the same value, set the record row to 1 in that column; where they differ, set it to 0. Next, in the record row, count the segments consisting entirely of 1s and the number of 1s in each segment. Finally, compute the number of contiguous column coherent evolution patterns from these counts to obtain the result. A segment consisting entirely of 1s is a maximal run in which every value is 1.

For example, let the record row be 110011101. It contains three segments consisting entirely of 1s: 11, 111 and 1, so the numbers of 1s in the segments are 2, 3 and 1 respectively. A segment of n ones contains n(n+1)/2 contiguous sub-segments, so the resulting similarity is 3 + 6 + 1 = 10.
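The record-row calculation can be sketched as below. The function names are hypothetical, and the counting rule (a run of n ones contributes n(n+1)/2 patterns) is inferred from the worked example, where runs of length 2, 3 and 1 give a result of 10.

```python
from typing import List, Sequence


def record_row(row_a: Sequence[int], row_b: Sequence[int]) -> List[int]:
    """Return 1 for each column where the two rows agree and 0 where they differ."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same length")
    return [1 if a == b else 0 for a, b in zip(row_a, row_b)]


def accc_from_record(record: Sequence[int]) -> int:
    """Sum n * (n + 1) / 2 over every maximal run of n ones in the record row."""
    total = 0
    run = 0
    for flag in record:
        if flag == 1:
            run += 1
            total += run  # `run` new all-ones segments end at this column
        else:
            run = 0
    return total


def accc(row_a: Sequence[int], row_b: Sequence[int]) -> int:
    """ACCC between two rows of the difference matrix."""
    return accc_from_record(record_row(row_a, row_b))


record = [int(ch) for ch in "110011101"]
print(accc_from_record(record))  # 10
```

Accumulating the running length is equivalent to applying n(n+1)/2 per run, and avoids a second pass over the record row.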

Initializing the bicluster is specifically: first, select the bicluster core, that is, randomly select one row of the data matrix and then randomly select a run of contiguous columns as the column set of the bicluster; the values of the row on these contiguous columns form the bicluster core. Second, obtain the row set of the bicluster, that is, randomly select some rows of the data matrix as the initial row set, compute the ACCC between each row of the initial row set and the bicluster core obtained in the first step, and save the rows whose ACCC is greater than the set value to the row set of the bicluster. Third, update the bicluster core, that is, for each column of the bicluster, compute the mode of the column over the row set of the bicluster and use it as the value of the bicluster core in that column, giving the updated bicluster core.

An example of initializing a bicluster is shown in Fig. 2.

First step: select the bicluster core, that is, randomly select one row of the data matrix and then randomly select a run of contiguous columns as the column set of the bicluster; the values of the row on these contiguous columns form the bicluster core. In the difference matrix of Table 2, the randomly selected row is row b, i.e. [0 0 1 1 0], and the randomly selected contiguous columns are columns 2 and 3, which together form the bicluster core. As shown in Fig. 2, S1 denotes the bicluster core, ce denotes the values of the bicluster core (here the values of row b), and C1 denotes the contiguous columns of the bicluster core, here columns 2 and 3. In Fig. 2, the row labeled "column" gives the column labels of the difference matrix, the row labeled "value" gives the values of the bicluster core, and the columns inside the dashed box are the selected contiguous columns and their values.

Second step: obtain the row set of the bicluster, that is, randomly select some rows of the data matrix as the initial row set, compute the ACCC between each row of the initial row set and the bicluster core obtained in the first step, and save the rows whose ACCC is greater than or equal to the set value to the row set of the bicluster. Here the randomly selected initial row set is {b, c, d}. The ACCC values of rows b, c and d with the bicluster core are 3, 1 and 1 respectively, and the set threshold is 1. Because 3, 1 and 1 are all greater than or equal to 1, the row set of the bicluster is R1 = {b, c, d}.

Third step: update the bicluster core, that is, for each column of the bicluster, compute the mode of the column over the row set of the bicluster and use it as the value of the bicluster core in that column, giving the updated bicluster core. Fig. 2 shows the values of rows b, c and d; taking the mode of each column gives the row ce, with value [1 0 1 1 0]. For example, in the first column of rows b, c and d there is one 0 and two 1s; since 1 occurs more often than 0, the mode is 1, so the first column of the core is 1.

After the above steps, the row set, column set and core of the initial bicluster are: S1 = (ce = [1 0 1 1 0], C1 = {2, 3}), R1 = {b, c, d}, C1 = {2, 3}.
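The second and third initialization steps can be sketched as follows. This is an illustration under several assumptions: the function names are hypothetical, the matrix is an invented toy (Table 2 itself is not reproduced here), the random choices of seed row and columns are fixed by hand so the run is repeatable, rows are kept when their ACCC is greater than or equal to the set value (as in the worked example), and ties in the mode are resolved to 1 because the text does not say how ties are handled.

```python
from collections import Counter
from typing import Dict, List, Sequence


def accc(row_a: Sequence[int], row_b: Sequence[int]) -> int:
    """Count the contiguous column segments on which the two rows agree."""
    total = run = 0
    for a, b in zip(row_a, row_b):
        run = run + 1 if a == b else 0
        total += run
    return total


def select_rows(matrix: Dict[str, Sequence[int]], candidates: Sequence[str],
                core: Sequence[int], cols: Sequence[int], min_accc: int) -> List[str]:
    """Keep the candidate rows whose ACCC with the core on `cols` is at least min_accc."""
    core_vals = [core[c] for c in cols]
    return [name for name in candidates
            if accc([matrix[name][c] for c in cols], core_vals) >= min_accc]


def update_core(matrix: Dict[str, Sequence[int]], row_set: Sequence[str]) -> List[int]:
    """Set each core value to the column mode over the row set (assumption: ties give 1)."""
    n_cols = len(matrix[row_set[0]])
    core = []
    for col in range(n_cols):
        counts = Counter(matrix[name][col] for name in row_set)
        core.append(1 if counts[1] >= counts[0] else 0)
    return core


# Invented toy difference matrix with hypothetical gene names.
toy = {"g1": [0, 0, 1, 1, 0], "g2": [1, 1, 1, 0, 0], "g3": [1, 0, 0, 1, 1]}
cols = [1, 2]        # zero-based indices of columns 2 and 3
core = toy["g1"]     # seed row, chosen by hand instead of at random
rows = select_rows(toy, ["g1", "g2", "g3"], core, cols, min_accc=1)
core = update_core(toy, rows)
print(rows, core)  # ['g1', 'g2', 'g3'] [1, 0, 1, 1, 0]
```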

3. Iteratively update the bicluster by adding and deleting rows and columns, using the bicluster measure based on all contiguous column coherent evolution patterns.

First step: compute the perACCC value of the given bicluster. Second step: update the columns by addition and deletion. Third step: update the rows by scanning the data matrix and adding rows that satisfy the condition. Fourth step: update the bicluster core by taking the mode, in the same way as the third step of step (2). Fifth step: compute the perACCC value of the updated bicluster from the updated row set, contiguous columns and bicluster core. Sixth step: compute the rate of change of the perACCC value before and after the update and compare it with a preset threshold to decide whether to continue iterating.

First step: compute the perACCC value of the given bicluster.

The procedure is as described above for the bicluster measure based on all contiguous column coherent evolution patterns.

Second step: update the columns by addition and deletion.

The columns are updated by addition and deletion; the conditions for adding and deleting contiguous columns are as follows:
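A sketch of the perACCC calculation follows, using hypothetical names and an invented toy matrix. The text defines perACCC as the sum of the row ACCC values divided by the number of columns. The worked example, however, reports 0.83 for row ACCC values 3, 1 and 1 over two columns, which equals (3 + 1 + 1) / (3 × 2) and suggests that the sum may also be divided by the number of rows. The optional `per_row` flag below covers that reading and is an assumption, not something the text states.

```python
from typing import Dict, Sequence


def accc(row_a: Sequence[int], row_b: Sequence[int]) -> int:
    """Count the contiguous column segments on which the two rows agree."""
    total = run = 0
    for a, b in zip(row_a, row_b):
        run = run + 1 if a == b else 0
        total += run
    return total


def per_accc(matrix: Dict[str, Sequence[int]], row_set: Sequence[str],
             cols: Sequence[int], core: Sequence[int],
             per_row: bool = False) -> float:
    """Sum ACCC(row, core) over the row set and divide by the number of columns.

    per_row=True additionally divides by the number of rows (assumed variant).
    """
    core_vals = [core[c] for c in cols]
    total = sum(accc([matrix[name][c] for c in cols], core_vals) for name in row_set)
    value = total / len(cols)
    if per_row:
        value /= len(row_set)
    return value


# Invented toy data; the row ACCC values on these two columns are 3, 1 and 1.
toy = {"g1": [0, 0, 1, 1, 0], "g2": [1, 1, 1, 0, 0], "g3": [1, 0, 0, 1, 1]}
core = [1, 0, 1, 1, 0]
names = ["g1", "g2", "g3"]
print(per_accc(toy, names, [1, 2], core))                          # 2.5
print(round(per_accc(toy, names, [1, 2], core, per_row=True), 2))  # 0.83
```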

The example for updating the row collection process of double focusing class is as shown in Figure 3.

The bicluster has row set R1 = {b, c, d} and contiguous columns C1 = {2, 3}. First compute the perACCC value of this bicluster, denoted pA = perACCC(R1, C1). This gives pA = 0.83, shown as "before update" in the figure. Contiguous columns are first added by extending to the right. Column 4 is examined first, as shown in step 1 of Fig. 3, where the underlined column is the one currently being considered for addition. With contiguous columns {2, 3, 4}, the perACCC value is 1.11. Because 1.11 > pA = 0.83, column 4 is added to C1 and the new value 1.11 is assigned to pA. Extension continues to the right with column 5, as shown in step 2, in the same way as for column 4: with contiguous columns {2, 3, 4, 5}, the perACCC value is 1.42. Because 1.42 > pA = 1.11, column 5 is added to C1 and 1.42 is assigned to pA. Column 5 is the rightmost column, so extension to the right stops. Extension to the left then begins. Column 1 is examined first, as shown in step 3: with contiguous columns {1, 2, 3, 4, 5}, the perACCC value is 1.33. Because 1.33 < pA = 1.42, column 1 is not added to C1 and pA keeps its value of 1.42. Column 1 is the leftmost column, so extension to the left stops.

After the addition step ends, the deletion step begins. Deleting column 2, the leftmost of the contiguous columns, is considered first, as shown in step 4, where the struck-through column is the one currently being considered for deletion. With contiguous columns {3, 4, 5}, the perACCC value is 1.44. Because 1.44 > pA = 1.42, column 2 is deleted from C1 and 1.44 is assigned to pA. Deleting column 3 is considered next, as shown in step 5: with contiguous columns {4, 5}, the perACCC value is 1. Because 1 < pA = 1.44, column 3 is not deleted and pA keeps its value of 1.44. The contiguous columns were {2, 3} after initialization and {2, 3, 4, 5} after extension, i.e. columns 4 and 5 were added by extending to the right, so deletion from right to left is not performed. This round therefore produces the contiguous columns C1 = {3, 4, 5}, shown as "after update" in the figure.
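The column walk just described can be sketched as a greedy procedure over a contiguous range. The sketch below is illustrative only. It assumes that a column is accepted only on a strict improvement and that each direction stops at the first candidate that does not improve the score. The function name `update_columns` and its parameters are hypothetical. To avoid inventing data, the scoring function simply looks up the perACCC values quoted in the text for each column range (1-based, inclusive).

```python
def update_columns(score, lo, hi, first, last):
    """Greedy update of the contiguous column range [lo, hi].

    score(lo, hi) returns the perACCC value of the bicluster on that range;
    first and last are the outermost columns of the data matrix.
    """
    best = score(lo, hi)
    grew_right = grew_left = False
    # Extend to the right, then to the left, while the score improves.
    while hi < last and score(lo, hi + 1) > best:
        hi += 1
        best = score(lo, hi)
        grew_right = True
    while lo > first and score(lo - 1, hi) > best:
        lo -= 1
        best = score(lo, hi)
        grew_left = True
    # Delete from the left, unless that side was just extended.
    while not grew_left and lo < hi and score(lo + 1, hi) > best:
        lo += 1
        best = score(lo, hi)
    # Delete from the right, unless that side was just extended.
    while not grew_right and lo < hi and score(lo, hi - 1) > best:
        hi -= 1
        best = score(lo, hi)
    return lo, hi, best

# perACCC values quoted in the text, keyed by (leftmost, rightmost) column.
quoted = {(2, 3): 0.83, (2, 4): 1.11, (2, 5): 1.42,
          (1, 5): 1.33, (3, 5): 1.44, (4, 5): 1.0}
result = update_columns(lambda lo, hi: quoted[(lo, hi)], 2, 3, 1, 5)
print(result)  # (3, 5, 1.44)
```

Starting from columns {2, 3}, the walk visits exactly the five candidate ranges of steps 1 to 5 and ends at columns 3 to 5 with a score of 1.44.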

Third step: Update the rows by scanning the data matrix and adding the rows that satisfy the condition.

The rows are updated by scanning the data matrix and adding the rows that satisfy the condition. The row-addition condition is as follows:

An example of updating the row set of the bicluster is shown in Figure 4.

The row set obtained in the steps above is R1 = {b, c, d}, the contiguous columns are C1 = {3, 4, 5}, and the bicluster core is S1 = (ce = [1 0 1 1 0], C1 = {3, 4, 5}). The perACCC value of the bicluster after the column update is assigned to pA0, i.e. pA0 = 1.44. All rows of the difference matrix are then scanned. Row 1, row a, is examined first: perACCC({a}, C1) = 1. Because 1 < pA0 = 1.44, row a is not added to the row set R1. Row 2, row b, is examined next: perACCC({b}, C1) = 2. Because 2 > pA0 = 1.44, row b is added to the row set R1. Rows c and d are examined in the same way. The final row set is R1 = {b, d}. Compared with the original row set {b, c, d}, row c does not satisfy the condition, which is equivalent to deleting row c.
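The row scan can be sketched as follows. This is a hedged illustration: the trend values for rows a to d are hypothetical, chosen so that rows a and b give the per-row values 1 and 2 quoted above; the ACCC counting rule (every agreeing contiguous range, including single columns) and the function names are assumptions; and the comparison uses "greater than or equal to", as in the row update condition of claim 8.

```python
import numpy as np

def accc(row, core):
    """Count contiguous column ranges on which row and core agree everywhere."""
    total, run = 0, 0
    for r, c in zip(row, core):
        run = run + 1 if r == c else 0
        total += run
    return total

def update_rows(matrix, core, cols, pa0):
    """Keep every row whose ACCC with the core, divided by the column count, is >= pa0."""
    cols = list(cols)
    return [i for i, row in enumerate(matrix)
            if accc(row[cols], core[cols]) / len(cols) >= pa0]

names = ["a", "b", "c", "d"]
# Hypothetical difference-matrix rows.
trend = np.array([
    [0, 1, 1, 1, 1],  # a: agrees with the core on columns 3 and 4 only
    [1, 0, 1, 1, 0],  # b: agrees on columns 3, 4 and 5
    [0, 0, 1, 0, 1],  # c: agrees on column 3 only
    [1, 1, 1, 1, 0],  # d: agrees on columns 3, 4 and 5
])
core = np.array([1, 0, 1, 1, 0])
kept = [names[i] for i in update_rows(trend, core, cols=[2, 3, 4], pa0=1.44)]
print(kept)  # ['b', 'd']
```

Columns are 0-based in the code, so `cols=[2, 3, 4]` corresponds to C1 = {3, 4, 5}.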

Fourth step: Update the bicluster core by taking the mode, using the same method as the 3rd step of step (2).

The bicluster core is updated as described above. Because the method was described in detail earlier, it is not repeated here. After the update, the core is S1 = (ce = [1 0 1 1 0], C1 = {3, 4, 5}).

Fifth step: Compute the perACCC value of the updated bicluster from the updated row set, contiguous columns and bicluster core.

Because the method for computing perACCC was described in detail above, it is not repeated here.

Sixth step: Compute the rate of change of the bicluster's perACCC value before and after the update, and check it against a preset threshold to decide whether to continue iterating.

Specifically, whether to continue iterating is decided as follows. A rate-of-change threshold is set. The perACCC value computed before the rows, columns and core are updated is denoted pA0, and the perACCC value computed after the update is denoted pA1. If the rate of change from pA0 to pA1 is less than the preset threshold, the iterative update of the bicluster stops and the method proceeds to step (4).

An example of the iterative update and the stopping test is shown in Figure 5.

In this example, the rate-of-change threshold is set to deta = 0.1, i.e. iteration stops only when the rate of change of the perACCC value is less than 0.1, as shown in Figure 5. Before the bicluster is updated, its perACCC value is pA0 = 0.83. After the update, the computed perACCC value is pA1 = 2. This gives (pA1-pA0)/pA0 = 1.41. Because 1.41 > deta = 0.1, a second iteration is performed. Its operations are the same as those of the first iteration and are not described again here. In the second iteration, the perACCC value before the update is pA0 = 2 and the value after the update is pA1 = 2, so (pA1-pA0)/pA0 = 0. Because 0 < deta = 0.1, iteration stops, and the row set and column set of the bicluster are (R1, C1) = ({b, d}, {3, 4, 5}).
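The stopping test reduces to one comparison, sketched below with the values from this example. The function name `should_stop` is hypothetical, and the sketch assumes pA0 is non-zero; the text does not say how a zero pA0 would be handled.

```python
def should_stop(pa0, pa1, deta=0.1):
    """Stop iterating when the rate of change from pa0 to pa1 is below deta.

    Assumes pa0 != 0.
    """
    return (pa1 - pa0) / pa0 < deta

print(round((2 - 0.83) / 0.83, 2))  # 1.41
print(should_stop(0.83, 2))         # False: 1.41 > 0.1, iterate again
print(should_stop(2, 2))            # True: 0 < 0.1, stop
```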

4. Output the bicluster subject to row and column thresholds, thereby mining the time-series gene chip data.

A pair of row and column thresholds is set. If the bicluster does not satisfy the preset thresholds, the bicluster is reinitialized and the method returns to step (2). If the bicluster satisfies the thresholds, the biclustering result is obtained and the mining of the time-series gene chip data is complete.

In this example, the row threshold is set to min_row = 2 and the column threshold to min_col = 2. The row set and column set of the bicluster are (R1, C1) = ({b, d}, {3, 4, 5}). The bicluster has 2 rows and 3 columns, which satisfies min_row = 2 and min_col = 2, so it is output. The final biclustering result is therefore (R1, C1) = ({b, d}, {3, 4, 5}). A line chart of the output bicluster is shown in Figure 6.
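A minimal sketch of the output check follows. It assumes that "satisfying" a threshold means having at least that many rows or columns; the text does not state whether the comparison is strict, and the function name is hypothetical.

```python
def passes_thresholds(rows, cols, min_row=2, min_col=2):
    """Return True when the bicluster is large enough to be output.

    Assumption: the thresholds are inclusive lower bounds.
    """
    return len(rows) >= min_row and len(cols) >= min_col

print(passes_thresholds({"b", "d"}, {3, 4, 5}))  # True
print(passes_thresholds({"b"}, {3, 4, 5}))       # False: reinitialize and return to step (2)
```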

Following the above procedure, another example is given below.

1. Source of the data matrices

The real data matrix is yeast gene chip data, a time-series gene expression dataset from the yeast experiments of Cho et al. (R. J. Cho et al., 1998). The sampling interval is 10 minutes and there are 17 sampling points, each recording the expression levels of the genes. Each row of the data matrix gives the expression of one gene at the different time points, and each column gives the expression of all genes at one time point. Missing values in the data matrix are filled in by cubic spline interpolation, and the resulting data matrix has size 6147 × 17.

The synthetic data matrices are randomly generated, with values in the range 0 to 1 following a uniform distribution. Several datasets are generated. In one group the number of columns is fixed at 20 and the number of rows increases from 1000 to 2000 in steps of 100. In the other group the number of rows is fixed at 1000 and the number of columns increases from 20 to 40 in steps of 2.
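Synthetic matrices of this kind could be generated as in the sketch below. The use of NumPy, the seed and the variable names are assumptions made for illustration; the text does not say how the data were produced beyond the distribution and the sizes.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch reproducible

# Group 1: 20 columns, rows from 1000 to 2000 in steps of 100.
by_rows = [rng.uniform(0.0, 1.0, size=(n, 20)) for n in range(1000, 2001, 100)]

# Group 2: 1000 rows, columns from 20 to 40 in steps of 2.
by_cols = [rng.uniform(0.0, 1.0, size=(1000, m)) for m in range(20, 41, 2)]

print(len(by_rows), by_rows[0].shape, by_rows[-1].shape)  # 11 (1000, 20) (2000, 20)
print(len(by_cols), by_cols[0].shape, by_cols[-1].shape)  # 11 (1000, 20) (1000, 40)
```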

2. GO analysis of the mined biclusters

GO (Gene Ontology) annotation is used to examine the biological significance of the biclusters and to validate the clustering results. Biological significance analysis is performed on the biclustering results with GOToolbox, using the p-value. By computing the p-value between a discovered pattern and a known category, and by observing the similarity of the Gene Ontology annotations of genes in the same group, the biological meaning of the bicluster can be assessed. Genes with similar expression data generally belong to the same biological pathway and have similar biological processes and molecular functions. In the GO annotation experiments on the discovered biclusters, the smaller the p-value obtained, the stronger the association between the discovered pattern and the known category.

Part of the GO analysis results for the yeast gene chip data is shown in Table 3. The first row of Table 3 is GO:0002181. The gene database contains 7166 genes in total, of which only 186 (2.6%) belong to the GO term cytoplasmic translation. A bicluster found by the time-series gene chip data mining method based on consistent evolution across all contiguous columns contains 754 genes, of which 90 (11.9%) belong to the GO term cytoplasmic translation. The computed p-value is very small (4.62E-37), which shows that this bicluster has significant biological meaning. On the yeast gene chip data, the GO analysis shows that the biclustering results found have significant biological meaning.

Table 3. Partial GO analysis results for the biclustering results
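To illustrate how counts of this kind translate into a p-value, the sketch below computes a hypergeometric upper-tail probability from the four numbers quoted above. This is only one common way to test GO term over-representation; the text does not state which test or correction GOToolbox applied, so the result should not be expected to equal 4.62E-37. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return float(Fraction(hits, total))  # exact ratio, converted once at the end

# 7166 genes in the database, 186 annotated; 754 genes in the bicluster, 90 annotated.
p = hypergeom_upper_tail(N=7166, K=186, n=754, k=90)
print(f"{p:.2e}")
```

Exact integer arithmetic is used so that the very small tail probability is not lost to floating-point underflow along the way.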

3. Biological enrichment analysis of the mined biclusters

To verify the biological meaning of the biclusters in a statistical sense, the biological functional enrichment of the biclustering results is analyzed (Al-Akwaa＆Kadah, 2009；et al.,2006). The p-value between each discovered bicluster and the known categories is computed, and the percentage of all biclusters whose p-value is below a preset threshold is counted. This percentage gives the statistical significance of the biclusters in terms of biological process enrichment. For the same, low, p-value threshold, a larger percentage indicates that the discovered biclusters as a whole are more strongly associated with known gene categories, and hence that their biological meaning is more significant.
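The percentage described here can be computed as in the following sketch. The p-values are hypothetical, one per bicluster, and the function name and the strict "less than" comparison are assumptions.

```python
def enriched_percentage(p_values, threshold):
    """Percentage of biclusters whose p-value is below the threshold."""
    if not p_values:
        return 0.0
    return 100.0 * sum(p < threshold for p in p_values) / len(p_values)

# Hypothetical p-values for four biclusters.
print(enriched_percentage([1e-8, 3e-4, 0.02, 0.4], threshold=0.001))  # 50.0
```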

Biological enrichment analysis is performed on the yeast gene chip data, and several clustering and biclustering methods are compared. The enrichment bar chart of the biclustering results is shown in Fig. 7. The figure shows that the biological enrichment obtained by the method based on consistent evolution across all contiguous columns is clearly higher than that of CCC, CC-TSB, CCC, OPSM, xMotifs and K-means. The enrichment analysis on this real gene expression dataset shows that the time-series gene chip data mining method based on consistent evolution across all contiguous columns extracts more valuable, biologically significant biclusters from gene expression data.

4. Scalability analysis of the mining method

The relationship between program running time and dataset size is analyzed. Fig. 8 and Fig. 9 show the running time of the method for data matrices of different sizes, and the two figures show similar trends. For Fig. 8, the total number of columns of the data matrix is fixed at 20, and the number of rows increases from 1000 to 2000 in steps of 100. Fig. 8 shows that the running time of the method grows linearly as the number of rows increases. For Fig. 9, the total number of rows of the data matrix is fixed at 2000, and the number of columns increases from 20 to 40 in steps of 2. Fig. 9 shows that the running time of the method grows approximately linearly as the number of columns increases.

The experimental results above therefore show that the time-series gene chip data mining method based on consistent evolution across all contiguous columns scales well as the data volume grows.

The examples above further show that the invention has the following advantages:

(1) The method takes the information of adjacent time points into account and can effectively mine biclusters that carry time-series information. It reveals how gene expression changes over time, and from this the various relationships between different genes, including regulatory relationships between genes.

(2) The method considers the consistent evolution information of all contiguous columns and does not lose the consistent evolution information of short contiguous column ranges. It can effectively find biclusters with significant biological meaning and provides knowledge about the co-regulation of physiological processes.

(3) The method is efficient: it mines biclusters with few resources and in a short time. Its running time is approximately linear in the data scale and does not grow sharply as the data scale increases, which gives it practical value.

Claims

1. A time-series gene chip data mining method based on consistent evolution across all contiguous columns, characterized by comprising the following steps:

(1) inputting the time-series gene chip data and preprocessing the data;

(2) initializing a bicluster on the data obtained in step (1), using the bicluster measure based on consistent evolution across all contiguous columns;

(3) iteratively updating the bicluster by adding and deleting rows and columns, using the bicluster measure based on consistent evolution across all contiguous columns;

(4) outputting the bicluster subject to row and column thresholds, thereby mining the time-series gene chip data.

2. The time-series gene chip data mining method based on consistent evolution across all contiguous columns according to claim 1, characterized in that step (1) is specifically: in the data initialization stage, the original matrix is first converted into a matrix that records the trend between every two adjacent time points, reflecting how the expression value of each gene changes between two adjacent time points.

3. The time-series gene chip data mining method based on consistent evolution across all contiguous columns according to claim 1, characterized in that, in steps (2) and (3), the concrete operations of the bicluster measure based on consistent evolution across all contiguous columns are: for each row in the bicluster, computing the number of consistent evolution patterns over all contiguous columns between that row and the bicluster core; summing these numbers; and then dividing the sum by the number of columns of the bicluster, thereby obtaining the value that measures the bicluster, denoted perACCC.

4. The time-series gene chip data mining method based on consistent evolution across all contiguous columns according to claim 3, characterized in that, in steps (2) and (3), the concrete operations of computing the number of consistent evolution patterns over all contiguous columns between each row in the bicluster and the bicluster core are: for the two rows being compared, every pattern in which the expression values of the two rows have the same change trend on some set of contiguous columns is taken as an object to be counted; the number of all such patterns is counted, and the result is denoted ACCC.

5. The time-series gene chip data mining method based on consistent evolution across all contiguous columns according to claim 4, characterized in that step (2) is specifically: first step: selecting the bicluster core, i.e. randomly selecting one row in the data matrix and then randomly selecting a set of contiguous columns as the column set of the bicluster, the values of that row on those contiguous columns constituting the bicluster core; second step: obtaining the row set of the bicluster, i.e. randomly selecting several rows in the data matrix as an initial row set, computing said ACCC between each row of the initial row set and the bicluster core obtained in the first step, and saving the rows whose ACCC is greater than a set value into the row set of the bicluster; third step: updating the bicluster core, i.e. for each column of the bicluster, computing the mode of that column over the row set of the bicluster and using the mode as the value of the corresponding column of the bicluster core, thereby obtaining the updated bicluster core.

6. The time-series gene chip data mining method based on consistent evolution across all contiguous columns according to claim 5, characterized in that step (3) is specifically: first step: computing the perACCC value of the given bicluster; second step: updating the columns by addition and deletion; third step: updating the rows by scanning the data matrix and adding the rows that satisfy the condition; fourth step: updating the bicluster core by taking the mode, using the same method as the 3rd step of step (2); fifth step: computing the perACCC value of the updated bicluster from the updated row set, contiguous columns and bicluster core; sixth step: computing the rate of change of the bicluster's perACCC value before and after the update, and checking it against a preset threshold to decide whether to continue iterating.

7. The time-series gene chip data mining method based on consistent evolution across all contiguous columns according to claim 6, characterized in that, in step (3), the columns are updated by addition and deletion, and the conditions for adding and deleting contiguous columns are as follows:

the contiguous columns are first extended to the right; after rightward extension ends, the contiguous columns are extended to the left, until extension ends;

deletion is first performed rightward, starting from the leftmost column, until it stops; deletion is then performed leftward, starting from the rightmost column, until it stops;

if the contiguous columns were extended to the right by one or more columns, the columns on the right are not deleted; if the contiguous columns were extended to the left by one or more columns, the columns on the left are not deleted.

8. The time-series gene chip data mining method based on consistent evolution across all contiguous columns according to claim 6, characterized in that, in step (3), the rows are updated by scanning the data matrix and adding the rows that satisfy the condition, and the row-addition condition is as follows:

row update condition: if, after a row is added, the computed perACCC value is greater than or equal to the perACCC value of the bicluster before the row was added, the addition of the row is confirmed; otherwise the row is not added;

the perACCC value of the bicluster is computed first; the data matrix is then scanned once, the ACCC value between each row and the bicluster core is computed and divided by the number of contiguous columns of the bicluster core, and the rows for which this value is greater than or equal to said perACCC value are added to the row set, thereby updating the row set.

9. The time-series gene chip data mining method based on consistent evolution across all contiguous columns according to claim 6, characterized in that the method for deciding whether to continue iterating in step (3) is specifically: a rate-of-change threshold is set; the perACCC value computed before the rows, columns and core are updated is denoted pA0, and the perACCC value computed after the update is denoted pA1; if the rate of change from pA0 to pA1 is less than the preset rate-of-change threshold, the iterative update of the bicluster stops and the method proceeds to step (4).

10. The time-series gene chip data mining method based on consistent evolution across all contiguous columns according to claim 1, characterized in that step (4) is specifically: a pair of row and column thresholds is set; if the bicluster does not satisfy the preset thresholds, the bicluster is reinitialized and the method returns to step (2); if the bicluster satisfies the thresholds, the biclustering result is obtained, thereby mining the time-series gene chip data.