Background Art
Bioinformatics is developing rapidly. With the continuing progress of the life sciences, using data mining techniques to analyze biological information has become a current and future trend. Bioinformatic analysis centers on two fields, genomics and proteomics, which take nucleotide sequences and protein sequences respectively as their starting point, use high-throughput analysis as their core technology, and study the biological meaning contained in the sequences (Campbell & Heyer, 2003). The launch of the Human Genome Project advanced bioinformatics, and once genome sequencing was completed the genomic era began in full (Masters & Lakhani, 2000).
One kind of gene expression data matrix is the time-series gene expression data matrix. Because the data form a time series, they carry a time factor. The values in such a matrix are obtained by measuring the expression of the tested genes at different times, and they show how expression values change over time (Bar-Joseph, 2004). Measuring the degree of gene expression at successive time points makes it possible to learn how gene expression varies with time, and from that to learn information such as the relationships between different genes and the regulatory relationships among genes (Korenberg, 2007). Co-regulated genes may show the same expression over some continuous time interval, and that interval may be closely related to one stage of a cellular process (Zhang, Zha, 2005). Time-series gene expression data tend to provide knowledge about the co-regulation of physiological processes and can reflect how a biological process relates to the passage of time, so they play an important role in analyzing gene regulatory networks and dynamic biological processes (Bar-Joseph, 2004; F. Liu & Wang, 2010).
In a time-series gene expression data matrix, rows represent genes and columns represent the successive experimental time points, which have a temporal order. For such data, patterns in gene expression level across adjacent time points are the more meaningful ones. However, the biclusters mined by many biclustering models do not consist of contiguous adjacent time points, so these models cannot handle time-series data well. As research has progressed, some biclustering methods have been designed around the characteristics of time-series data; they can analyze a time-series gene expression data matrix and find biological knowledge within continuous time intervals.
Clustering is a classical data mining technique, and an important line of current research on gene expression data uses clustering methods such as k-means (J. A. Hartigan & Wong, 1979), hierarchical clustering (Cameron, Middleton, Chenn, & Olson, 2012), and self-organizing neural networks (Kolehmainen, Wong, & Castrén, 1999). These methods have solved many practical biological problems. Clustering considers similarity across all dimensions of the data matrix and divides the genes into several disjoint sets. After clustering, genes within the same set typically have similar expression profiles and differ from the genes in other sets.
Applying traditional clustering to gene expression data also has shortcomings. First, a clustering method considers all values along one dimension: it either considers only rows and forms row clusters, or considers only columns and forms column clusters, so the clusters it forms are based on global information. In the study of cellular processes, however, some genes may respond markedly only under certain conditions rather than under all conditions. Second, traditional clustering usually partitions the samples into disjoint sets, so a sample or gene belongs to at most one cluster and cannot belong to several clusters at once (Tanay, Sharan, & Shamir, 2002). In actual gene research, though, a gene is very likely to take part in several biological processes, that is, to be a member of several different clusters at the same time.
Because of these limitations, traditional clustering has difficulty mining local expression patterns in the data matrix and uncovering the hidden complex relationships between genes (Cheung, Cheung, Kao, Yip, & Ng, 2006); biclustering can solve this problem.
To overcome the shortcomings of clustering, biclustering emerged and has been widely adopted in gene data research (Eren, Deveci, & 2013; Flores, Inza, & Calvo, 2013; Sara C Madeira & Arlindo L Oliveira, 2004; Nepomuceno, Troncoso, & Aguilar-Ruiz, 2011). Unlike traditional clustering, biclustering considers the gene dimension and the experimental condition dimension of the gene data together. By analyzing both dimensions at once, biclustering can mine local information in the data, namely groups of genes that show similar expression only under certain conditions. Biclustering is also more flexible: a gene set or condition set may be contained in several clusters at once, so different clusters may overlap.
Cheng and Church (Cheng & Church, 2000) were the first to apply the concept of biclustering to gene expression data. They proposed the mean squared residue as the measure that constrains a bicluster, together with a heuristic method, CC, for mining biclusters. The CC method computes the mean squared residue of a submatrix and repeatedly deletes rows and columns until it obtains a bicluster with a small mean squared residue. Their method can find some of the biological information in gene expression data but has difficulty analyzing time-series gene expression data, because such data carry internal temporal information and the method cannot discover relationships along the time sequence.
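For orientation only, the mean squared residue used by the CC method can be sketched as follows. The sketch uses the usual Cheng and Church definition (each entry minus its row mean and column mean, plus the overall mean, squared and averaged); the function name `mean_squared_residue` and the sample matrices are hypothetical and are not taken from this document.

```python
from typing import Sequence


def mean_squared_residue(sub: Sequence[Sequence[float]]) -> float:
    """Mean squared residue of a submatrix (rows = genes, columns = conditions)."""
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(row[j] for row in sub) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Residue of one entry: what remains after removing row and column effects.
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue * residue
    return total / (n_rows * n_cols)


# Hypothetical data: rows that differ only by a constant shift have residue 0.
print(mean_squared_residue([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]))
```

Because the residue depends on the actual magnitudes of the entries and treats columns as unordered, it illustrates why this measure ignores the temporal order of the columns.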
Zhang et al. (Zhang et al., 2005) improved the CC method and proposed CC-TSB, a method for analyzing time-series data. The main idea of CC-TSB is similar to that of CC. The main difference is that CC-TSB restricts column operations so that the columns of a bicluster are always contiguous: a column can be added or removed only at the first or last position. The method can therefore find biclusters that carry some temporal information. However, the measure CC-TSB uses to constrain a submatrix is still the mean squared residue proposed by Cheng and Church. The mean squared residue is concerned with the absolute magnitude of gene expression values, which are affected by noise and by measurement scale, so CC-TSB is not sufficiently robust to noisy data. In addition, the mean squared residue does not take the internal order of the sequence into account, so the method cannot measure the characteristics of time-series gene expression data well.
When continuous time is taken into account, attending to the "coherent evolution" between adjacent time points is more meaningful than attending to the "magnitude of the actual values" (Sara C Madeira & Arlindo L Oliveira, 2004). Sara C. Madeira et al. (Madeira & Oliveira, 2005) proposed the contiguous coherent columns (CCC) approach. CCC restricts patterns to contiguous columns and is used to find all maximal contiguous column coherent biclusters. The CCC method first converts the raw data into character sequences that represent rises and falls, then uses a suffix tree data structure to improve efficiency, and finally mines the maximal contiguous column coherent biclusters.
Because the CCC method does not consider the influence of noise, Sara C. Madeira et al. later took noise into account, improved the original CCC biclustering method, and proposed the e-CCC biclustering method (Madeira & Oliveira, 2007). e-CCC is similar to CCC; the difference is that e-CCC tolerates a certain amount of error, treating patterns whose error is below a predetermined threshold as the same pattern. Experimental results show that e-CCC is more robust to noise and that the biclusters it finds are more enriched in biological information.
Both CCC and e-CCC take into account an important characteristic of time-series gene data, its temporal order. They attend to the trend between two adjacent time points and to the relative rather than absolute magnitude of gene expression values, so the models are fairly robust to noise. However, CCC considers only the locally longest pattern and loses the information in contiguous column coherent evolutions of other lengths, such as the second longest and third longest. In addition, the existing methods for finding contiguous column coherent evolutions, CCC and e-CCC, both rely on suffix-tree string processing, which has high space complexity and makes large-scale data difficult to handle.
The existing methods have the following shortcomings:
(1) The existing methods do not take the information in adjacent time points into account. Time-series data, however, have a temporal order, with an ordering relationship across continuous time, so these methods cannot measure the similarity of such data well. The problem of mining biclusters with adjacent-time information taken into account therefore needs to be solved.
(2) The existing methods consider only the locally longest pattern and lose the information in contiguous column coherent evolutions of other lengths, such as the second longest and third longest. The problem of capturing the information in all contiguous column coherent evolutions between two sequences of time-series gene expression data therefore needs to be solved.
(3) The existing methods run inefficiently and consume considerable time and resources when mining biclusters; this is also one of the technical problems to be solved by the present invention.
Summary of the Invention
The object of the present invention is to overcome the above deficiencies of the prior art by proposing a bicluster mining method for gene chip expression data based on contiguous column coherent evolution. The original matrix is first converted into a difference matrix; a new measure of bicluster quality is then proposed; and biclusters are obtained iteratively by changing rows and columns. The method takes adjacent time points into account, captures the information in all contiguous column coherent evolutions, and is faster and more efficient than existing methods. The specific technical scheme is as follows.
A time-series gene chip data mining method based on all contiguous column coherent evolutions comprises the following steps:
(1) input the time-series gene chip data and preprocess the data;
(2) using the biclustering measure based on all contiguous column coherent evolutions, initialize a bicluster on the data obtained in step (1);
(3) using the biclustering measure based on all contiguous column coherent evolutions, iteratively update the bicluster by adding and deleting rows and columns;
(4) output subject to row and column thresholds to obtain the bicluster, thereby mining the time-series gene chip data.
Further, step (1) is specifically: in the data initialization stage, the original matrix is first converted into a matrix that records the trend between every two adjacent time points, reflecting how the expression value of each gene changes between two adjacent time points.
Further, in steps (2) and (3), the biclustering measure based on all contiguous column coherent evolutions is computed as follows: for each row of the bicluster, count the contiguous column coherent evolutions between that row and the bicluster core; sum these counts; then divide the sum by the number of columns of the bicluster. The result, the value that measures the bicluster, is denoted perACCC.
Further, in steps (2) and (3), the number of all contiguous column coherent evolutions between a row of the bicluster and the bicluster core is computed as follows: for the two rows being compared, every range of contiguous columns over which the expression values of the two rows share the same trend is counted as one pattern, and the total number of such patterns is denoted ACCC.
Further, step (2) is specifically: First step: select the bicluster core, that is, randomly select one row of the data matrix and then randomly select a run of contiguous columns as the column set of the bicluster; the values of that row on those contiguous columns form the bicluster core. Second step: obtain the row set of the bicluster, that is, randomly select several rows of the data matrix as the initial row set, compute the ACCC between each row of the initial row set and the bicluster core obtained in the first step, and save the rows whose ACCC exceeds a set value into the row set of the bicluster. Third step: update the bicluster core, that is, for each column of the bicluster, compute the mode of that column over the row set of the bicluster and use it as the core's value for that column, giving the updated bicluster core.
Further, step (3) is specifically: First step: compute the perACCC value of the given bicluster. Second step: update the columns by addition and deletion. Third step: update the rows by scanning the data matrix and adding the rows that meet the condition. Fourth step: update the bicluster core by taking the mode, in the same way as the third step of step (2). Fifth step: compute the perACCC value of the updated bicluster from the updated row set, contiguous columns, and bicluster core. Sixth step: compute the rate of change of the perACCC value of the bicluster before and after the update and check it against a preset threshold to decide whether to continue iterating.
Further, in step (3), the columns are updated by addition and deletion. The conditions for adding and deleting contiguous columns are as follows:
Column addition condition: if the computed perACCC value increases after a column is added, the addition is confirmed;
contiguous columns are first added by expanding to the right; when expansion to the right ends, contiguous columns are added by expanding to the left until that expansion also ends;
Column deletion condition: if the computed perACCC value increases after a column is deleted, the deletion is confirmed;
deletion first proceeds rightward from the leftmost column until it stops, and then proceeds leftward from the rightmost column until it stops;
if the contiguous columns were expanded to the right by adding columns, the columns on the right are not deleted; if they were expanded to the left by adding columns, the columns on the left are not deleted.
Further, in step (3), the rows are updated by scanning the data matrix and adding the rows that meet the condition. The row addition condition is as follows:
Row update condition: if, after a row is added, the computed perACCC value is greater than or equal to the perACCC value of the bicluster before the row was added, the addition is confirmed; otherwise the row is not added;
the perACCC value of the bicluster is computed first. The data matrix is then scanned once: the ACCC value between each row and the bicluster core is computed and divided by the number of contiguous columns of the bicluster core, and each row for which this quotient is greater than or equal to the perACCC value is added to the row set, which updates the row set.
Further, the method of deciding whether to continue iterating in step (3) is specifically: set a rate-of-change threshold; denote the perACCC value computed before the rows, columns, and core are updated as pA0 and the perACCC value computed after the update as pA1; if the rate of change from pA0 to pA1 is less than the preset rate-of-change threshold, stop the iterative update of the bicluster and go to step (4).
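By way of illustration only, the stopping test can be sketched as below. The text does not state how the rate of change is computed; the relative change |pA1 - pA0| / |pA0| used here is an assumption, as are the function name `should_stop` and the sample numbers.

```python
def should_stop(pa0: float, pa1: float, rate_threshold: float) -> bool:
    """Return True when perACCC changed by less than the preset rate threshold.

    Assumption: "rate of change" is taken as the relative change with respect
    to pA0; the source does not give the formula.
    """
    if pa0 == 0:
        # Relative change is undefined; stop only if nothing changed at all.
        return pa1 == 0
    return abs(pa1 - pa0) / abs(pa0) < rate_threshold


# Hypothetical values: a change of about 1.4 % is below a 5 % threshold.
print(should_stop(1.42, 1.44, 0.05))
```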
Further, step (4) is specifically: set a pair of row and column thresholds. If the bicluster does not meet the preset row and column thresholds, reinitialize the bicluster and return to step (2); if it meets them, the biclustering result is obtained and the time-series gene chip data have been mined.
Compared with the prior art, the present invention has the following prominent substantive features and notable advantages:
The present invention analyzes time-series gene chip expression data with a contiguous column coherent evolution model and finds the submatrices that satisfy the bicluster measure proposed by the invention. It takes the time factor in the data into account and can learn how gene expression varies with time, and from that learn information such as the relationships between different genes and the regulatory relationships among genes. Co-regulated genes show the same expression over some continuous time interval, and that interval is closely related to one stage of a cellular process; the method can thus provide knowledge about the co-regulation of physiological processes, which is important for analyzing gene regulatory networks and dynamic biological processes. In addition, the invention is highly efficient and can complete bicluster mining with fewer resources and in a relatively short time.
Detailed Description of the Embodiments
The embodiments of the present invention are further described below with reference to the accompanying drawings, but the implementation of the invention is not limited to them. It should be emphasized that any symbol or operation not described in detail below can be implemented by those skilled in the art with reference to the prior art.
As shown in Fig. 1, the bicluster mining method for gene chip expression data based on contiguous column coherent evolution in this example comprises the following:
1. Input the time-series gene chip data and preprocess the data.
In the data initialization stage, the original matrix is first converted into a matrix that records the trend between every two adjacent time points, reflecting how the expression value of each gene changes between two adjacent time points.
In this example the trend is reduced to two cases, increase or no change on the one hand and decrease on the other, represented by 1 and 0 respectively. The original matrix is thus converted into a matrix with one column fewer than the original, called the difference matrix.
The original matrix is shown in Table 1, where the row labels a, b, c, d denote different genes and the column labels t1, t2, t3, t4, t5, t6 denote time points. The difference matrix obtained from Table 1 by preprocessing is shown in Table 2, where the row labels a, b, c, d denote different genes and the column labels 1, 2, 3, 4, 5 denote the trends from t1 to t2, t2 to t3, t3 to t4, t4 to t5, and t5 to t6 respectively. For example, the expression values of row a in Table 1 at times t1 and t2 are 0.55 and 0.19; because 0.19 < 0.55, the first column of row a in Table 2 is 0. Similarly, the next value of row a in Table 1 is 0.83 > 0.19, so the second column of row a is 1.
Table 1. Original time-series data before conversion.
Table 2. Difference matrix data after conversion.
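The preprocessing step can be sketched as follows, purely as an illustration. The function name `to_difference_matrix` is hypothetical; only the three values 0.55, 0.19, and 0.83 of gene a are taken from the text, since the remaining contents of Table 1 are not reproduced here.

```python
from typing import List, Sequence


def to_difference_matrix(expr: Sequence[Sequence[float]]) -> List[List[int]]:
    """Encode each pair of adjacent time points as 1 (increase or unchanged) or 0 (decrease).

    The result has one column fewer than the input matrix.
    """
    return [
        [1 if later >= earlier else 0 for earlier, later in zip(row, row[1:])]
        for row in expr
    ]


# First three time points of gene a as quoted in the text: 0.55, 0.19, 0.83.
print(to_difference_matrix([[0.55, 0.19, 0.83]]))  # [[0, 1]]
```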
2. Using the biclustering measure based on all contiguous column coherent evolutions, initialize a bicluster on the data obtained in step (1).
The biclustering measure based on all contiguous column coherent evolutions is computed as follows: for each row of the bicluster, count the contiguous column coherent evolutions between that row and the bicluster core; sum these counts; then divide the sum by the number of columns of the bicluster. The result, the value that measures the bicluster, is denoted perACCC.
The number of all contiguous column coherent evolutions between a row of the bicluster and the bicluster core is computed as follows: for the two rows being compared, every range of contiguous columns over which the expression values of the two rows share the same trend is counted as one pattern, and the total number of such patterns is denoted ACCC.
When the trend is reduced to two cases as in this example, the number of all contiguous column coherent evolutions of two rows is computed as follows:
First generate a row of data the same length as the two rows, called the record row. Then examine the columns of the two rows in order: where the two rows have the same value, set the record row to 1 in that column; where they differ, set it to 0. In the record row, count the number of all-1 segments and the number of 1s in each segment, and from these compute the number of contiguous column coherent evolutions to obtain the result. An all-1 segment is an uninterrupted stretch whose values are all 1.
For example, suppose the record row is 110011101. It contains 3 all-1 segments: 11, 111, and 1. The numbers of 1s in the segments are therefore 2, 3, and 1, and the computed similarity is 10 (a segment of n ones contains n(n+1)/2 contiguous ranges, giving 3 + 6 + 1 = 10).
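The counting procedure can be sketched as follows. The name `accc` and the running-sum formulation are illustrative assumptions: each run of n agreeing columns is taken to contribute n(n+1)/2 patterns, which follows from counting every contiguous range inside the run and reproduces the value 10 of the example above.

```python
from typing import Sequence


def accc(row: Sequence[int], core: Sequence[int]) -> int:
    """Count every contiguous column range on which `row` and `core` agree throughout."""
    total = 0
    run = 0  # length of the current all-1 segment of the (implicit) record row
    for r, c in zip(row, core):
        if r == c:
            run += 1
            total += run  # summing 1 + 2 + ... + n gives n(n+1)/2 for a run of n
        else:
            run = 0
    return total


# Comparing against an all-1 core makes the record row equal to the row itself,
# so this reproduces the record row 110011101 from the text.
record = [1, 1, 0, 0, 1, 1, 1, 0, 1]
print(accc(record, [1] * 9))  # 10
```

The record row is never materialized here; the running counter carries the same information in a single pass over the columns.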
Initializing a bicluster is specifically: First step: select the bicluster core, that is, randomly select one row of the data matrix and then randomly select a run of contiguous columns as the column set of the bicluster; the values of that row on those contiguous columns form the bicluster core. Second step: obtain the row set of the bicluster, that is, randomly select several rows of the data matrix as the initial row set, compute the ACCC between each row of the initial row set and the bicluster core obtained in the first step, and save the rows whose ACCC exceeds a set value into the row set of the bicluster. Third step: update the bicluster core, that is, for each column of the bicluster, compute the mode of that column over the row set of the bicluster and use it as the core's value for that column, giving the updated bicluster core.
An example of initializing a bicluster is shown in Fig. 2.
First step: select the bicluster core, that is, randomly select one row of the data matrix and then randomly select a run of contiguous columns as the column set of the bicluster; the values of that row on those contiguous columns form the bicluster core. In the difference matrix of Table 2, the randomly selected row is row b, namely [0 0 1 1 0], and the randomly selected contiguous columns are columns 2 and 3, which together form the bicluster core. As shown in Fig. 2, S1 denotes the bicluster core, ce denotes the values of the core, here the values of row b, and C1 denotes the contiguous columns of the core, columns 2 and 3. In Fig. 2, the line labeled "column" gives the column labels of the difference matrix, the line labeled "value" gives the values of the bicluster core, and the columns inside the dashed box are the selected contiguous columns and their values.
Second step: obtain the row set of the bicluster, that is, randomly select several rows of the data matrix as the initial row set, compute the ACCC between each row of the initial row set and the bicluster core obtained in the first step, and save the rows whose ACCC is greater than or equal to the set value into the row set of the bicluster. Here the randomly selected initial row set is {b, c, d}. Computing the ACCC between each row of {b, c, d} and the bicluster core from the first step gives 3, 1, and 1 respectively. The threshold is set to 1; since 3, 1, and 1 are all greater than or equal to 1, the row set of the bicluster is R1 = {b, c, d}.
Third step: update the bicluster core, that is, for each column of the bicluster, compute the mode of that column over the row set of the bicluster and use it as the core's value for that column, giving the updated bicluster core. Fig. 2 lists the values of rows b, c, and d; taking the mode of each column gives the line corresponding to ce, with value [1 0 1 1 0]. For example, in the first column of rows b, c, and d there is one 0 and two 1s; 1 occurs more often than 0, so the mode is 1 and the first column of the core is 1.
After these steps, the row set, column set, and core of the initial bicluster are: S1 = (ce = [1 0 1 1 0], C1 = {2, 3}), R1 = {b, c, d}, C1 = {2, 3}.
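The three initialization steps can be sketched as follows, under stated assumptions. The names `init_bicluster`, `column_mode`, `n_candidates`, and `min_accc` are hypothetical; the sketch uses 0-based column indices, accepts rows whose ACCC is greater than or equal to the threshold (as in the worked example), and breaks ties in the mode toward 1, a case the text does not address.

```python
import random
from typing import List, Sequence, Tuple


def accc(row: Sequence[int], core: Sequence[int]) -> int:
    """Count every contiguous column range on which `row` and `core` agree."""
    total = run = 0
    for r, c in zip(row, core):
        run = run + 1 if r == c else 0
        total += run
    return total


def column_mode(rows: Sequence[Sequence[int]], col: int) -> int:
    """Mode of a 0/1 column; ties go to 1 (an assumption, not stated in the text)."""
    ones = sum(r[col] for r in rows)
    return 1 if 2 * ones >= len(rows) else 0


def init_bicluster(
    diff: Sequence[Sequence[int]], n_candidates: int, min_accc: int, rng: random.Random
) -> Tuple[List[int], Tuple[int, int], List[int]]:
    n_rows, n_cols = len(diff), len(diff[0])
    # Step 1: a random row and a random contiguous column range (inclusive) form the core.
    core = list(diff[rng.randrange(n_rows)])
    lo = rng.randrange(n_cols)
    hi = rng.randrange(lo, n_cols)
    # Step 2: keep the randomly chosen candidate rows that are similar enough to the core.
    candidates = rng.sample(range(n_rows), n_candidates)
    row_set = sorted(
        i for i in candidates
        if accc(diff[i][lo:hi + 1], core[lo:hi + 1]) >= min_accc
    )
    # Step 3: replace each core value by the column mode over the row set.
    if row_set:
        members = [diff[i] for i in row_set]
        core = [column_mode(members, j) for j in range(n_cols)]
    return row_set, (lo, hi), core


# Hypothetical difference matrix; only row index 1 ([0 0 1 1 0], gene b) comes from the text.
diff = [[0, 1, 1, 0, 1], [0, 0, 1, 1, 0], [1, 0, 1, 0, 0], [1, 1, 1, 1, 0]]
print(init_bicluster(diff, n_candidates=3, min_accc=1, rng=random.Random(0)))
```

Passing the random generator in explicitly keeps runs reproducible, which is convenient because step (4) may restart initialization several times.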
3. Using the biclustering measure based on all contiguous column coherent evolutions, iteratively update the bicluster by adding and deleting rows and columns.
First step: compute the perACCC value of the given bicluster. Second step: update the columns by addition and deletion. Third step: update the rows by scanning the data matrix and adding the rows that meet the condition. Fourth step: update the bicluster core by taking the mode, in the same way as the third step of step (2). Fifth step: compute the perACCC value of the updated bicluster from the updated row set, contiguous columns, and bicluster core. Sixth step: compute the rate of change of the perACCC value of the bicluster before and after the update and check it against a preset threshold to decide whether to continue iterating.
First step: compute the perACCC value of the given bicluster.
The procedure is as described above for the biclustering measure based on all contiguous column coherent evolutions.
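A minimal sketch of the perACCC computation follows. By default it implements the definition as worded above (sum of the per-row ACCC values divided by the number of columns). The optional `per_row` flag, which additionally divides by the number of rows, is an assumption added for comparison: the value 0.83 quoted for two columns in the column-update example is not a multiple of 1/2 and so cannot be an integer sum divided by 2, which suggests some further normalization. All names are hypothetical and column indices are 0-based.

```python
from typing import Sequence


def accc(row: Sequence[int], core: Sequence[int]) -> int:
    """Count every contiguous column range on which `row` and `core` agree."""
    total = run = 0
    for r, c in zip(row, core):
        run = run + 1 if r == c else 0
        total += run
    return total


def per_accc(
    diff: Sequence[Sequence[int]],
    row_set: Sequence[int],
    cols: Sequence[int],
    core: Sequence[int],
    per_row: bool = False,
) -> float:
    """perACCC of the bicluster (row_set, cols) with respect to `core`."""
    core_part = [core[j] for j in cols]
    total = sum(accc([diff[i][j] for j in cols], core_part) for i in row_set)
    value = total / len(cols)  # definition as worded in the text
    # Assumed variant: also average over the rows of the bicluster.
    return value / len(row_set) if per_row else value


# Hypothetical data: ACCC values are 3, 3, and 1, so the sum is 7 over 2 columns.
diff = [[0, 1], [0, 1], [1, 1]]
print(per_accc(diff, [0, 1, 2], [0, 1], core=[0, 1]))  # 3.5
```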
Second step: update the columns by addition and deletion.
The columns are updated by addition and deletion. The conditions for adding and deleting contiguous columns are as follows:
Column addition condition: if the computed perACCC value increases after a column is added, the addition is confirmed;
contiguous columns are first added by expanding to the right; when expansion to the right ends, contiguous columns are added by expanding to the left until that expansion also ends;
Column deletion condition: if the computed perACCC value increases after a column is deleted, the deletion is confirmed;
deletion first proceeds rightward from the leftmost column until it stops, and then proceeds leftward from the rightmost column until it stops;
if the contiguous columns were expanded to the right by adding columns, the columns on the right are not deleted; if they were expanded to the left by adding columns, the columns on the left are not deleted.
An example of updating the column set of a bicluster is shown in Fig. 3.
The bicluster has row set R1 = {b, c, d} and contiguous columns C1 = {2, 3}. First compute the perACCC value of this bicluster, pA = perACCC(R1, C1); the result is pA = 0.83, shown as "before update" in the figure. Contiguous columns are then added by expanding to the right. Adding column 4 is examined first, as shown in step 1 of Fig. 3, where the underlined column is the one currently being considered for addition, here column 4. The contiguous columns would be {2, 3, 4}, for which perACCC is 1.11. Because 1.11 > pA = 0.83, column 4 is added to C1 and the new perACCC value 1.11 is assigned to pA. Expansion continues to the right with column 5, as shown in step 2 of the figure, in the same way as for column 4: the contiguous columns would be {2, 3, 4, 5}, for which perACCC is 1.42. Because 1.42 > pA = 1.11, column 5 is added to C1 and the new perACCC value 1.42 is assigned to pA. Column 5 is the rightmost column, so no further expansion to the right is possible and rightward expansion stops. Expansion to the left then begins. Adding column 1 is examined first, as shown in step 3 of the figure: the contiguous columns would be {1, 2, 3, 4, 5}, for which perACCC is 1.33. Because 1.33 < pA = 1.42, column 1 is not added to C1 and pA keeps its value of 1.42. Column 1 is the leftmost column, so no further expansion to the left is possible and leftward expansion stops.
After the addition step ends, the column deletion step begins. The leftmost of the contiguous columns, column 2, is considered for deletion first, as shown in step 4 of the figure, where the struck-through column is the one currently being considered for deletion. Without column 2 the contiguous columns are {3, 4, 5} and the perACCC value is 1.44. Because 1.44 > pA = 1.42, column 2 is deleted from C1 and the new perACCC value 1.44 is assigned to pA. Deletion of column 3 is considered next, as shown in step 5 of the figure: the contiguous columns are {4, 5} and the perACCC value is 1. Because 1 < pA = 1.44, column 3 is not deleted and pA keeps its value of 1.44. The contiguous columns were {2, 3} after initialization and {2, 3, 4, 5} after column addition, that is, columns 4 and 5 were confirmed during rightward extension, so deletion from right to left does not need to be performed. This round of operations therefore yields the contiguous columns C1 = {3, 4, 5}, shown as "after update" in the figure.
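The column update walked through above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `update_columns` and the `score` callable are hypothetical, the perACCC computation itself is replaced by a lookup table holding only the values quoted in the Fig. 3 example, and the rules "stop at the first non-improving column" and "skip deletion on a side where columns were just added" are assumptions inferred from that example.

```python
def update_columns(score, lo, hi, n_cols):
    """Greedy update of the contiguous column range [lo, hi] (1-based, inclusive).

    `score(lo, hi)` is a hypothetical callable returning the perACCC value of
    the bicluster restricted to columns lo..hi.
    """
    start_lo, start_hi = lo, hi
    best = score(lo, hi)
    # Addition: extend to the right while the score strictly improves.
    while hi < n_cols and score(lo, hi + 1) > best:
        hi += 1
        best = score(lo, hi)
    # Addition: extend to the left while the score strictly improves.
    while lo > 1 and score(lo - 1, hi) > best:
        lo -= 1
        best = score(lo, hi)
    # Deletion: only on a side where nothing was added (assumption).
    if lo == start_lo:
        while lo < hi and score(lo + 1, hi) > best:
            lo += 1
            best = score(lo, hi)
    if hi == start_hi:
        while lo < hi and score(lo, hi - 1) > best:
            hi -= 1
            best = score(lo, hi)
    return lo, hi, best


# perACCC values quoted in the Fig. 3 example, keyed by (first, last) column.
EXAMPLE_SCORES = {
    (2, 3): 0.83, (2, 4): 1.11, (2, 5): 1.42,
    (1, 5): 1.33, (3, 5): 1.44, (4, 5): 1.0,
}
result = update_columns(lambda lo, hi: EXAMPLE_SCORES[(lo, hi)], 2, 3, 5)
print(result)
```

Starting from columns {2, 3}, the sketch visits the same candidate ranges as steps 1 to 5 of the example and ends with columns 3 to 5 and pA = 1.44.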
3rd step: Update the row set by scanning the data matrix and adding the rows that satisfy the condition
The row set is updated by scanning the data matrix and adding the rows that satisfy the following condition.
Row update condition: if, after a row is added, the computed perACCC value is greater than or equal to the perACCC value of the bicluster before the row was added, the addition of that row is confirmed; otherwise the row is not added.
The perACCC value of the bicluster is computed first. The data matrix is then scanned once: for each row, the ACCC value between that row and the bicluster core is computed and divided by the number of contiguous columns of the bicluster core. Every row for which this quotient is greater than or equal to the perACCC value is added to the row set, which updates the row set.
An example of the process of updating the row set of a bicluster is shown in Fig. 4.
The row set obtained in the steps above is R1 = {b, c, d}, the contiguous columns are C1 = {3, 4, 5}, and the bicluster core is S1 = (ce = [1 0 1 1 0], C1 = {3, 4, 5}). The perACCC value of the bicluster after the column update is assigned to pA0, that is, pA0 = 1.44. All rows of the difference matrix are then scanned. Row 1, row a, is examined first: perACCC({a}, C1) = 1. Because 1 < pA0 = 1.44, row a is not added to the row set R1. Row 2, row b, is examined next: perACCC({b}, C1) = 2. Because 2 > pA0 = 1.44, row b is added to the row set R1. Rows c and d are examined in the same way. The resulting row set is R1 = {b, d}. Compared with the original row set {b, c, d} of the bicluster, row c does not satisfy the condition, which is equivalent to deleting row c.
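The row update rule can be sketched as a single filter over per-row scores. This is an illustrative sketch only: `update_rows` is a hypothetical name, the ACCC computation is not reproduced, and the ACCC inputs are assumptions. The values for rows a and b are back-computed from the perACCC values 1 and 2 given in the example (perACCC multiplied by 3 columns), while the values for rows c and d are not stated in the text and are chosen here only so that the outcome matches the stated result R1 = {b, d}.

```python
def update_rows(accc_by_row, n_core_cols, pA0):
    """Keep every row whose ACCC divided by the core's column count is >= pA0."""
    return sorted(
        row for row, accc in accc_by_row.items()
        if accc / n_core_cols >= pA0
    )


# ACCC of each row against the bicluster core over C1 = {3, 4, 5}.
# a and b follow from the example; c and d are assumed for illustration.
accc_by_row = {"a": 3, "b": 6, "c": 3, "d": 6}
new_rows = update_rows(accc_by_row, n_core_cols=3, pA0=1.44)
print(new_rows)
```

Because every row of the matrix is rescanned, a row that was in the previous row set but no longer reaches pA0 (row c here) simply drops out, which is the implicit deletion described in the example.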
4th step: Update the bicluster core by taking the mode, in the same way as the 3rd step of step (2)
The method of updating the bicluster core was described in detail above, so it is not repeated here. After the update, the bicluster core is S1 = (ce = [1 0 1 1 0], C1 = {3, 4, 5}).
5th step: Compute the perACCC value of the updated bicluster from the updated row set, contiguous columns and bicluster core.
The method of computing perACCC was described in detail above, so it is not repeated here.
6th step: Compute the rate of change of the perACCC value of the bicluster before and after the update and compare it with the preset threshold to decide whether to continue iterating.
The decision is made as follows. A rate-of-change threshold is set. The perACCC value computed before the rows, columns and core are updated is denoted pA0, and the perACCC value computed after the update is denoted pA1. If the rate of change from pA0 to pA1 is less than the preset threshold, the iterative update of the bicluster stops and the method proceeds to step (4).
An example of the iterative update and the stopping decision is shown in Fig. 5.
In this example the rate-of-change threshold is set to deta = 0.1, that is, iteration stops only when the rate of change of the perACCC value is less than 0.1, as shown in Fig. 5. Before the bicluster is updated, its perACCC value is pA0 = 0.83; after the update, the computed perACCC value is pA1 = 2. This gives (pA1 - pA0)/pA0 = 1.41, and because 1.41 > deta = 0.1 a second iteration is performed. The operations of the second iteration are the same as those of the first and are not described again here. Before the bicluster is updated in the second iteration, its perACCC value is pA0 = 2; after the update, the computed perACCC value is pA1 = 2. This gives (pA1 - pA0)/pA0 = 0, and because 0 < deta = 0.1 the iteration stops. The row set and column set of the bicluster are (R1, C1) = ({b, d}, {3, 4, 5}).
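The stopping rule used in the example above can be written directly from the formula (pA1 - pA0)/pA0. The following sketch is only an illustration; the function names are hypothetical, the threshold name `deta` is taken from the text, and the code assumes pA0 is non-zero.

```python
def change_rate(pA0, pA1):
    """Rate of change of perACCC from before (pA0) to after (pA1) an update."""
    return (pA1 - pA0) / pA0  # assumes pA0 != 0


def should_stop(pA0, pA1, deta):
    """Stop iterating once the rate of change falls below the threshold."""
    return change_rate(pA0, pA1) < deta


first = change_rate(0.83, 2)    # first iteration of the Fig. 5 example
second = change_rate(2, 2)      # second iteration
print(round(first, 2), should_stop(0.83, 2, 0.1))
print(second, should_stop(2, 2, 0.1))
```

With deta = 0.1 the first iteration does not satisfy the stopping rule and the second one does, matching the two passes described in the example.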
4. Apply the row and column thresholds to the output and obtain the bicluster, completing the mining of the time-series gene chip data.
A pair of row and column thresholds is set. If the bicluster does not meet the preset row and column thresholds, the bicluster is reinitialized and the method returns to step (2). If the bicluster meets the row and column thresholds, the biclustering result is obtained and the mining of the time-series gene chip data is complete.
In this example the row threshold is set to min_row = 2 and the column threshold to min_col = 2. The row set and column set of the bicluster are (R1, C1) = ({b, d}, {3, 4, 5}). The bicluster has 2 rows and 3 columns, which meets min_row = 2 and min_col = 2, so it is output. The final biclustering result is therefore (R1, C1) = ({b, d}, {3, 4, 5}). A line chart of the output bicluster is shown in Fig. 6.
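The output check can be sketched as a simple size test. The function name is hypothetical, and the use of "greater than or equal to" is an assumption made so that the 2-row bicluster of the example passes min_row = 2; the text does not state whether the comparison is strict.

```python
def meets_thresholds(rows, cols, min_row, min_col):
    """True if the bicluster has at least min_row rows and min_col columns."""
    return len(rows) >= min_row and len(cols) >= min_col


accepted = meets_thresholds({"b", "d"}, {3, 4, 5}, min_row=2, min_col=2)
print(accepted)
```

A bicluster that fails this check would be discarded and the method would restart from initialization in step (2).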
Following the flow described above, a further example is given below.
1. Source of the data matrices
The real data matrix is yeast chip data. It is time-series gene chip data taken from the yeast experiments of Cho et al. (R.J. Cho et al., 1998). The sampling interval is 10 minutes and there are 17 sampling points, each recording the expression levels of the genes. Each row of the data matrix represents the expression of one gene at the different time points, and each column represents the expression of all genes at one time point. Missing values in the data matrix are filled in by cubic spline interpolation, and the resulting data matrix has size 6147 × 17.
The synthetic data matrices are randomly generated, with values between 0 and 1 drawn from a uniform distribution. Several data sets are generated. In one group the number of columns is fixed at 20 and the number of rows increases from 1000 to 2000 in increments of 100 rows. In the other group the number of rows is fixed at 1000 and the number of columns increases from 20 to 40 in increments of 2 columns.
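The synthetic data sets described above could be generated along the following lines. This is a sketch using only the Python standard library; the function name, the fixed seed and the list-of-lists representation are assumptions, since the text specifies only the value range, the distribution and the matrix sizes.

```python
import random


def uniform_matrix(n_rows, n_cols, rng):
    """Matrix of independent values drawn uniformly from [0, 1)."""
    return [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]


# Group 1: 20 columns, rows from 1000 to 2000 in steps of 100.
row_scaling_shapes = [(n_rows, 20) for n_rows in range(1000, 2001, 100)]
# Group 2: 1000 rows, columns from 20 to 40 in steps of 2.
col_scaling_shapes = [(1000, n_cols) for n_cols in range(20, 41, 2)]

rng = random.Random(0)  # fixed seed so the data sets are reproducible
sample = uniform_matrix(*row_scaling_shapes[0], rng)
print(len(row_scaling_shapes), len(col_scaling_shapes), len(sample), len(sample[0]))
```

Each group contains 11 matrix sizes, and a matrix for any of them is obtained by passing the shape to `uniform_matrix`.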
2. GO analysis of the mined biclustering results
GO (Gene Ontology) annotation is used to examine the biological significance of the biclusters and to validate the clustering results. The biological meaning of the biclustering results is analyzed with GOToolbox using the P-value. The P-value between each discovered pattern and the known categories is computed, and the similarity of the Gene Ontology annotations of the genes in the same group is observed, which reveals the biological meaning of the bicluster. Genes with similar expression data usually belong to the same biological pathway and have similar biological processes and molecular functions. In the GO annotation experiments on the discovered biclusters, the smaller the P-value, the stronger the association between the discovered pattern and the known category.
Part of the GO analysis results for the yeast chip data is shown in Table 3. The first row of Table 3 is GO:0002181. The gene database contains 7166 genes in total, of which only 186 (2.6%) belong to the GO term cytoplasmic translation. The bicluster found by the time-series gene chip data mining method based on coherent evolution across all contiguous columns contains 754 genes, of which 90 (11.9%) belong to the GO term cytoplasmic translation. The computed p-value is very small (4.62E-37), which shows that this bicluster has significant biological meaning. On the yeast chip data, the GO analysis shows that the biclustering results found have significant biological meaning.
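To illustrate how an enrichment p-value of this kind can arise from the counts quoted for GO:0002181, the sketch below computes a hypergeometric upper-tail probability with exact integer arithmetic. This is an assumption about the statistic: the text states only that GOToolbox was used, not which test or correction it applied, so the sketch is not claimed to reproduce the reported value of 4.62E-37. The function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


background_pct = 100 * 186 / 7166   # share of annotated genes in the database
bicluster_pct = 100 * 90 / 754      # share of annotated genes in the bicluster
p_value = hypergeom_upper_tail(7166, 186, 754, 90)
print(round(background_pct, 1), round(bicluster_pct, 1), p_value)
```

The two percentages reproduce the 2.6% and 11.9% given in the text, and the tail probability is extremely small, consistent with the conclusion that the term is strongly over-represented in the bicluster.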
Table 3. Partial GO analysis results for the biclustering results
3. Biological enrichment analysis of the mined biclusters
To verify the biological meaning of the biclusters in a statistical sense, the biological function enrichment of the biclustering results is analyzed (Al-Akwaa & Kadah, 2009; et al., 2006). The P-value between each discovered bicluster and the known categories is computed, and the percentage of biclusters whose P-value is below a preset P-value threshold, out of all biclusters, is counted. This percentage indicates the statistical significance of the biclusters in terms of biological process enrichment. For the same, relatively low P-value setting, a larger percentage means that the discovered biclusters as a whole are more strongly associated with the known gene categories, and therefore that their biological meaning is more significant.
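The enrichment percentage described above amounts to counting the biclusters whose P-value falls below the chosen threshold. A minimal sketch follows; the function name is hypothetical and the P-values in the usage example are made up purely for illustration, not taken from the experiments.

```python
def enriched_percentage(p_values, threshold):
    """Percentage of biclusters whose P-value is below `threshold`."""
    if not p_values:
        raise ValueError("at least one bicluster P-value is required")
    hits = sum(1 for p in p_values if p < threshold)
    return 100.0 * hits / len(p_values)


# Hypothetical P-values for four biclusters (illustrative only).
example_p_values = [1e-12, 3e-6, 0.02, 0.4]
pct = enriched_percentage(example_p_values, threshold=1e-5)
print(pct)
```

Comparing methods at the same threshold, as in the bar chart of the enrichment ratios, then reduces to comparing these percentages.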
Biological enrichment analysis was carried out on the yeast chip data, comparing several clustering and biclustering methods. A bar chart of the enrichment ratios of the biclustering results is shown in Fig. 7. The figure shows that the biological enrichment obtained by the method based on coherent evolution across all contiguous columns is clearly higher than that of the CCC, CC-TSB, CCC, OPSM, xMotifs and K-means methods. The enrichment analysis on the real gene expression data set shows that the time-series gene chip data mining method based on coherent evolution across all contiguous columns can extract more valuable, biologically significant biclusters from gene expression data.
4. Scalability analysis of the mining method
The relationship between program run time and data set size is analyzed. Fig. 8 and Fig. 9 show the run time of the method for data matrices of different sizes, and the two figures show similar trends. For Fig. 8, the total number of columns of the data matrix is fixed at 20, and the number of rows increases from 1000 to 2000 in increments of 100. Fig. 8 shows that the run time of the method increases linearly with the number of rows. For Fig. 9, the total number of rows of the data matrix is fixed at 2000, and the number of columns increases from 20 to 40 in increments of 2. Fig. 9 shows that the run time of the method increases approximately linearly with the number of columns.
The experimental results above therefore show that the time-series gene chip data mining method based on coherent evolution across all contiguous columns scales well as the amount of data keeps growing.
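A run-time measurement of the kind summarized above can be set up with a small timing harness. The sketch below is only an illustration of the procedure: `time_runs` and `placeholder_mine` are hypothetical names, and the placeholder workload is a stand-in that makes one pass over adjacent columns, not the mining method itself.

```python
import random
import time


def time_runs(mine, shapes, seed=0):
    """Time `mine` on a uniform random matrix of each (rows, cols) shape."""
    rng = random.Random(seed)
    results = []
    for n_rows, n_cols in shapes:
        data = [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]
        start = time.perf_counter()
        mine(data)
        results.append((n_rows, n_cols, time.perf_counter() - start))
    return results


def placeholder_mine(data):
    """Stand-in workload: count increases between adjacent time points."""
    return sum(1 for row in data for a, b in zip(row, row[1:]) if b > a)


# Row scaling as described for Fig. 8: 20 columns, 1000 to 2000 rows.
timings = time_runs(placeholder_mine, [(r, 20) for r in range(1000, 2001, 100)])
for n_rows, n_cols, seconds in timings:
    print(n_rows, n_cols, f"{seconds:.6f}")
```

Replacing `placeholder_mine` with the actual mining routine and plotting the measured times against the number of rows or columns would give curves of the kind shown in Fig. 8 and Fig. 9.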
The examples above further illustrate that the invention has the following advantages:
(1) The method takes the information of adjacent time points into account and can effectively mine biclusters that carry time-series information. It reveals how gene expression varies with time, and from this the various associations among different genes, the regulatory relationships between genes, and similar information can be learned.
(2) The method considers the coherent evolution information of all contiguous columns, so the coherent evolution information of short runs of contiguous columns is not lost. It can effectively find biclusters with significant biological meaning and provides knowledge about the co-regulation of certain physiological processes.
(3) The method is efficient: it uses few resources and completes bicluster mining in a relatively short time. Its run time is approximately linear in the data scale and does not grow sharply as the data scale increases, which gives it practical value.