CN102419774B

CN102419774B - Method for clustering single nucleotide polymorphism (SNP) data

Info

Publication number: CN102419774B
Application number: CN 201110418812
Authority: CN
Inventors: 吴悦; 贾敏; 雷州; 刘宗田
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2011-12-15
Filing date: 2011-12-15
Publication date: 2013-04-03
Anticipated expiration: 2031-12-15
Also published as: CN102419774A

Abstract

The invention discloses a method for clustering single nucleotide polymorphism (SNP) data. The method comprises the following steps: first, pre-processing the original SNP data and converting it into a data format that the method can process; second, performing grid division on the pre-processed SNP data, dividing each dimension of the SNP data into three grids according to the expression value of each SNP site in each sample; third, calculating the density of the divided grids to obtain the subspaces that contain clusters; fourth, clustering the obtained subspaces to obtain the classified SNP data, wherein each cluster is a set of co-expressed SNP sites; and finally, storing the clustering result in a file. The method addresses the problem of clustering high-dimensional categorical data and clusters SNP data quickly and with high quality.

Description

A clustering method for SNP data

Technical field

The present invention relates to techniques for clustering large-scale, high-dimensional categorical data, and in particular to a clustering method for SNP data. It belongs to the field of computer technology.

Background art

Clustering of high-dimensional data has become an important research direction in data mining. As technology advances, data collection becomes ever easier, so databases grow larger and more complex. Examples include trade transaction data of various types, Web documents and gene expression data, whose dimensionality (number of attributes) commonly reaches hundreds or thousands, or even more.

SNP is the abbreviation of single nucleotide polymorphism, which mainly refers to DNA sequence polymorphism at the genome level caused by variation of a single nucleotide. Current studies estimate that the human genome contains 3,000,000 SNP sites. SNP data is the expression data of SNP sites in samples and is high-dimensional categorical data.

Owing to the "curse of dimensionality", many clustering methods that perform well on low-dimensional data often fail to produce good clusterings in high-dimensional spaces. To address this, R. Agrawal first proposed the concept of subspace clustering for high-dimensional data. However, that subspace clustering algorithm applies only to continuous data and is not suitable for categorical data.

Summary of the invention

To solve the above problem in the prior art, the object of the present invention is to provide a clustering method for SNP data that extends the data types handled by the subspace clustering algorithm to categorical data, effectively solves the clustering problem for SNP data, and improves clustering accuracy. To achieve this object, the technical solution adopted by the present invention comprises the following operating steps:

A. Pre-process the original SNP data, converting it into a data format that the clustering method can process;

B. Perform grid division on the pre-processed SNP data;

C. Calculate the density of the divided grids to obtain the subspaces that contain clusters;

D. Cluster the subspaces obtained in step C to obtain the classified SNP data;

E. Save the clustering result to a file.

In step A above, the original SNP data is pre-processed and converted into a data format that the clustering method can process, as follows:

A1) Data encoding: in the data exported from SNP chip detection, each SNP site carries one genotyping result. There are four possible results: wild-type homozygous AA, mutant heterozygous AB, mutant homozygous BB, and the genotyping-failure flag NC. AA is encoded as 0, AB as 1, and BB as 2;

A2) Data cleaning: if an entire row consists of NC values, the row is deleted; if a row contains a few NC values, each is replaced with the value of the next sample at the same site; if more than 10% of the values in a row are NC, the entire row is deleted.
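The encoding and cleaning rules of steps A1) and A2) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the use of `None` for NC, and the order of the cleaning checks are assumptions. The source does not say what to do when an NC value sits in the last sample (there is no next sample), so this sketch falls back to the previous sample in that case.

```python
NC = None  # assumed in-memory marker for a failed genotype call
CODES = {"AA": 0, "AB": 1, "BB": 2}


def encode_row(calls):
    """Step A1: map AA/AB/BB to 0/1/2; anything else (e.g. "NC") becomes NC."""
    return [CODES.get(call, NC) for call in calls]


def fill_nc(row):
    """Replace each NC with the value of the next sample at the same site."""
    filled = list(row)
    n = len(filled)
    # Walk right to left so a run of NC values inherits the value after the run.
    for j in range(n - 2, -1, -1):
        if filled[j] is NC:
            filled[j] = filled[j + 1]
    # Assumption: a trailing NC has no next sample, so reuse the previous one.
    for j in range(1, n):
        if filled[j] is NC:
            filled[j] = filled[j - 1]
    return filled


def clean(rows, max_nc_fraction=0.10):
    """Step A2: drop rows with too many NC values, then fill the rest."""
    cleaned = []
    for row in rows:
        nc_count = sum(value is NC for value in row)
        # With the default threshold an all-NC row is dropped here as well.
        if nc_count > max_nc_fraction * len(row):
            continue
        cleaned.append(fill_nc(row))
    return cleaned
```

Each row here stands for one SNP site and each column for one sample, which matches the wording "the next sample at the same site".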

In step B above, grid division of the pre-processed SNP data proceeds as follows: according to the expression value of each SNP site in each sample, every dimension of the SNP data is divided into 3 grids. The spatial dimension is denoted K, and at this point K=1.
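One possible reading of the grid division is that each sample is one dimension and each of the three encoded genotypes (0, 1, 2) is one grid along it, so a one-dimensional grid is a (sample, value) pair holding the SNP sites that take that value. The Python sketch below builds these grids under that reading; the function name and the dictionary layout are illustrative assumptions.

```python
from collections import defaultdict


def one_dim_grids(matrix):
    """Map each (sample, value) grid to the set of SNP site indices inside it.

    matrix[site][sample] is the encoded genotype (0, 1 or 2).
    Grids that contain no site are simply absent from the result.
    """
    grids = defaultdict(set)
    for site, row in enumerate(matrix):
        for sample, value in enumerate(row):
            grids[(sample, value)].add(site)
    return dict(grids)
```

Storing the member sites as a set of integers per grid makes the later intersection of two grids a single set operation.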

In step C above, a subspace is a set of dense units, and a dense unit is a grid whose density is greater than the density threshold. Calculating the density of the divided grids and obtaining the subspaces that contain clusters proceeds as follows:

C1) Calculate the density of all grids in the K-dimensional space to obtain the set of dense units in the K-dimensional space;

C2) Generate the set of (K+1)-dimensional candidate dense units according to the generation rule for the K-dimensional candidate dense unit set;

C3) Determine whether the set of (K+1)-dimensional candidate dense units is empty. If it is not empty, set K=K+1 and go to step C2) to continue generating higher-dimensional subspaces. If it is empty, the K-dimensional subspace is the highest-dimensional subspace that contains clusters; go to step D above.
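Steps C1) to C3) form a bottom-up loop, which might look like the Python sketch below. It assumes that density is the fraction of all SNP sites that fall in a grid (the source does not define density precisely), that each grid is a key mapped to its set of SNP site indices, and that the candidate generator is passed in as a function. All names are hypothetical.

```python
def dense_units(grids, n_sites, threshold):
    """Step C1: keep the grids whose density is greater than the threshold."""
    return {
        unit: members
        for unit, members in grids.items()
        if len(members) / n_sites > threshold
    }


def find_subspaces(grids_1d, n_sites, threshold, generate_candidates):
    """Steps C1-C3: grow dense units one dimension at a time.

    Returns a list whose entry K-1 holds the dense units of dimension K.
    generate_candidates takes the K-dimensional dense units and returns the
    (K+1)-dimensional candidates in the same {unit: members} layout.
    """
    levels = []
    current = dense_units(grids_1d, n_sites, threshold)
    while current:
        levels.append(current)
        candidates = generate_candidates(current)  # step C2
        if not candidates:  # step C3: nothing left to extend
            break
        current = dense_units(candidates, n_sites, threshold)
    return levels
```

The loop also stops when candidates exist but none of them is dense, since there is then nothing to extend in the next round.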

The generation rule for the K-dimensional candidate dense unit set in step C2) above is:

Input: $D_{K-1}$, the set of all (K-1)-dimensional dense units.

Output: $C_K$, the set of K-dimensional candidate dense units.

$C_K$ is generated as follows: take any two dense units $u_1$ and $u_2$ from $D_{K-1}$. If and only if $u_1$ and $u_2$ have the same expression data in the first K-2 samples and different expression data in the (K-1)-th sample, with $u_1.a_{K-1} < u_2.a_{K-1}$, take the expression data of $u_1$ in the first K-2 samples, the expression data of $u_1$ in the (K-1)-th sample and the expression data of $u_2$ in the (K-1)-th sample, and join them. The result is a K-dimensional candidate dense unit $c$, and the set of all such $c$ is $C_K$.

Here $a_i$ denotes the vector of the i-th dimension, and "<" denotes lexicographic order.
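The join described above resembles an Apriori-style self-join. A hedged Python sketch follows. It assumes that each dense unit is a tuple of (sample, value) pairs sorted by sample index, that "<" compares the sample index of the last pair, and that the members of a candidate are the SNP sites common to both parents. These representation choices are assumptions made for illustration.

```python
def generate_candidates(dense):
    """Join (K-1)-dimensional dense units into K-dimensional candidates.

    dense maps a unit, written as a tuple of (sample, value) pairs sorted by
    sample, to the set of SNP site indices it contains.
    """
    candidates = {}
    units = sorted(dense)
    for u1 in units:
        for u2 in units:
            # The first K-2 (sample, value) pairs must be identical.
            if u1[:-1] != u2[:-1]:
                continue
            # The last dimensions must differ and be ordered, so that each
            # candidate is produced exactly once.
            if not u1[-1][0] < u2[-1][0]:
                continue
            # A site belongs to the candidate only if it lies in both parents.
            candidates[u1 + (u2[-1],)] = dense[u1] & dense[u2]
    return candidates
```

Candidates with few or no members are kept here on purpose; under this sketch the density check of step C1) is what discards them.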

In step D above, obtaining the classified SNP data means that after clustering each class is a set of co-expressed SNP sites. Clustering the subspaces obtained in step C proceeds as follows:

D1) Map each subspace to a graph $G = \langle V, E \rangle$, where $G$ is the graph, $V$ is its set of vertices and $E$ is its set of edges. A vertex $v_i$ represents the set of SNP sites in dense unit i of the subspace, and an edge $e_{ij}$ indicates that dense units i and j are coplanar (share a common face);

D2) Run a depth-first search on the graph built in step D1) to obtain its connected subgraphs. Each connected subgraph is one cluster, and each cluster is a set of co-expressed SNP sites.

In step D1) above, an edge $e_{ij}$ indicating that dense units i and j are coplanar means the following: the two dense units i and j have the same expression data in K-1 samples and differ in the expression data of exactly one sample.

SNP data is biallelic, so the dense units in the one-dimensional state are coplanar.
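Steps D1) and D2) amount to finding the connected components of a graph. The Python sketch below uses an iterative depth-first search. It assumes that `dense` holds the dense units of a single dimensionality K, that each unit is a tuple of (sample, value) pairs sorted by sample, and that two units are joined by an edge when they span the same samples and differ in exactly one value. The function names are hypothetical.

```python
def share_face(u, v):
    """True when u and v cover the same samples and differ in one value."""
    if [sample for sample, _ in u] != [sample for sample, _ in v]:
        return False
    differences = sum(a != b for (_, a), (_, b) in zip(u, v))
    return differences == 1


def cluster_subspace(dense):
    """Return one set of SNP sites per connected component of dense units."""
    units = list(dense)
    seen = set()
    clusters = []
    for start in units:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        sites = set()
        while stack:  # depth-first search with an explicit stack
            unit = stack.pop()
            sites |= dense[unit]
            for other in units:
                if other not in seen and share_face(unit, other):
                    seen.add(other)
                    stack.append(other)
        clusters.append(sites)
    return clusters
```

For K=1 this rule connects every pair of dense units on the same sample, which is consistent with the statement that one-dimensional dense units are coplanar.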

In step E above, the clustering result comprises the total number of clusters, the number of clusters in the subspaces of each dimensionality, and the set of SNP sites in each cluster.
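The source does not specify the layout of the result file. As one possibility, the Python sketch below writes a plain-text report with the total number of clusters, the number of clusters per dimensionality K, and the SNP sites of each cluster. The function names and the text format are assumptions.

```python
def format_result(levels):
    """Render the clustering result as text.

    levels[K-1] is the list of clusters found at dimensionality K; each
    cluster is a set of SNP site indices.
    """
    total = sum(len(clusters) for clusters in levels)
    lines = [f"total_clusters={total}"]
    for k, clusters in enumerate(levels, start=1):
        lines.append(f"K={k} clusters={len(clusters)}")
        for index, sites in enumerate(clusters, start=1):
            members = " ".join(str(site) for site in sorted(sites))
            lines.append(f"  cluster {index}: {members}")
    return "\n".join(lines) + "\n"


def save_result(levels, path):
    """Step E: write the rendered result to a file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_result(levels))
```

Keeping the formatting separate from the file write makes the report easy to inspect or test without touching the disk.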

Compared with the prior art, the clustering method for SNP data of the present invention has the following evident features and advantages:

(1) The present invention extends the data types handled by the subspace clustering algorithm to categorical data, effectively solving the clustering problem for SNP data.

(2) Unlike traditional subspace clustering algorithms, the present invention performs no pruning operation. Although this lowers the processing efficiency of the algorithm, it ensures that no useful information is pruned away, which improves clustering accuracy and clustering quality.

(3) Unlike conventional subspace clustering algorithms, the present invention stores the dense unit sets in a List<Set<Integer>> structure. This makes insertion and deletion on the dense unit sets convenient and efficient and greatly improves memory utilization, which in turn greatly improves the execution efficiency of the algorithm and compensates for the delay incurred by not pruning.

Description of drawings

Fig. 1 is a flow chart of the clustering method for SNP data of the present invention;

Fig. 2 is a flow chart of pre-processing the original SNP data in the present invention;

Fig. 3 is a flow chart of calculating the density of the divided grids and obtaining the subspaces that contain clusters in the present invention;

Fig. 4 is a flow chart of clustering the obtained subspaces in the present invention.

Embodiment

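The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Embodiment one:

Referring to Fig. 1, the clustering method for SNP data is characterized by the following operating steps:

A. Pre-process the original SNP data, converting it into a data format that the clustering method can process;

B. Perform grid division on the pre-processed SNP data;

C. Calculate the density of the divided grids to obtain the subspaces that contain clusters;

D. Cluster the subspaces obtained in step C to obtain the classified SNP data;

E. Save the clustering result to a file.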

Embodiment two:

This embodiment is essentially the same as Embodiment one, with the following particulars:

Referring to Fig. 2 to Fig. 4, in step A the original SNP data is pre-processed and converted into a data format that the clustering method can process, as follows:

A1) Data encoding: in the data exported from SNP chip detection, each SNP site carries one genotyping result. There are four possible results: wild-type homozygous AA, mutant heterozygous AB, mutant homozygous BB, and the genotyping-failure flag NC. AA is encoded as 0, AB as 1, and BB as 2;

A2) Data cleaning: if an entire row consists of NC values, the row is deleted; if a row contains a few NC values, each is replaced with the value of the next sample at the same site; if more than 10% of the values in a row are NC, the entire row is deleted.

In step B, grid division of the pre-processed SNP data proceeds as follows: according to the expression value of each SNP site in each sample, every dimension of the SNP data is divided into 3 grids. The spatial dimension is denoted K, and at this point K=1.

In step C, a subspace is a set of dense units, and a dense unit is a grid whose density is greater than the density threshold. Calculating the density of the divided grids and obtaining the subspaces that contain clusters proceeds as follows:

C1) Calculate the density of all grids in the K-dimensional space to obtain the set of dense units in the K-dimensional space;

C2) Generate the set of (K+1)-dimensional candidate dense units according to the generation rule for the K-dimensional candidate dense unit set;

C3) Determine whether the set of (K+1)-dimensional candidate dense units is empty. If it is not empty, set K=K+1 and go to step C2) to continue generating higher-dimensional subspaces. If it is empty, the K-dimensional subspace is the highest-dimensional subspace that contains clusters; go to step D.

The generation rule for the K-dimensional candidate dense unit set in step C2) is:

Input: $D_{K-1}$, the set of all (K-1)-dimensional dense units.

Output: $C_K$, the set of K-dimensional candidate dense units.

$C_K$ is generated as follows: take any two dense units $u_1$ and $u_2$ from $D_{K-1}$. If and only if $u_1$ and $u_2$ have the same expression data in the first K-2 samples and different expression data in the (K-1)-th sample, with $u_1.a_{K-1} < u_2.a_{K-1}$, take the expression data of $u_1$ in the first K-2 samples, the expression data of $u_1$ in the (K-1)-th sample and the expression data of $u_2$ in the (K-1)-th sample, and join them. The result is a K-dimensional candidate dense unit $c$, and the set of all such $c$ is $C_K$.

Here $a_i$ denotes the vector of the i-th dimension, and "<" denotes lexicographic order.

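In step D, obtaining the classified SNP data means that after clustering each class is a set of co-expressed SNP sites. Clustering the subspaces obtained in step C proceeds as follows:

D1) Map each subspace to a graph $G = \langle V, E \rangle$, where $G$ is the graph, $V$ is its set of vertices and $E$ is its set of edges. A vertex $v_i$ represents the set of SNP sites in dense unit i of the subspace, and an edge $e_{ij}$ indicates that dense units i and j are coplanar (share a common face);

D2) Run a depth-first search on the graph built in step D1) to obtain its connected subgraphs. Each connected subgraph is one cluster, and each cluster is a set of co-expressed SNP sites.

In step D1), an edge $e_{ij}$ indicating that dense units i and j are coplanar means the following: the two dense units i and j have the same expression data in K-1 samples and differ in the expression data of exactly one sample.

In step E, the clustering result comprises the total number of clusters, the number of clusters in the subspaces of each dimensionality, and the set of SNP sites in each cluster.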

Embodiment three:

Referring to Fig. 1 to Fig. 4, the clustering method for SNP data of the present invention is illustrated by clustering the SNP data of hypertension patients. The specific steps are as follows:

(1) Pre-process the original SNP data, converting it into a data format that the clustering method can process. As shown in Fig. 2, the specific steps are as follows:

a) Data encoding: in the data exported from SNP chip detection, each SNP site carries one genotyping result. There are four possible results: wild-type homozygous AA, mutant heterozygous AB, mutant homozygous BB, and the genotyping-failure flag NC. AA is encoded as 0, AB as 1, and BB as 2;

b) Data cleaning: if an entire row consists of NC values, the row is deleted; if a row contains a few NC values, each is replaced with the value of the next sample at the same site; if more than 10% of the values in a row are NC, the entire row is deleted.

(2) Perform grid division on the pre-processed SNP data.

Specifically, according to the expression value of each SNP site in each sample, every dimension of the SNP data is divided into 3 grids. The spatial dimension is denoted K, and at this point K=1.

(3) Calculate the density of the divided grids to obtain the subspaces that contain clusters.

Here a subspace is a set of dense units, and a dense unit is a grid whose density is greater than the density threshold. As shown in Fig. 3, the specific steps are as follows:

a) Calculate the density of all grids in the K-dimensional space to obtain the set of dense units in the K-dimensional space;

b) Generate the set of (K+1)-dimensional candidate dense units according to the generation rule for the K-dimensional candidate dense unit set;

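The generation rule for the K-dimensional candidate dense unit set is:

Input: $D_{K-1}$, the set of all (K-1)-dimensional dense units.

Output: $C_K$, the set of K-dimensional candidate dense units.

$C_K$ is generated as follows: take any two dense units $u_1$ and $u_2$ from $D_{K-1}$. If and only if $u_1$ and $u_2$ have the same expression data in the first K-2 samples and different expression data in the (K-1)-th sample, with $u_1.a_{K-1} < u_2.a_{K-1}$, take the expression data of $u_1$ in the first K-2 samples, the expression data of $u_1$ in the (K-1)-th sample and the expression data of $u_2$ in the (K-1)-th sample, and join them. The result is a K-dimensional candidate dense unit $c$, and the set of all such $c$ is $C_K$.

Here $a_i$ denotes the vector of the i-th dimension, and "<" denotes lexicographic order.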

c) Determine whether the set of (K+1)-dimensional candidate dense units is empty. If it is not empty, set K=K+1 and go to step b) to continue generating higher-dimensional subspaces. If it is empty, the K-dimensional subspace is the highest-dimensional subspace that contains clusters; go to step (4).

(4) Cluster the subspaces obtained in step (3) to obtain the classified SNP data.

Obtaining the classified SNP data means that after clustering each class is a set of co-expressed SNP sites. As shown in Fig. 4, the specific steps are as follows:

a) Map each subspace to a graph $G = \langle V, E \rangle$, where $G$ is the graph, $V$ is its set of vertices and $E$ is its set of edges. A vertex $v_i$ represents the set of SNP sites in dense unit i of the subspace, and an edge $e_{ij}$ indicates that dense units i and j are coplanar (share a common face).

Here an edge $e_{ij}$ indicating that dense units i and j are coplanar means the following: the two dense units i and j have the same expression data in K-1 samples and differ in the expression data of exactly one sample;

b) Run a depth-first search on the graph built in step a) to obtain its connected subgraphs. Each connected subgraph is one cluster, and each cluster is a set of co-expressed SNP sites.

(5) Save the clustering result to a file.

The clustering result comprises the total number of clusters, the number of clusters in the subspaces of each dimensionality, and the set of SNP sites in each cluster.

Experimental results show that the present invention extends the data types handled by the subspace clustering algorithm to categorical data, effectively solves the clustering problem for SNP data, and improves clustering accuracy and clustering quality.

The clustering method for SNP data of the present invention has been described in detail above. The description is intended only to help in understanding the method and its core idea. A person of ordinary skill in the art may, following the method and idea of the present invention, vary the specific embodiments and the scope of application. In summary, this description should not be construed as limiting the present invention.

Claims

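1. A clustering method for SNP data, characterized by the following operating steps:

A. Pre-process the original SNP data, converting it into a data format that the clustering method can process, specifically:

A1) Data encoding: in the data exported from SNP chip detection, each SNP site carries one genotyping result; there are four possible results, namely wild-type homozygous AA, mutant heterozygous AB, mutant homozygous BB, and the genotyping-failure flag NC; AA is encoded as 0, AB as 1, and BB as 2;

A2) Data cleaning: if an entire row consists of NC values, the row is deleted; if a row contains a few NC values, each is replaced with the value of the next sample at the same site; if more than 10% of the values in a row are NC, the entire row is deleted;

B. Perform grid division on the pre-processed SNP data;

C. Calculate the density of the divided grids to obtain the subspaces that contain clusters;

D. Cluster the subspaces obtained in step C to obtain the classified SNP data;

E. Save the clustering result to a file.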

2. The clustering method for SNP data according to claim 1, characterized in that in step B the grid division of the pre-processed SNP data proceeds as follows: according to the expression value of each SNP site in each sample, every dimension of the SNP data is divided into 3 grids; the spatial dimension is denoted K, and at this point K=1.

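3. The clustering method for SNP data according to claim 1, characterized in that in step C a subspace is a set of dense units, and a dense unit is a grid whose density is greater than the density threshold; calculating the density of the divided grids and obtaining the subspaces that contain clusters proceeds as follows:

C1) Calculate the density of all grids in the K-dimensional space to obtain the set of dense units in the K-dimensional space;

C2) Generate the set of (K+1)-dimensional candidate dense units according to the generation rule for the K-dimensional candidate dense unit set, specifically:

Input: $D_{K-1}$, the set of all (K-1)-dimensional dense units;

Output: $C_K$, the set of K-dimensional candidate dense units;

$C_K$ is generated as follows: take any two dense units $u_1$ and $u_2$ from $D_{K-1}$; if and only if $u_1$ and $u_2$ have the same expression data in the first K-2 samples and different expression data in the (K-1)-th sample, with $u_1.a_{K-1} < u_2.a_{K-1}$, take the expression data of $u_1$ in the first K-2 samples, the expression data of $u_1$ in the (K-1)-th sample and the expression data of $u_2$ in the (K-1)-th sample, and join them; the result is a K-dimensional candidate dense unit $c$, and the set of all such $c$ is $C_K$;

where $a_i$ denotes the vector of the i-th dimension, and "<" denotes lexicographic order;

C3) Determine whether the set of (K+1)-dimensional candidate dense units is empty; if it is not empty, set K=K+1 and go to step C2) to continue generating higher-dimensional subspaces; if it is empty, the K-dimensional subspace is the highest-dimensional subspace that contains clusters, and the method goes to step D.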

4. The clustering method for SNP data according to claim 1, characterized in that in step D obtaining the classified SNP data means that after clustering each class is a set of co-expressed SNP sites; clustering the subspaces obtained in step C proceeds as follows:

D1) Map each subspace to a graph $G = \langle V, E \rangle$, where $G$ is the graph, $V$ is its set of vertices and $E$ is its set of edges; a vertex $v_i$ represents the set of SNP sites in dense unit i of the subspace, and an edge $e_{ij}$ indicates that dense units i and j are coplanar; the edge $e_{ij}$ indicating that dense units i and j are coplanar means that the two dense units i and j have the same expression data in K-1 samples and differ in the expression data of exactly one sample; SNP data is biallelic, so the dense units in the one-dimensional state are coplanar;

D2) Run a depth-first search on the graph built in step D1) to obtain its connected subgraphs; each connected subgraph is one cluster, and each cluster is a set of co-expressed SNP sites.

5. The clustering method for SNP data according to claim 1, characterized in that in step E the clustering result comprises the total number of clusters, the number of clusters in the subspaces of each dimensionality, and the set of SNP sites in each cluster.