CN102419774B - Method for clustering single nucleotide polymorphism (SNP) data - Google Patents
- Publication number: CN102419774B
- Application number: CN 201110418812
- Authority: CN (China)
- Prior art keywords: data, SNP, dense cell, cluster, dimension
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a method for clustering single nucleotide polymorphism (SNP) data. The method comprises the following steps: first, pre-processing the original SNP data and converting them into a data format that the method can process; second, performing grid division on the pre-processed SNP data, dividing each dimension of the SNP data into three grids according to the expression value of each SNP site in each sample; third, calculating the density of the divided grids to obtain the subspaces that contain clusters; fourth, clustering the obtained subspaces to obtain the classified SNP data, wherein each cluster is a set of co-expressed SNP sites; and finally, storing the clustering result in a file. The method addresses the problem of clustering high-dimensional categorical data and allows SNP data to be clustered quickly and with high quality.
Description
Technical field
The present invention relates to techniques for clustering large-scale, high-dimensional categorical data, and in particular to a clustering method for SNP data. It belongs to the field of computer technology.
Background art
Clustering high-dimensional data has become an important research direction in data mining. As technological progress makes data collection ever easier, databases grow larger and more complex. Examples include trade transaction data of various kinds, Web documents, and gene expression data, whose dimensionality (number of attributes) commonly reaches hundreds or thousands, or even more.
SNP is the abbreviation of single nucleotide polymorphism, which mainly refers to DNA sequence polymorphism caused by variation of a single nucleotide at the genomic level. Current studies estimate that the human genome contains about 3,000,000 SNP sites. SNP data are the expression data of SNP sites in samples, and they are high-dimensional categorical data.
Because of the curse of dimensionality, many clustering methods that perform well on low-dimensional data often fail to achieve good clustering results in high-dimensional spaces. To address this, R. Agrawal first proposed the concept of subspace clustering to solve the clustering problem for high-dimensional data. However, the subspace clustering algorithm is applicable only to continuous data and is not suitable for categorical data.
Summary of the invention
To solve the problems of the prior art described above, the object of the present invention is to provide a clustering method for SNP data that extends the data types handled by the subspace clustering algorithm to categorical data, effectively solves the clustering problem of SNP data, and improves clustering accuracy. To achieve this object, the technical solution adopted by the invention comprises the following operation steps:
A. Pre-process the original SNP data, converting them into a data format that the clustering method can process;
B. Perform grid division on the pre-processed SNP data;
C. Calculate the density of the divided grids to obtain the subspaces that contain clusters;
D. Cluster the subspaces obtained in step C to obtain the classified SNP data;
E. Save the clustering result to a file.
In step A above, the original SNP data are pre-processed and converted into a data format that the clustering method can process as follows:
A1) Data encoding: in the data exported from SNP chip detection, each SNP site carries one genotyping result, and there are four possible results: wild-type homozygous AA, mutant heterozygous AB, mutant homozygous BB, and the genotyping-failure flag NC. AA is encoded as 0, AB as 1, and BB as 2;
A2) Data cleaning: if an entire row of data is NC, delete that entire row; if a row contains a few NC values, replace each of them with the data value of the next sample at the same site; if more than 10% of the values in a row are NC, delete the entire row.
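Steps A1) and A2) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes one row per SNP site and one column per sample, and the function name `preprocess`, the parameter `max_nc_fraction`, and the rule of scanning forward (with wrap-around) to the nearest non-NC call are assumptions, since the text does not say what to do when the next sample is also NC.

```python
NC = "NC"
CODES = {"AA": 0, "AB": 1, "BB": 2}


def preprocess(rows, max_nc_fraction=0.10):
    """Clean and encode genotype calls.

    rows: list of lists of calls ("AA", "AB", "BB" or "NC"),
          one row per SNP site and one column per sample (assumed layout).
    """
    cleaned = []
    for calls in rows:
        n = len(calls)
        nc_count = sum(1 for call in calls if call == NC)
        # Delete rows that are entirely NC or contain more than 10% NC calls.
        if nc_count == n or nc_count > max_nc_fraction * n:
            continue
        filled = list(calls)
        for i in range(n):
            if filled[i] == NC:
                # Assumption: use the nearest following non-NC call,
                # wrapping around to the first sample if necessary.
                for step in range(1, n):
                    candidate = calls[(i + step) % n]
                    if candidate != NC:
                        filled[i] = candidate
                        break
        # Encode AA -> 0, AB -> 1, BB -> 2.
        cleaned.append([CODES[call] for call in filled])
    return cleaned
```

The sketch merges encoding and cleaning into one pass for brevity; the text lists them as separate steps, and the order does not change the result here because NC values are resolved before encoding.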
In step B above, grid division of the pre-processed SNP data is performed as follows: according to the expression value of each SNP site in each sample, every dimension of the SNP data is divided into 3 grids. The spatial dimension is denoted by K, and at this point K=1;
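One way to picture the grid division is the sketch below. It assumes that each sample is one dimension, that the three grids of a dimension correspond to the coded values 0, 1 and 2, and that a one-dimensional cell is represented as a tuple holding a single `(sample, value)` pair. The name `divide_into_grids` and this cell representation are illustrative assumptions.

```python
def divide_into_grids(matrix, levels=(0, 1, 2)):
    """Build the one-dimensional grids (K = 1).

    matrix[site][sample] holds the coded value of a SNP site in a sample.
    Returns a dict mapping each cell ((sample, value),) to the set of
    SNP site indices that fall into it.
    """
    n_samples = len(matrix[0]) if matrix else 0
    # Three grids per dimension, created even when they end up empty.
    cells = {((s, v),): set() for s in range(n_samples) for v in levels}
    for site, values in enumerate(matrix):
        for sample, value in enumerate(values):
            cells[((sample, value),)].add(site)
    return cells
```

For a matrix with two samples this produces six cells, three per sample, some of which may be empty.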
The subspace in step C above is a set of dense cells, where a dense cell is a grid whose density is greater than the density threshold. In step C, the density of the divided grids is calculated and the subspaces containing clusters are obtained as follows:
C1) Calculate the density of all grids in the K-dimensional space to obtain the set of dense cells in the K-dimensional space;
C2) Generate the (K+1)-dimensional candidate dense cell set according to the generation rule for the K-dimensional candidate dense cell set;
C3) Determine whether the (K+1)-dimensional candidate dense cell set is empty. If it is not empty, set K=K+1 and return to step C2) to continue generating higher-dimensional subspaces; if it is empty, the K-dimensional subspace is the highest-dimensional subspace containing clusters, and the method proceeds to step D above;
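Step C1) can be sketched as a filter over candidate cells. The text does not define how density is measured, so the example below assumes that the density of a cell is simply the number of SNP sites it contains; the names `cell_members` and `dense_cells` and the representation of a cell as a tuple of `(sample, value)` pairs are likewise assumptions.

```python
def cell_members(matrix, cell):
    """Return the SNP sites whose coded value matches every (sample, value) pair."""
    return {
        site
        for site, values in enumerate(matrix)
        if all(values[sample] == value for sample, value in cell)
    }


def dense_cells(matrix, candidates, threshold):
    """Keep the candidate cells whose density exceeds the threshold.

    Density is taken to be the member count (an assumption).
    Returns a dict mapping each dense cell to its set of SNP sites.
    """
    dense = {}
    for cell in candidates:
        members = cell_members(matrix, cell)
        if len(members) > threshold:
            dense[cell] = members
    return dense
```

Note the strict comparison: a cell is dense only when its density is greater than the threshold, matching the definition in the text.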
The generation rule for the K-dimensional candidate dense cell set in step C2) above is as follows. Take any two dense cells from the set of (K-1)-dimensional dense cells. If and only if the two cells have identical expression data in the first k-2 samples and different expression data in the (k-1)-th sample, order them so that the first cell precedes the second, then join the first cell's expression data in the first k-2 samples, the first cell's expression data in the (k-1)-th sample, and the second cell's expression data in the (k-1)-th sample. The result of this join operation is a K-dimensional candidate dense cell, and the set of all such cells is the K-dimensional candidate dense cell set.
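The join rule can be illustrated with the sketch below. It follows one reading of the rule, in the style of the join used by grid-based subspace clustering: a cell is a tuple of `(sample, value)` pairs sorted by sample, two cells are joined when all but their last pair agree and their last pairs lie in different samples, and tuple comparison stands in for the lexicographic order. The function name and this reading are assumptions rather than a confirmed detail of the method.

```python
from itertools import combinations


def generate_candidates(dense):
    """Join (k-1)-dimensional dense cells into k-dimensional candidate cells.

    dense: iterable of cells, each a tuple of (sample, value) pairs
           sorted by sample index.
    """
    candidates = set()
    # sorted() followed by combinations() yields pairs with u1 < u2
    # in lexicographic (tuple) order.
    for u1, u2 in combinations(sorted(dense), 2):
        same_prefix = u1[:-1] == u2[:-1]
        # Assumption: the differing last entries must belong to different samples.
        different_sample = u1[-1][0] < u2[-1][0]
        if same_prefix and different_sample:
            candidates.add(u1 + (u2[-1],))
    return candidates
```

No pruning of candidates is performed in this sketch; every joined cell is kept and its density is checked afterwards.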
In step D above, the classified SNP data means that, after clustering, each class is a set of co-expressed SNP sites. The subspaces obtained in step C are clustered as follows:
D1) Map each subspace to a graph G=<V, E>, where G denotes the graph, V the vertices of the graph, and E the edges of the graph: a vertex represents the set of SNP sites within dense cell i of the subspace, and an edge indicates that dense cells i and j are coplanar;
D2) Perform a depth-first search on the graph built in step D1) to obtain its connected subgraphs. Each connected subgraph is one cluster, and each cluster is a set of co-expressed SNP sites.
Two dense cells i and j are coplanar when their expression data are identical in k-1 samples and differ in only one sample.
Because SNP data are biallelic polymorphisms, dense cells in the one-dimensional state are coplanar.
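Steps D1) and D2) can be sketched as building an adjacency list over dense cells and walking it with an iterative depth-first search. The sketch assumes that two cells can only be coplanar when they span the same samples, and it represents a cell as a tuple of `(sample, value)` pairs; the function names are illustrative.

```python
def coplanar(c1, c2):
    """True when two cells span the same samples and differ in exactly one value."""
    if [s for s, _ in c1] != [s for s, _ in c2]:
        return False
    return sum(1 for (_, v1), (_, v2) in zip(c1, c2) if v1 != v2) == 1


def cluster_dense_cells(dense):
    """Group dense cells into connected subgraphs.

    dense: dict mapping each dense cell to its set of SNP site indices.
    Returns a list of SNP site sets, one per connected subgraph (cluster).
    """
    cells = list(dense)
    # Vertices are dense cells; an edge joins two coplanar cells.
    adjacency = {c: [d for d in cells if d != c and coplanar(c, d)] for c in cells}
    visited, clusters = set(), []
    for start in cells:
        if start in visited:
            continue
        visited.add(start)
        stack, sites = [start], set()
        while stack:  # iterative depth-first search
            cell = stack.pop()
            sites |= dense[cell]
            for neighbour in adjacency[cell]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sites)
    return clusters
```

Under this definition, any two distinct one-dimensional cells of the same sample are coplanar, which agrees with the remark about the one-dimensional state.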
The clustering result in step E above comprises the total number of clusters, the number of clusters in the subspace of each dimension, and the set of SNP sites in each cluster.
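A result file holding these three pieces of information could be written as in the sketch below. The text does not specify a file format, so the JSON layout, the field names, and the function name `save_clusters` are all assumptions made for illustration.

```python
import json


def save_clusters(clusters_by_dimension, path):
    """Write the clustering result to a JSON file (format is an assumption).

    clusters_by_dimension: dict mapping the subspace dimension K to a list
                           of clusters, each a set of SNP site identifiers.
    """
    report = {
        # Total number of clusters over all subspaces.
        "total_clusters": sum(len(c) for c in clusters_by_dimension.values()),
        # Cluster count and SNP site sets for the subspace of each dimension.
        "subspaces": [
            {
                "dimension": k,
                "cluster_count": len(clusters),
                "clusters": [sorted(sites) for sites in clusters],
            }
            for k, clusters in sorted(clusters_by_dimension.items())
        ],
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
    return report
```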
Compared with the prior art, the clustering method for SNP data of the present invention has the following evident features and significant advantages:
(1) The invention extends the data types handled by the subspace clustering algorithm to categorical data, effectively solving the clustering problem of SNP data.
(2) Unlike the conventional subspace clustering algorithm, the invention performs no pruning operation. Although this lowers the processing efficiency of the algorithm, it ensures that useful information is not pruned away, which improves clustering accuracy and clustering quality.
(3) Unlike the conventional subspace clustering algorithm, the invention stores the dense cell sets in a List<Set<Integer>> structure, which makes insertions into and deletions from the dense cell sets convenient and efficient and greatly improves memory utilization. This greatly improves the execution efficiency of the algorithm and compensates for the delay caused by omitting the pruning operation.
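As a rough analogue of the List<Set<Integer>> structure, the sketch below keeps each dense cell as a set of integer SNP site identifiers inside a list. Whether each inner set holds site identifiers is an assumption, as are the helper names; the point is only that appending, filtering and intersecting such sets are short operations.

```python
from typing import List, Set


def add_cell(store: List[Set[int]], sites: Set[int]) -> int:
    """Append a dense cell's SNP site ids and return its position in the list."""
    store.append(set(sites))
    return len(store) - 1


def drop_sparse(store: List[Set[int]], threshold: int) -> List[Set[int]]:
    """Return only the cells whose member count exceeds the threshold."""
    return [sites for sites in store if len(sites) > threshold]


def joined_members(store: List[Set[int]], i: int, j: int) -> Set[int]:
    """Sites of a cell joined from cells i and j: the intersection of their members."""
    return store[i] & store[j]
```

With this layout, the members of a higher-dimensional candidate cell can be obtained by intersecting the member sets of the two cells it was joined from, without rescanning the data.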
Description of drawings
Fig. 1 is a flowchart of the clustering method for SNP data of the present invention;
Fig. 2 is a flowchart of pre-processing the original SNP data in the present invention;
Fig. 3 is a flowchart of calculating the density of the divided grids and obtaining the subspaces containing clusters in the present invention;
Fig. 4 is a flowchart of clustering the obtained subspaces in the present invention.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Embodiment one:
Referring to Fig. 1, this clustering method for SNP data is characterized by the following steps:
A. Pre-process the original SNP data, converting them into a data format that the clustering method can process;
B. Perform grid division on the pre-processed SNP data;
C. Calculate the density of the divided grids to obtain the subspaces that contain clusters;
D. Cluster the subspaces obtained in step C to obtain the classified SNP data;
E. Save the clustering result to a file.
Embodiment two:
This embodiment is essentially the same as Embodiment one, with the following particular features:
Referring to Fig. 2 to Fig. 4, in step A the original SNP data are pre-processed and converted into a data format that the clustering method can process as follows:
A1) Data encoding: in the data exported from SNP chip detection, each SNP site carries one genotyping result, and there are four possible results: wild-type homozygous AA, mutant heterozygous AB, mutant homozygous BB, and the genotyping-failure flag NC. AA is encoded as 0, AB as 1, and BB as 2;
A2) Data cleaning: if an entire row of data is NC, delete that entire row; if a row contains a few NC values, replace each of them with the data value of the next sample at the same site; if more than 10% of the values in a row are NC, delete the entire row.
In step B, grid division of the pre-processed SNP data is performed as follows: according to the expression value of each SNP site in each sample, every dimension of the SNP data is divided into 3 grids. The spatial dimension is denoted by K, and at this point K=1.
The subspace in step C is a set of dense cells, where a dense cell is a grid whose density is greater than the density threshold. The density of the divided grids is calculated and the subspaces containing clusters are obtained as follows:
C1) Calculate the density of all grids in the K-dimensional space to obtain the set of dense cells in the K-dimensional space;
C2) Generate the (K+1)-dimensional candidate dense cell set according to the generation rule for the K-dimensional candidate dense cell set;
C3) Determine whether the (K+1)-dimensional candidate dense cell set is empty. If it is not empty, set K=K+1 and return to step C2) to continue generating higher-dimensional subspaces; if it is empty, the K-dimensional subspace is the highest-dimensional subspace containing clusters, and the method proceeds to step D.
The generation rule for the K-dimensional candidate dense cell set in step C2) is as follows. Take any two dense cells from the set of (K-1)-dimensional dense cells. If and only if the two cells have identical expression data in the first k-2 samples and different expression data in the (k-1)-th sample, order them so that the first cell precedes the second, then join the first cell's expression data in the first k-2 samples, the first cell's expression data in the (k-1)-th sample, and the second cell's expression data in the (k-1)-th sample. The result of this join operation is a K-dimensional candidate dense cell, and the set of all such cells is the K-dimensional candidate dense cell set.
In step D, the classified SNP data means that, after clustering, each class is a set of co-expressed SNP sites. The subspaces obtained in step C are clustered as follows:
D1) Map each subspace to a graph G=<V, E>, where G denotes the graph, V the vertices of the graph, and E the edges of the graph: a vertex represents the set of SNP sites within dense cell i of the subspace, and an edge indicates that dense cells i and j are coplanar;
D2) Perform a depth-first search on the graph built in step D1) to obtain its connected subgraphs. Each connected subgraph is one cluster, and each cluster is a set of co-expressed SNP sites.
In step D1), dense cells i and j being coplanar means the following:
The two dense cells i and j have identical expression data in k-1 samples and differ in only one sample;
Because SNP data are biallelic polymorphisms, dense cells in the one-dimensional state are coplanar.
The clustering result in step E comprises the total number of clusters, the number of clusters in the subspace of each dimension, and the set of SNP sites in each cluster.
Embodiment three:
Referring to Fig. 1 to Fig. 4, the clustering method for SNP data of the present invention is illustrated with the clustering of SNP data from hypertension patients. Its specific steps are as follows:
(1) Pre-process the original SNP data, converting them into a data format that the clustering method can process. As shown in Fig. 2, the specific steps are as follows:
a) Data encoding: in the data exported from SNP chip detection, each SNP site carries one genotyping result, and there are four possible results: wild-type homozygous AA, mutant heterozygous AB, mutant homozygous BB, and the genotyping-failure flag NC. AA is encoded as 0, AB as 1, and BB as 2;
b) Data cleaning: if an entire row of data is NC, delete that entire row; if a row contains a few NC values, replace each of them with the data value of the next sample at the same site; if more than 10% of the values in a row are NC, delete the entire row.
(2) Perform grid division on the pre-processed SNP data.
Specifically, according to the expression value of each SNP site in each sample, every dimension of the SNP data is divided into 3 grids. The spatial dimension is denoted by K, and at this point K=1;
(3) Calculate the density of the divided grids to obtain the subspaces that contain clusters.
Here a subspace is a set of dense cells, and a dense cell is a grid whose density is greater than the density threshold. As shown in Fig. 3, the specific steps are as follows:
a) Calculate the density of all grids in the K-dimensional space to obtain the set of dense cells in the K-dimensional space;
b) Generate the (K+1)-dimensional candidate dense cell set according to the generation rule for the K-dimensional candidate dense cell set;
The generation rule for the K-dimensional candidate dense cell set is as follows. Take any two dense cells from the set of (K-1)-dimensional dense cells. If and only if the two cells have identical expression data in the first k-2 samples and different expression data in the (k-1)-th sample, order them so that the first cell precedes the second, then join the first cell's expression data in the first k-2 samples, the first cell's expression data in the (k-1)-th sample, and the second cell's expression data in the (k-1)-th sample. The result of this join operation is a K-dimensional candidate dense cell, and the set of all such cells is the K-dimensional candidate dense cell set.
c) Determine whether the (K+1)-dimensional candidate dense cell set is empty. If it is not empty, set K=K+1 and return to step (b) to continue generating higher-dimensional subspaces; if it is empty, the K-dimensional subspace is the highest-dimensional subspace containing clusters, and the method proceeds to step (4);
(4) Cluster the subspaces obtained in step (3) to obtain the classified SNP data.
Here the classified SNP data means that, after clustering, each class is a set of co-expressed SNP sites. As shown in Fig. 4, the specific steps are as follows:
a) Map each subspace to a graph G=<V, E>, where G denotes the graph, V the vertices of the graph, and E the edges of the graph: a vertex represents the set of SNP sites within dense cell i of the subspace, and an edge indicates that dense cells i and j are coplanar;
Two dense cells i and j are coplanar when their expression data are identical in k-1 samples and differ in only one sample.
Because SNP data are biallelic polymorphisms, dense cells in the one-dimensional state are coplanar.
b) Perform a depth-first search on the graph built in step (a) to obtain its connected subgraphs. Each connected subgraph is one cluster, and each cluster is a set of co-expressed SNP sites.
(5) Save the clustering result to a file.
Here the clustering result comprises the total number of clusters, the number of clusters in the subspace of each dimension, and the set of SNP sites in each cluster.
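Steps (2) to (4) can be strung together on already encoded data roughly as follows. This is a compact sketch under several assumptions that the text does not confirm: rows are SNP sites and columns are samples, the density of a cell is its member count, cells are tuples of `(sample, value)` pairs, the join only combines cells whose last pairs lie in different samples, and every dimension's subspace is clustered. All function names are hypothetical.

```python
from itertools import combinations


def members(matrix, cell):
    """SNP sites whose coded values match every (sample, value) pair of the cell."""
    return {i for i, row in enumerate(matrix) if all(row[s] == v for s, v in cell)}


def join(dense):
    """Generate the next-dimension candidate cells; no pruning is applied."""
    out = set()
    for u1, u2 in combinations(sorted(dense), 2):
        if u1[:-1] == u2[:-1] and u1[-1][0] < u2[-1][0]:
            out.add(u1 + (u2[-1],))
    return out


def coplanar(c1, c2):
    """Same samples, exactly one differing value."""
    same_samples = [s for s, _ in c1] == [s for s, _ in c2]
    return same_samples and sum(a != b for a, b in zip(c1, c2)) == 1


def components(dense):
    """Depth-first search over coplanar dense cells; one site set per component."""
    seen, clusters = set(), []
    for start in dense:
        if start in seen:
            continue
        seen.add(start)
        stack, sites = [start], set()
        while stack:
            cell = stack.pop()
            sites |= dense[cell]
            for other in dense:
                if other not in seen and coplanar(cell, other):
                    seen.add(other)
                    stack.append(other)
        clusters.append(sites)
    return clusters


def subspace_clusters(matrix, threshold):
    """Return a dict mapping each dimension K to the clusters found in it."""
    n_samples = len(matrix[0])
    candidates = {((s, v),) for s in range(n_samples) for v in (0, 1, 2)}
    result, k = {}, 1
    while candidates:
        dense = {c: members(matrix, c) for c in candidates}
        dense = {c: m for c, m in dense.items() if len(m) > threshold}
        if not dense:
            break
        result[k] = components(dense)
        candidates = join(dense)  # empty set ends the loop
        k += 1
    return result
```

On a toy matrix of five sites and three samples, `[[0,1,2],[0,1,2],[0,1,0],[2,0,1],[2,0,1]]`, with `threshold=1`, the loop stops after the three-dimensional subspace, where sites 0 and 1 form one cluster and sites 3 and 4 form another.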
Experimental results show that the invention extends the data types handled by the subspace clustering algorithm to categorical data, effectively solves the clustering problem of SNP data, and improves clustering accuracy and clustering quality.
The clustering method for SNP data of the present invention has been described in detail above. The description is intended only to help in understanding the method and its core idea. A person of ordinary skill in the art may, following the method and idea of the invention, make changes to the specific embodiments and the scope of application. In summary, this description should not be construed as limiting the present invention.
Claims (5)
1. A clustering method for SNP data, characterized in that the specific operation steps are as follows:
A. Pre-process the original SNP data, converting them into a data format that the clustering method can process, specifically:
A1) Data encoding: in the data exported from SNP chip detection, each SNP site carries one genotyping result, and there are four possible results: wild-type homozygous AA, mutant heterozygous AB, mutant homozygous BB, and the genotyping-failure flag NC. AA is encoded as 0, AB as 1, and BB as 2;
A2) Data cleaning: if an entire row of data is NC, delete that entire row; if a row contains a few NC values, replace each of them with the data value of the next sample at the same site; if more than 10% of the values in a row are NC, delete the entire row;
B. Perform grid division on the pre-processed SNP data;
C. Calculate the density of the divided grids to obtain the subspaces that contain clusters;
D. Cluster the subspaces obtained in step C to obtain the classified SNP data;
E. Save the clustering result to a file.
2. The clustering method for SNP data according to claim 1, characterized in that in step B the grid division of the pre-processed SNP data is performed as follows: according to the expression value of each SNP site in each sample, every dimension of the SNP data is divided into 3 grids; the spatial dimension is denoted by K, and at this point K=1.
3. The clustering method for SNP data according to claim 1, characterized in that the subspace in step C is a set of dense cells, where a dense cell is a grid whose density is greater than the density threshold; the density of the divided grids is calculated and the subspaces containing clusters are obtained as follows:
C1) Calculate the density of all grids in the K-dimensional space to obtain the set of dense cells in the K-dimensional space;
C2) Generate the (K+1)-dimensional candidate dense cell set according to the generation rule for the K-dimensional candidate dense cell set, specifically:
The input is the set of all (K-1)-dimensional dense cells;
Candidates are generated as follows: take any two dense cells from the input set. If and only if the two cells have identical expression data in the first k-2 samples and different expression data in the (k-1)-th sample, order them so that the first cell precedes the second, then join the first cell's expression data in the first k-2 samples, the first cell's expression data in the (k-1)-th sample, and the second cell's expression data in the (k-1)-th sample. The result of this join operation is a K-dimensional candidate dense cell, and the set of all such cells is the K-dimensional candidate dense cell set;
Here each dimension is represented by a vector, and "<" denotes lexicographic order;
C3) Determine whether the (K+1)-dimensional candidate dense cell set is empty. If it is not empty, set K=K+1 and return to step C2) to continue generating higher-dimensional subspaces; if it is empty, the K-dimensional subspace is the highest-dimensional subspace containing clusters, and the method proceeds to step D.
4. The clustering method for SNP data according to claim 1, characterized in that in step D the classified SNP data means that, after clustering, each class is a set of co-expressed SNP sites, and the subspaces obtained in step C are clustered as follows:
D1) Map each subspace to a graph G=<V, E>, where G denotes the graph, V the vertices of the graph, and E the edges of the graph: a vertex represents the set of SNP sites within dense cell i of the subspace, and an edge indicates that dense cells i and j are coplanar. Dense cells i and j being coplanar means that the two cells have identical expression data in k-1 samples and differ in only one sample; because SNP data are biallelic polymorphisms, dense cells in the one-dimensional state are coplanar;
D2) Perform a depth-first search on the graph built in step D1) to obtain its connected subgraphs. Each connected subgraph is one cluster, and each cluster is a set of co-expressed SNP sites.
5. The clustering method for SNP data according to claim 1, characterized in that the clustering result in step E comprises the total number of clusters, the number of clusters in the subspace of each dimension, and the set of SNP sites in each cluster.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110418812 CN102419774B (en) | 2011-12-15 | 2011-12-15 | Method for clustering single nucleotide polymorphism (SNP) data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102419774A | 2012-04-18 |
CN102419774B | 2013-04-03 |
Family
ID=45944187
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201110418812 Expired - Fee Related CN102419774B (en) | 2011-12-15 | 2011-12-15 | Method for clustering single nucleotide polymorphism (SNP) data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102419774B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106339416B (en) * | 2016-08-15 | 2019-11-08 | 常熟理工学院 | Educational data clustering method based on grid fast searching density peaks |
CN106909942B (en) * | 2017-02-28 | 2022-09-13 | 北京邮电大学 | Subspace clustering method and device for high-dimensionality big data |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101211355A (en) * | 2006-12-30 | 2008-07-02 | 中国科学院计算技术研究所 | Image inquiry method based on clustering |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI396106B (en) * | 2009-08-17 | 2013-05-11 | Univ Nat Pingtung Sci & Tech | Grid-based data clustering method |
Non-Patent Citations (3)
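Title |
---|
Wang Jun (王峻). Research and Application of Single Nucleotide Polymorphism Analysis Algorithms. Doctoral dissertation, Harbin Institute of Technology. 2011-04-30, pp. 50-53. * |
Hu Yang (胡泱), Chen Gang (陈刚). An Effective Clustering Analysis Algorithm Based on Grid and Density. Computer Applications (计算机应用). 2003, vol. 23, no. 12, pp. 64-67. * |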
Also Published As
Publication number | Publication date |
---|---|
CN102419774A (en) | 2012-04-18 |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20130403; Termination date: 20151215 |
EXPY | Termination of patent right or utility model | |