CN110245717A - Gene expression profile clustering method based on machine learning - Google Patents
- Publication number
- CN110245717A (application number CN201910539449.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- clustering
- algorithm
- cluster
- rate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
Abstract
The invention belongs to the field of machine learning. It relates to clustering in machine learning, in particular to a machine-learning-based method for clustering gene expression profiles, and is an application of machine learning to biological big-data analysis. Several single clustering algorithms are combined into one hybrid clustering method. This addresses the poor clustering that arises in traditional single-algorithm clustering when the chosen algorithm does not suit the data set: by comparing the clustering results of the individual algorithms, the hybrid method obtains the best of those results, thereby addressing the problem of the optimal clustering solution. The stated advantages are a wide scope of application, covering data sets of all kinds, and high portability, since any single clustering algorithm can be used within the method.
Description
Technical field:
The invention belongs to the field of machine learning. It relates to clustering in machine learning, in particular to a machine-learning-based method for clustering gene expression profiles, and is an application of machine learning to biological big-data analysis.
Background art:
Because the human gene expression profile covers all the genes of the human genome, gene expression profile data are inherently large in volume, high-dimensional, and complex. Analysing gene expression profile data by clustering is of great significance for research into human gene expression, various genetic diseases, and diseases caused by cellular lesions.
In general, cluster analysis uses the distance between samples as its main criterion, with the aim of dividing the sample data set into different classes (or clusters). During clustering, data samples assigned to the same class (or cluster) should be as similar as possible, while data samples in different classes (or clusters) should be as different as possible. A clustering result with the best clustering effect should have maximal similarity between data samples of the same class (or cluster) and maximal difference between data samples of different classes (or clusters).
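One common way to make this criterion concrete is the within-cluster sum of squares used by k-means. The formula below is given only as an illustration and is not stated in the patent; the symbols k (number of clusters), C_j (the j-th cluster), and \mu_j (its centroid) are assumptions introduced here.

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x \in C_j} x
```

Minimising this quantity makes samples within a cluster close to one another; for a fixed data set it is equivalent to maximising the squared distances between cluster centroids and the overall mean, which corresponds to the "different classes as different as possible" half of the criterion.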
The usual pattern for clustering gene expression profiles is to apply a single clustering method to the gene expression profile data and then refine that method to make the clustering result more satisfactory. This traditional approach has several problems. First, because of the characteristics of gene expression profile data, not every clustering method suits them, so choosing one clustering method for the analysis involves a human factor and lacks a scientific basis. Second, analysing the data with only one clustering method leaves the result highly uncertain: a single clustering method may fail to compute the optimal clustering result, which can render the clustering result invalid.
Summary of the invention:
The object of the invention is to remedy the poor clustering effect caused by the characteristics of the data themselves and by reliance on a single clustering method. To that end, the invention proposes a clustering method that addresses both problems at once: a hybrid clustering method composed of multiple clustering algorithms. This method does not require manually choosing the algorithm that suits the data; by comparing the clustering results of the several algorithms, it selects the algorithm with the best clustering effect, which also addresses the problem of the optimal clustering solution.
The technical solution of the invention is as follows:
A gene expression profile clustering method based on machine learning, comprising the following steps:
Step 1: preprocess the raw data, comprising the following steps:
(1) attach a class label to the gene expression profile data belonging to the same cell line, labelling the classes {t_0, t_1, …, t_i, …, t_n}, where t_i denotes that data of the same category are marked as the class named t_i, so that the data of different cell lines can be distinguished;
(2) thoroughly mix the gene expression profile data of the different cell lines, shuffling the data of each cell line so that data from the same cell line are sufficiently dispersed and data from different cell lines are sufficiently intermingled; this allows training-set and test-set data to be drawn that reflect the data distribution of every cell line;
(3) divide the thoroughly mixed data into training-group data and test-group data in three ratios by data volume: 30% and 70%, 50% and 50%, and 70% and 30%;
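The preprocessing in Step 1 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the applicant's implementation; the function names `label_and_mix` and `split`, the constant `TRAIN_FRACTIONS`, and the use of a fixed random seed are all assumptions introduced here.

```python
import random

# Training-group fractions taken from the three ratios in Step 1 (3):
# 30%/70%, 50%/50% and 70%/30% (training/test).
TRAIN_FRACTIONS = (0.3, 0.5, 0.7)


def label_and_mix(cell_lines, seed=0):
    """Attach class labels and shuffle profiles from all cell lines together.

    cell_lines maps a class label (e.g. "t0") to a list of expression
    profiles, each profile being a list of numbers. Returns a shuffled
    list of (profile, label) pairs.
    """
    samples = [
        (profile, label)
        for label, profiles in cell_lines.items()
        for profile in profiles
    ]
    # A seeded generator keeps the shuffle reproducible (an assumption;
    # the patent does not say how the mixing is randomised).
    random.Random(seed).shuffle(samples)
    return samples


def split(samples, train_fraction):
    """Cut the mixed samples into (training group, test group)."""
    cut = int(round(len(samples) * train_fraction))
    return samples[:cut], samples[cut:]
```

Calling `split(samples, f)` once for each `f` in `TRAIN_FRACTIONS` yields the three training/test partitions. A plain shuffle does not guarantee that every cell line appears in each partition in exact proportion; a stratified split would be one way to enforce that if it matters.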
Step 2: train the clustering algorithms, comprising the following steps:
(1) form a hybrid clustering method from five clustering algorithms, namely KMeans, MiniBatchKMeans, hierarchical clustering, GMM (Gaussian mixture model), and Birch, which avoids the human factor in choosing a single clustering algorithm and allows the clustering result with the highest accuracy to be selected;
(2) input the training-group data into the hybrid clustering method and train it; either these five single clustering algorithms can be used, or other clustering algorithms can be chosen according to need and data type, so that the method accommodates any single clustering algorithm;
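The pluggable structure described in Step 2 can be sketched as a registry of models that share a `fit`/`predict` interface. The patent does not name a library or an interface, so everything below is an assumption: `HybridClusterer` and `ToyKMeans` are hypothetical names, and `ToyKMeans` is a deliberately small stand-in for the five named algorithms, not a replacement for a library implementation.

```python
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyKMeans:
    """Minimal Lloyd-style k-means used only to exercise the registry."""

    def __init__(self, n_clusters, n_iter=20, seed=0):
        self.n_clusters = n_clusters
        self.n_iter = n_iter
        self.seed = seed
        self.centers = []

    def _nearest(self, point):
        # Index of the closest center; ties go to the lowest index.
        return min(range(len(self.centers)),
                   key=lambda i: _sq_dist(point, self.centers[i]))

    def fit(self, points):
        rng = random.Random(self.seed)
        self.centers = [list(p) for p in rng.sample(list(points), self.n_clusters)]
        for _ in range(self.n_iter):
            groups = [[] for _ in self.centers]
            for p in points:
                groups[self._nearest(p)].append(p)
            for i, group in enumerate(groups):
                if group:  # keep the old center if a cluster went empty
                    self.centers[i] = [sum(c) / len(group) for c in zip(*group)]
        return self

    def predict(self, points):
        return [self._nearest(p) for p in points]


class HybridClusterer:
    """Holds any number of named models and trains them on the same data."""

    def __init__(self):
        self.algorithms = {}

    def register(self, name, model):
        self.algorithms[name] = model

    def fit(self, points):
        for model in self.algorithms.values():
            model.fit(points)
        return self

    def predict(self, points):
        # One list of cluster ids per registered algorithm.
        return {name: m.predict(points) for name, m in self.algorithms.items()}
```

Any object with `fit` and `predict` can be registered, which is what makes the approach independent of the particular algorithms. One practical caveat: some hierarchical clustering implementations only label the data they were fitted on, so a wrapper (for example, assigning test samples to the nearest training-cluster centroid) may be needed before such a model fits this interface.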
Step 3: cluster the sample data using the test data, comprising the following steps:
(1) after cluster training on the training-group data is complete, input the test-group data into the trained clustering algorithms to perform predictive analysis on the test-group data;
(2) compare the class label predicted for each test-group sample with the sample's own class label, and from this calculate each clustering algorithm's prediction accuracy on the class labels of the test-set samples;
(3) after the prediction accuracy of every clustering algorithm has been calculated, select the algorithm with the highest prediction accuracy and output its prediction accuracy; if several clustering algorithms share the same highest accuracy, output the names and prediction accuracies of all of them;
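Clustering algorithms emit arbitrary cluster ids rather than class labels, and the patent does not say how the two are matched before accuracy is computed. The sketch below assumes one plausible approach: each cluster is assigned the most common class label among its training-group samples, and test-group predictions are scored against that mapping. The function names `cluster_to_label_map` and `prediction_accuracy` are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_to_label_map(train_clusters, train_labels):
    """Map each cluster id to the majority class label of its training samples."""
    votes = defaultdict(Counter)
    for cluster, label in zip(train_clusters, train_labels):
        votes[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0]
            for cluster, counts in votes.items()}


def prediction_accuracy(test_clusters, test_labels, mapping):
    """Fraction of test samples whose mapped cluster label equals the true label."""
    # A cluster id never seen in training maps to None and counts as wrong.
    predicted = [mapping.get(cluster) for cluster in test_clusters]
    correct = sum(p == t for p, t in zip(predicted, test_labels))
    return correct / len(test_labels)
```

Other matching schemes exist (for example, an optimal one-to-one assignment between clusters and classes); majority voting is used here only because it is short and allows several clusters to map to the same class.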
Step 4: display the best clustering prediction result graphically
The clustering result with the best prediction result and the highest prediction accuracy is displayed graphically, with the clustering algorithm and its prediction accuracy marked on the plot of the clustering result; if several clustering algorithms share the same highest accuracy, the clustering results of all of them are displayed, and the name of the clustering algorithm and its prediction accuracy are marked on the plot of each clustering result, so that the final clustering result can be easily inspected.
Brief description of the drawings:
Fig. 1 is a flow chart of the processing of gene expression profile data in the invention;
Fig. 2 is a model diagram of the novel clustering method of the invention.
Detailed description of the embodiments:
The invention is described in further detail below with reference to the drawings and specific embodiments:
The object of the invention is to remedy the poor clustering effect that arises in traditional clustering from the characteristics of the data themselves and from reliance on a single clustering algorithm.
As shown in Fig. 1 and Fig. 2, the gene expression profile clustering method based on machine learning provided by the invention comprises the following steps:
Step 1: preprocess the raw data, comprising the following steps:
(1) attach a class label to the gene expression profile data belonging to the same cell line, labelling the classes {t_0, t_1, …, t_i, …, t_n}, where t_i denotes that data of the same category are marked as the class named t_i, so that the data of different cell lines can be distinguished;
(2) thoroughly mix the gene expression profile data of the different cell lines, shuffling the data of each cell line so that data from the same cell line are sufficiently dispersed and data from different cell lines are sufficiently intermingled; this allows training-set and test-set data to be drawn that reflect the data distribution of every cell line;
(3) divide the thoroughly mixed data into training-group data and test-group data in three ratios by data volume: 30% and 70%, 50% and 50%, and 70% and 30%;
Step 2: train the clustering algorithms, comprising the following steps:
(1) form a hybrid clustering method from five clustering algorithms, namely KMeans, MiniBatchKMeans, hierarchical clustering, GMM (Gaussian mixture model), and Birch, which avoids the human factor in choosing a single clustering algorithm and allows the clustering result with the highest accuracy to be selected;
(2) input the training-group data into the hybrid clustering method and train it; either these five single clustering algorithms can be used, or other clustering algorithms can be chosen according to need and data type, so that the method accommodates any single clustering algorithm;
Step 3: cluster the sample data using the test data, comprising the following steps:
(1) after cluster training on the training-group data is complete, input the test-group data into the trained clustering algorithms to perform predictive analysis on the test-group data;
(2) compare the class label predicted for each test-group sample with the sample's own class label, and from this calculate each clustering algorithm's prediction accuracy on the class labels of the test-set samples;
(3) after the prediction accuracy of every clustering algorithm has been calculated, select the algorithm with the highest prediction accuracy and output its prediction accuracy; if several clustering algorithms share the same highest accuracy, output the names and prediction accuracies of all of them;
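The selection rule in Step 3 (3), including the tie case, can be sketched in a few lines. The function name `best_algorithms` and the tolerance parameter `tol` are assumptions introduced here; the patent only says that algorithms with the same highest accuracy are all reported.

```python
import math


def best_algorithms(accuracies, tol=1e-12):
    """Return every (name, accuracy) pair that ties for the highest accuracy.

    accuracies maps an algorithm name to its prediction accuracy.
    Accuracies are compared with a small tolerance so that values that
    differ only by floating-point rounding still count as a tie.
    """
    top = max(accuracies.values())
    return [(name, acc) for name, acc in accuracies.items()
            if math.isclose(acc, top, abs_tol=tol)]
```

With made-up accuracies such as `{"KMeans": 0.9, "GMM": 0.9, "Birch": 0.8}`, the function returns both `KMeans` and `GMM`, which are then the results to be plotted in Step 4.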
Step 4: display the best clustering prediction result graphically
The clustering result with the best prediction result and the highest prediction accuracy is displayed graphically, with the clustering algorithm and its prediction accuracy marked on the plot of the clustering result; if several clustering algorithms share the same highest accuracy, the clustering results of all of them are displayed, and the name of the clustering algorithm and its prediction accuracy are marked on the plot of each clustering result, so that the final clustering result can be easily inspected.
Claims (1)
1. A gene expression profile clustering method based on machine learning, characterized in that it comprises the following steps:
Step 1: preprocess the raw data, comprising the following steps:
(1) attach a class label to the gene expression profile data belonging to the same cell line, labelling the classes {t_0, t_1, …, t_i, …, t_n}, where t_i denotes that data of the same category are marked as the class named t_i, so that the data of different cell lines can be distinguished;
(2) thoroughly mix the gene expression profile data of the different cell lines, shuffling the data of each cell line so that data from the same cell line are sufficiently dispersed and data from different cell lines are sufficiently intermingled; this allows training-set and test-set data to be drawn that reflect the data distribution of every cell line;
(3) divide the thoroughly mixed data into training-group data and test-group data in three ratios by data volume: 30% and 70%, 50% and 50%, and 70% and 30%;
Step 2: train the clustering algorithms, comprising the following steps:
(1) form a hybrid clustering method from five clustering algorithms, namely KMeans, MiniBatchKMeans, hierarchical clustering, GMM (Gaussian mixture model), and Birch, which avoids the human factor in choosing a single clustering algorithm and allows the clustering result with the highest accuracy to be selected;
(2) input the training-group data into the hybrid clustering method and train it; either these five single clustering algorithms can be used, or other clustering algorithms can be chosen according to need and data type, so that the method accommodates any single clustering algorithm;
Step 3: cluster the sample data using the test data, comprising the following steps:
(1) after cluster training on the training-group data is complete, input the test-group data into the trained clustering algorithms to perform predictive analysis on the test-group data;
(2) compare the class label predicted for each test-group sample with the sample's own class label, and from this calculate each clustering algorithm's prediction accuracy on the class labels of the test-set samples;
(3) after the prediction accuracy of every clustering algorithm has been calculated, select the algorithm with the highest prediction accuracy and output its prediction accuracy; if several clustering algorithms share the same highest accuracy, output the names and prediction accuracies of all of them;
Step 4: display the best clustering prediction result graphically
The clustering result with the best prediction result and the highest prediction accuracy is displayed graphically, with the clustering algorithm and its prediction accuracy marked on the plot of the clustering result; if several clustering algorithms share the same highest accuracy, the clustering results of all of them are displayed, and the name of the clustering algorithm and its prediction accuracy are marked on the plot of each clustering result, so that the final clustering result can be easily inspected.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910539449.8A CN110245717A (en) | 2019-06-20 | 2019-06-20 | Gene expression profile clustering method based on machine learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110245717A (en) | 2019-09-17 |
Family
ID=67888428
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910539449.8A Pending CN110245717A (en) | 2019-06-20 | 2019-06-20 | A kind of gene expression Spectral Clustering based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110245717A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160070950A1 (en) * | 2014-09-10 | 2016-03-10 | Agency For Science, Technology And Research | Method and system for automatically assigning class labels to objects |
CN105468638A (en) * | 2014-09-09 | 2016-04-06 | 中国银联股份有限公司 | Data classification method, system and classifier implementation method |
CN105745659A (en) * | 2013-09-16 | 2016-07-06 | 佰欧迪塞克斯公司 | Classifier generation method using combination of mini-classifiers with regularization and uses thereof |
CN108874959A (en) * | 2018-06-06 | 2018-11-23 | 电子科技大学 | A kind of user's dynamic interest model method for building up based on big data technology |
CN109887272A (en) * | 2018-12-26 | 2019-06-14 | 阿里巴巴集团控股有限公司 | A kind of prediction technique and device of traffic flow of the people |
Non-Patent Citations (2)
Title |
---|
彭绍亮 等: "生物效应大数据评估聚类算法的并行化", 《大数据》 * |
路东方 等: "生物大数据中的聚类方法分析", 《上海大学学报(自然科学版)》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112735536A (en) * | 2020-12-23 | 2021-04-30 | 湖南大学 | Single cell integrated clustering method based on subspace randomization |
CN113010615A (en) * | 2021-04-12 | 2021-06-22 | 安徽农业大学 | Hierarchical data visualization method based on Gaussian mixture model clustering algorithm |
CN113010615B (en) * | 2021-04-12 | 2021-10-01 | 安徽农业大学 | Hierarchical data visualization method based on Gaussian mixture model clustering algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106202891B (en) | A kind of big data method for digging towards Evaluation of Medical Quality | |
CN107133960A (en) | Image crack dividing method based on depth convolutional neural networks | |
Sokal et al. | The two taxonomies: areas of agreement and conflict | |
CN107278877B (en) | A kind of full-length genome selection and use method of corn seed-producing rate | |
CN107391963A (en) | Eucaryon based on calculating cloud platform is without ginseng transcript profile interaction analysis system and method | |
CN110245717A (en) | A kind of gene expression Spectral Clustering based on machine learning | |
CN109492774A (en) | A kind of cloud resource dispatching method based on deep learning | |
CN105373606A (en) | Unbalanced data sampling method in improved C4.5 decision tree algorithm | |
CN106919951A (en) | A kind of Weakly supervised bilinearity deep learning method merged with vision based on click | |
CN109528197A (en) | The individuation prediction technique and system of across the Species migration carry out mental disease of monkey-people based on brain function map | |
CN109919177A (en) | Feature selection approach based on stratification depth network | |
CN108875812A (en) | A kind of driving behavior classification method based on branch's convolutional neural networks | |
CN108898225A (en) | Data mask method based on man-machine coordination study | |
CN109214437A (en) | A kind of IVF-ET early pregnancy embryonic development forecasting system based on machine learning | |
CN107368707A (en) | Gene chip expression data analysis system and method based on US ELM | |
CN110161194A (en) | It is a kind of based on odiferous information BP fuzzy neuron identification the recognition methods of fruit freshness, apparatus and system | |
CN110442954A (en) | The super high strength stainless steel design method of lower machine learning is instructed based on physical metallurgy | |
CN109948548B (en) | Lipstick recommendation method and system based on color matching of machine learning | |
Liu et al. | Improved K-means algorithm based on hybrid rice optimization algorithm | |
CN111445991A (en) | Method for clinical immune monitoring based on cell transcriptome data | |
CN110288041A (en) | Chinese herbal medicine classification model construction method and system based on deep learning | |
CN109448842B (en) | The determination method, apparatus and electronic equipment of human body intestinal canal Dysbiosis | |
CN110250094A (en) | A kind of group breeding method of mutton sheep | |
CN109948569A (en) | A kind of three-dimensional hybrid expression recognition method using particle filter frame | |
CN110188662A (en) | A kind of AI intelligent identification Method of water meter number |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190917 |