CN110245717A - A gene expression profile clustering method based on machine learning


Info

Publication number
CN110245717A
Authority
CN
China
Prior art keywords
data
clustering
algorithm
cluster
rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910539449.8A
Other languages
Chinese (zh)
Inventor
彭绍亮
潘佳铭
张磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN201910539449.8A
Publication of CN110245717A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10: Gene or protein expression profiling; Expression-ratio estimation or normalisation

Abstract

The invention belongs to the field of machine learning and concerns clustering, in particular a machine-learning-based method for clustering gene expression profiles; it is an application of machine learning to the analysis of biological big data. Several single clustering algorithms are combined into one hybrid clustering method. This addresses the poor clustering that occurs in the traditional single-algorithm approach when the chosen algorithm does not suit the data set. By comparing the clustering results of the individual algorithms, the hybrid method obtains the best of those results, which addresses the problem of finding the optimal clustering solution. The method has a wide scope of application: it can be applied to any data set, and it is highly portable, since any single clustering algorithm can be used in it.

Description

A gene expression profile clustering method based on machine learning
Technical field:
The invention belongs to the field of machine learning and concerns clustering, in particular a machine-learning-based method for clustering gene expression profiles; it is an application of machine learning to the analysis of biological big data.
Background art:
Because human gene expression profiles cover all the genes of the human genome, gene expression profile data is inherently large in volume, high-dimensional and complex. Analysing gene expression profile data by clustering is of great significance for research into human gene expression, genetic diseases and diseases caused by cellular lesions.
In general, cluster analysis uses the distance between samples as its main criterion and aims to divide the samples of a data set into different classes (or clusters). During clustering, data samples assigned to the same class (or cluster) should be as similar as possible, while samples in different classes (or clusters) should be as different as possible. A clustering result with the optimal clustering effect has maximal similarity among samples of the same class (or cluster) and maximal difference between samples of different classes (or clusters).
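The criterion described above can be made concrete with a small calculation. The following is a minimal sketch, not part of the patent: it compares the mean pairwise distance inside clusters with the mean pairwise distance across clusters for two candidate partitions of the same toy points. The function name and the toy data are assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distances(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (within if lp == lq else between).append(dist(p, q))
    return sum(within) / len(within), sum(between) / len(between)


# Toy data (hypothetical): two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_pairwise_distances(points, [0, 0, 1, 1])  # groups match the geometry
bad = mean_pairwise_distances(points, [0, 1, 0, 1])   # groups cut across it
print("good partition:", good)
print("bad partition: ", bad)
```

For the partition that follows the geometry, the within-cluster distance is small and the between-cluster distance is large; for the other partition the relationship is reversed.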
The usual pattern for clustering gene expression profiles is to run a single clustering method on the data and then improve that method to obtain a better result. This traditional approach has two problems. First, because of the characteristics of gene expression profile data, not every clustering method suits it, so choosing one method for the analysis involves human judgment and lacks a scientific basis. Second, analysing the data with only one clustering method leaves the result highly uncertain: a single method may fail to find the optimal clustering, which can render the clustering result invalid.
Summary of the invention:
The purpose of the invention is to remedy the poor clustering caused by the characteristics of the data itself and by reliance on a single clustering method. To that end, the invention proposes a clustering method that addresses both problems at once: a hybrid clustering method composed of several clustering algorithms. This method does not require anyone to choose by hand an algorithm that suits the data; by comparing the clustering results of the several algorithms, it selects the algorithm with the best clustering effect and thereby also addresses the problem of finding the optimal clustering solution.
The technical solution of the invention is as follows:
A gene expression profile clustering method based on machine learning, comprising the following steps:
Step 1: Preprocess the raw data, comprising the following steps:
(1) Attach class labels to the gene expression profile data belonging to the same cell line, labelling the classes {t_0, t_1, …, t_i, …, t_n}, where t_i denotes that data of the same category is marked as the class named t_i; the labels distinguish the data of different cell lines;
(2) Thoroughly mix the gene expression profile data of the different cell lines, shuffling the data of each cell line, so that the data can be divided into training-set data and test-set data that reflect the data distribution of every cell line; data from the same cell line should be sufficiently dispersed and data from different cell lines sufficiently intermingled;
(3) Divide the thoroughly mixed data into training-group data and test-group data at ratios of 30%:70%, 50%:50% and 70%:30% of the data volume;
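Step 1 can be sketched in a few lines. This is a minimal illustration using only the Python standard library, under stated assumptions: the input is a hypothetical mapping from cell-line name to a list of expression vectors, the integer class ids stand in for the labels t_0 … t_n, and the first listed percentage in each ratio is taken to be the training share. The function names are not from the patent.

```python
import random


def label_and_shuffle(profiles_by_cell_line, seed=0):
    """Attach a class id to every profile, then shuffle all samples together."""
    samples = [
        (profile, class_id)
        for class_id, cell_line in enumerate(sorted(profiles_by_cell_line))
        for profile in profiles_by_cell_line[cell_line]
    ]
    random.Random(seed).shuffle(samples)  # mixes the cell lines together
    return samples


def split(samples, train_fraction):
    """Cut the shuffled samples into (training group, test group)."""
    cut = round(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


# Hypothetical toy input: cell-line name -> list of expression vectors.
data = {
    "cell_line_A": [[0.1 * i, 1.0] for i in range(10)],
    "cell_line_B": [[5.0 + 0.1 * i, 2.0] for i in range(10)],
}
samples = label_and_shuffle(data)
splits = {fraction: split(samples, fraction) for fraction in (0.3, 0.5, 0.7)}
for fraction, (train, test) in splits.items():
    print(fraction, len(train), len(test))
```

Fixing the seed keeps the three splits reproducible, which makes accuracy figures obtained at different ratios comparable.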
Step 2: Train the clustering algorithms, comprising the following steps:
(1) Form a hybrid clustering method from five clustering algorithms, namely KMeans, MiniBatchKMeans, Hierarchical clustering, GMM and Birch, which avoids the human factor in choosing a single clustering algorithm and allows the clustering result with the highest accuracy to be selected;
(2) Input the training-group data into the hybrid clustering method and train it; either these five single clustering algorithms can be used, or other clustering algorithms can be chosen according to need and data type, so that the method accommodates any single clustering algorithm;
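One possible reading of Step 2 is sketched below with scikit-learn, whose class names match several of the algorithm names listed. The patent does not name a library, so the use of scikit-learn, the mapping of "Hierarchical clustering" to `AgglomerativeClustering` and of "GMM" to `GaussianMixture`, all parameter values, and the synthetic training data are assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch, KMeans, MiniBatchKMeans
from sklearn.mixture import GaussianMixture


def make_hybrid(n_clusters, seed=0):
    """Return the five named algorithms as a name -> estimator mapping."""
    return {
        "KMeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=seed),
        "MiniBatchKMeans": MiniBatchKMeans(n_clusters=n_clusters, n_init=3, random_state=seed),
        "Hierarchical clustering": AgglomerativeClustering(n_clusters=n_clusters),
        "GMM": GaussianMixture(n_components=n_clusters, random_state=seed),
        "Birch": Birch(n_clusters=n_clusters),
    }


# Synthetic stand-in for training-group expression data: 3 well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_train = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

hybrid = make_hybrid(n_clusters=3)
# fit_predict trains each algorithm and returns one cluster id per training sample.
train_clusters = {name: model.fit_predict(X_train) for name, model in hybrid.items()}
for name, assigned in train_clusters.items():
    print(name, np.bincount(assigned))
```

Keeping the algorithms in a plain dictionary means another estimator can be added or swapped without touching the rest of the pipeline. Note that scikit-learn's `AgglomerativeClustering` has no `predict` method, so applying it to unseen test data needs a separate assignment rule.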
Step 3: Cluster the sample data using the test data, comprising the following steps:
(1) After clustering training on the training-group data is complete, input the test-group data into the trained clustering algorithms to perform predictive analysis on the test-group data;
(2) Compare the class label predicted for each test-group sample with the sample's own class label, and compute each clustering algorithm's prediction accuracy on the class labels of the test-set samples;
(3) Once the prediction accuracy of every clustering algorithm has been computed, select the algorithm with the highest prediction accuracy and output its prediction accuracy; if several clustering algorithms share the same highest accuracy, output the names and prediction accuracies of all of them;
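The comparison and selection in Step 3 can be sketched as follows. Cluster ids produced by a clustering algorithm are arbitrary, and the patent does not say how they are matched to class labels; this sketch assumes a majority-vote mapping (each cluster is given the most common true label among its members). The function names and the toy predictions are hypothetical.

```python
from collections import Counter, defaultdict


def prediction_accuracy(true_labels, cluster_ids):
    """Map each cluster to its most common true label, then score the match rate."""
    members = defaultdict(list)
    for true, cluster in zip(true_labels, cluster_ids):
        members[cluster].append(true)
    mapping = {c: Counter(ts).most_common(1)[0][0] for c, ts in members.items()}
    hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
    return hits / len(true_labels)


def best_algorithms(accuracies):
    """Return every (name, accuracy) pair that reaches the highest accuracy."""
    top = max(accuracies.values())
    return [(name, acc) for name, acc in accuracies.items() if acc == top]


# Hypothetical test-group labels and cluster assignments from three algorithms.
true_labels = [0, 0, 0, 1, 1, 1]
predictions = {
    "KMeans": [2, 2, 2, 5, 5, 5],  # perfect up to renaming of cluster ids
    "GMM": [1, 1, 1, 0, 0, 0],     # also perfect
    "Birch": [0, 0, 1, 1, 1, 1],   # one sample lands in the wrong cluster
}
accuracies = {name: prediction_accuracy(true_labels, p) for name, p in predictions.items()}
print(best_algorithms(accuracies))  # [('KMeans', 1.0), ('GMM', 1.0)]
```

Returning a list rather than a single winner covers the tie case described in item (3). Majority-vote mapping rewards partitions with many small clusters, so it is only meaningful when every algorithm is asked for the same number of clusters.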
Step 4: Display the best clustering prediction result graphically
The clustering result with the best prediction and the highest prediction accuracy is displayed graphically, with the clustering algorithm and its prediction accuracy marked on the figure. If several clustering algorithms share the same highest accuracy, the clustering results of all of them are displayed, and the name of each algorithm and its prediction accuracy are marked on the figure of its clustering result, so that the final clustering result is easy to inspect.
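A display of the kind described in Step 4 might look like the sketch below. The patent does not name a plotting library; matplotlib, the function name `plot_best_results`, the toy points and the choice to plot only the first two coordinates are all assumptions. Real expression data is high-dimensional and would need a 2-D projection first.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt


def plot_best_results(points, winners):
    """Draw one scatter panel per winning algorithm, titled with name and accuracy.

    winners: list of (algorithm_name, accuracy, cluster_ids) tuples.
    """
    fig, axes = plt.subplots(1, len(winners), figsize=(4 * len(winners), 4), squeeze=False)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    for ax, (name, accuracy, cluster_ids) in zip(axes[0], winners):
        ax.scatter(xs, ys, c=cluster_ids, cmap="viridis")  # colour by cluster id
        ax.set_title(f"{name} (accuracy {accuracy:.1%})")
    return fig


# Hypothetical tie between two algorithms on four toy points.
points = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0), (5.2, 4.8)]
winners = [("KMeans", 1.0, [0, 0, 1, 1]), ("GMM", 1.0, [1, 1, 0, 0])]
fig = plot_best_results(points, winners)
print([ax.get_title() for ax in fig.axes])
```

One panel per winner keeps the tie case readable; calling `fig.savefig(...)` with a path of your choice would write the figure to disk.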
Brief description of the drawings:
Fig. 1 is a flow chart of the processing of gene expression profile data in the invention;
Fig. 2 is a model diagram of the novel clustering method of the invention.
Detailed description of the embodiments:
The invention is described in further detail below with reference to the drawings and specific embodiments:
The purpose of the invention is to remedy the poor clustering that arises in traditional clustering from the characteristics of the data itself and from reliance on a single clustering algorithm.
As shown in Fig. 1 and Fig. 2, the gene expression profile clustering method based on machine learning provided by the invention comprises the following steps:
Step 1: Preprocess the raw data, comprising the following steps:
(1) Attach class labels to the gene expression profile data belonging to the same cell line, labelling the classes {t_0, t_1, …, t_i, …, t_n}, where t_i denotes that data of the same category is marked as the class named t_i; the labels distinguish the data of different cell lines;
(2) Thoroughly mix the gene expression profile data of the different cell lines, shuffling the data of each cell line, so that the data can be divided into training-set data and test-set data that reflect the data distribution of every cell line; data from the same cell line should be sufficiently dispersed and data from different cell lines sufficiently intermingled;
(3) Divide the thoroughly mixed data into training-group data and test-group data at ratios of 30%:70%, 50%:50% and 70%:30% of the data volume;
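Item (2) asks that both groups reflect the data distribution of every cell line. A plain shuffle achieves this only approximately; one way to guarantee it is a stratified split, sketched below. Stratification is an assumption added for illustration and is not stated in the patent; the function name and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_fraction, seed=0):
    """Return (train_indices, test_indices) with every class split at the same fraction."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    rng.shuffle(train)  # disperse samples of the same class within each group
    rng.shuffle(test)
    return train, test


labels = [0] * 20 + [1] * 10 + [2] * 10  # hypothetical, unbalanced cell lines
train_idx, test_idx = stratified_split(labels, 0.5)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

With the 50% ratio, each group receives half of every cell line, so the class proportions of the full data set are preserved in both groups.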
Step 2: Train the clustering algorithms, comprising the following steps:
(1) Form a hybrid clustering method from five clustering algorithms, namely KMeans, MiniBatchKMeans, Hierarchical clustering, GMM and Birch, which avoids the human factor in choosing a single clustering algorithm and allows the clustering result with the highest accuracy to be selected;
(2) Input the training-group data into the hybrid clustering method and train it; either these five single clustering algorithms can be used, or other clustering algorithms can be chosen according to need and data type, so that the method accommodates any single clustering algorithm;
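The statement that other algorithms can be substituted suggests a simple plug-in design. The sketch below shows one way to express it in plain Python: any callable that takes a list of points and returns one cluster id per point can be registered. The class name `HybridClustering`, its methods, and the two trivial stand-in algorithms are hypothetical and only illustrate the interface.

```python
from typing import Callable, Dict, List, Sequence

Point = Sequence[float]
Clusterer = Callable[[List[Point]], List[int]]  # points in, one cluster id per point out


class HybridClustering:
    """Holds any number of single clustering algorithms under readable names."""

    def __init__(self) -> None:
        self._algorithms: Dict[str, Clusterer] = {}

    def register(self, name: str, algorithm: Clusterer) -> None:
        self._algorithms[name] = algorithm

    def run_all(self, points: List[Point]) -> Dict[str, List[int]]:
        """Run every registered algorithm on the same points."""
        return {name: algo(points) for name, algo in self._algorithms.items()}


# Two deliberately trivial stand-ins; real algorithms would be registered the same way.
def split_on_first_feature(points):
    threshold = sum(p[0] for p in points) / len(points)
    return [int(p[0] > threshold) for p in points]


def single_cluster(points):
    return [0] * len(points)


hybrid = HybridClustering()
hybrid.register("threshold", split_on_first_feature)
hybrid.register("single", single_cluster)
results = hybrid.run_all([(0.0, 1.0), (0.2, 0.9), (5.0, 1.1), (5.3, 1.0)])
print(results)  # {'threshold': [0, 0, 1, 1], 'single': [0, 0, 0, 0]}
```

Because the container depends only on the callable signature, adding a sixth algorithm is a single `register` call.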
Step 3: Cluster the sample data using the test data, comprising the following steps:
(1) After clustering training on the training-group data is complete, input the test-group data into the trained clustering algorithms to perform predictive analysis on the test-group data;
(2) Compare the class label predicted for each test-group sample with the sample's own class label, and compute each clustering algorithm's prediction accuracy on the class labels of the test-set samples;
(3) Once the prediction accuracy of every clustering algorithm has been computed, select the algorithm with the highest prediction accuracy and output its prediction accuracy; if several clustering algorithms share the same highest accuracy, output the names and prediction accuracies of all of them;
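Item (1) presupposes that a trained clustering algorithm can assign unseen test samples to clusters. Some algorithms, hierarchical clustering among them, only partition the data they were given. One common workaround, shown here as an assumption rather than as the patent's procedure, is to assign each test sample to the nearest centroid of the training clusters. All names and data below are hypothetical.

```python
from collections import defaultdict
from math import dist


def centroids(points, cluster_ids):
    """Mean vector of each training cluster."""
    groups = defaultdict(list)
    for point, cluster in zip(points, cluster_ids):
        groups[cluster].append(point)
    return {
        cluster: tuple(sum(axis) / len(members) for axis in zip(*members))
        for cluster, members in groups.items()
    }


def assign(points, cluster_centroids):
    """Assign each point to the cluster whose centroid is closest."""
    return [
        min(cluster_centroids, key=lambda c: dist(p, cluster_centroids[c]))
        for p in points
    ]


train_points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0)]
train_clusters = [0, 0, 1, 1]  # hypothetical result of clustering the training group
test_points = [(0.4, 0.2), (9.6, 8.7)]
test_clusters = assign(test_points, centroids(train_points, train_clusters))
print(test_clusters)  # [0, 1]
```

The resulting cluster ids for the test group can then be compared with the true class labels as described in item (2).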
Step 4: Display the best clustering prediction result graphically
The clustering result with the best prediction and the highest prediction accuracy is displayed graphically, with the clustering algorithm and its prediction accuracy marked on the figure. If several clustering algorithms share the same highest accuracy, the clustering results of all of them are displayed, and the name of each algorithm and its prediction accuracy are marked on the figure of its clustering result, so that the final clustering result is easy to inspect.

Claims (1)

1. A gene expression profile clustering method based on machine learning, characterized by comprising the following steps:
Step 1: Preprocess the raw data, comprising the following steps:
(1) Attach class labels to the gene expression profile data belonging to the same cell line, labelling the classes {t_0, t_1, …, t_i, …, t_n}, where t_i denotes that data of the same category is marked as the class named t_i; the labels distinguish the data of different cell lines;
(2) Thoroughly mix the gene expression profile data of the different cell lines, shuffling the data of each cell line, so that the data can be divided into training-set data and test-set data that reflect the data distribution of every cell line; data from the same cell line should be sufficiently dispersed and data from different cell lines sufficiently intermingled;
(3) Divide the thoroughly mixed data into training-group data and test-group data at ratios of 30%:70%, 50%:50% and 70%:30% of the data volume;
Step 2: Train the clustering algorithms, comprising the following steps:
(1) Form a hybrid clustering method from five clustering algorithms, namely KMeans, MiniBatchKMeans, Hierarchical clustering, GMM and Birch, which avoids the human factor in choosing a single clustering algorithm and allows the clustering result with the highest accuracy to be selected;
(2) Input the training-group data into the hybrid clustering method and train it; either these five single clustering algorithms can be used, or other clustering algorithms can be chosen according to need and data type, so that the method accommodates any single clustering algorithm;
Step 3: Cluster the sample data using the test data, comprising the following steps:
(1) After clustering training on the training-group data is complete, input the test-group data into the trained clustering algorithms to perform predictive analysis on the test-group data;
(2) Compare the class label predicted for each test-group sample with the sample's own class label, and compute each clustering algorithm's prediction accuracy on the class labels of the test-set samples;
(3) Once the prediction accuracy of every clustering algorithm has been computed, select the algorithm with the highest prediction accuracy and output its prediction accuracy; if several clustering algorithms share the same highest accuracy, output the names and prediction accuracies of all of them;
Step 4: Display the best clustering prediction result graphically
The clustering result with the best prediction and the highest prediction accuracy is displayed graphically, with the clustering algorithm and its prediction accuracy marked on the figure. If several clustering algorithms share the same highest accuracy, the clustering results of all of them are displayed, and the name of each algorithm and its prediction accuracy are marked on the figure of its clustering result, so that the final clustering result is easy to inspect.
CN201910539449.8A 2019-06-20 2019-06-20 A gene expression profile clustering method based on machine learning Pending CN110245717A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910539449.8A CN110245717A (en) 2019-06-20 2019-06-20 A gene expression profile clustering method based on machine learning

Publications (1)

Publication Number Publication Date
CN110245717A true CN110245717A (en) 2019-09-17

Family

ID=67888428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910539449.8A Pending CN110245717A (en) 2019-06-20 2019-06-20 A gene expression profile clustering method based on machine learning

Country Status (1)

Country Link
CN (1) CN110245717A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105745659A (en) * 2013-09-16 2016-07-06 佰欧迪塞克斯公司 Classifier generation method using combination of mini-classifiers with regularization and uses thereof
CN105468638A (en) * 2014-09-09 2016-04-06 中国银联股份有限公司 Data classification method, system and classifier implementation method
US20160070950A1 (en) * 2014-09-10 2016-03-10 Agency For Science, Technology And Research Method and system for automatically assigning class labels to objects
CN108874959A (en) * 2018-06-06 2018-11-23 电子科技大学 A kind of user's dynamic interest model method for building up based on big data technology
CN109887272A (en) * 2018-12-26 2019-06-14 阿里巴巴集团控股有限公司 A kind of prediction technique and device of traffic flow of the people

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
彭绍亮 等: "生物效应大数据评估聚类算法的并行化", 《大数据》 *
路东方 等: "生物大数据中的聚类方法分析", 《上海大学学报(自然科学版)》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735536A (en) * 2020-12-23 2021-04-30 湖南大学 Single cell integrated clustering method based on subspace randomization
CN113010615A (en) * 2021-04-12 2021-06-22 安徽农业大学 Hierarchical data visualization method based on Gaussian mixture model clustering algorithm
CN113010615B (en) * 2021-04-12 2021-10-01 安徽农业大学 Hierarchical data visualization method based on Gaussian mixture model clustering algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190917
