CN110245717A

CN110245717A - A gene expression profile clustering method based on machine learning

Info

Publication number: CN110245717A
Application number: CN201910539449.8A
Authority: CN
Inventors: 彭绍亮; 潘佳铭; 张磊
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2019-06-20
Filing date: 2019-06-20
Publication date: 2019-09-17

Abstract

The invention belongs to the field of machine learning and relates to clustering, in particular to a machine-learning-based method for clustering gene expression profiles; it is an application of machine learning to biological big data analysis. By combining multiple single clustering algorithms into one hybrid clustering method, it addresses the poor clustering results that arise in traditional single-algorithm clustering when the chosen algorithm does not suit the data set. The hybrid method compares the clustering results of the individual algorithms and selects the best one, thereby addressing the problem of finding the optimal clustering. The method is widely applicable and can be used on all data sets, and it is highly portable: any single clustering algorithm can be plugged in.

Description

A gene expression profile clustering method based on machine learning

Technical field:

The invention belongs to the field of machine learning and relates to clustering, in particular to a machine-learning-based method for clustering gene expression profiles; it is an application of machine learning to biological big data analysis.

Background:

Because the human gene expression profile covers all genes of the human genome, gene expression profile data are inherently large in volume, high-dimensional, and complex. Analyzing gene expression profile data by clustering is of great significance for research on human gene expression, on various genetic diseases, and on diseases caused by cellular pathology.

In general, cluster analysis uses the distance between samples as its main criterion, and its purpose is to divide the data set into different classes (or clusters). During cluster analysis, data samples assigned to the same class (or cluster) should be as similar as possible, while data samples in different classes (or clusters) should be as different as possible. The clustering result with the best clustering quality has maximal similarity among samples of the same class (or cluster) and maximal difference between samples of different classes (or clusters).

The usual approach to clustering gene expression profiles is to run a single clustering method on the data and then make the result more satisfactory by improving that method. This traditional approach has several problems. First, because of the characteristics of gene expression profile data, not every clustering method is suitable for them, so choosing one method for the analysis involves human judgment and lacks a scientific basis. Second, when only one clustering method is used, the clustering result is highly uncertain: a single method may fail to find the optimal clustering, which can render the result invalid.

Summary of the invention:

The object of the invention is to solve the problem of poor clustering results caused by the characteristics of the data themselves and by the use of a single clustering method. To this end, the invention proposes a clustering method that addresses both problems at once: a hybrid clustering method composed of multiple clustering algorithms. This method does not require manually choosing an algorithm that suits the data; by comparing the clustering results of multiple algorithms it selects the algorithm with the best clustering quality, and thereby also addresses the problem of finding the optimal clustering.

The technical solution of the invention is as follows:

A gene expression profile clustering method based on machine learning, comprising the following steps:

Step 1: preprocess the raw data, comprising the following steps:

(1) Attach a class label to the gene expression profile data belonging to the same cell line, with the labels denoted {t₀, t₁, …, t_i, …, t_n}, where t_i indicates that data of the same category are marked as the class named t_i; the labels distinguish the data of different cell lines;

(2) Thoroughly mix the gene expression profile data of the different cell lines and shuffle them, so that the training set and test set split from them reflect the data distribution of each cell line: the data of any one cell line are sufficiently dispersed and the data of different cell lines are sufficiently interleaved;

(3) Divide all of the thoroughly mixed data into training-group data and test-group data at three ratios: 30% and 70% of the data volume, 50% and 50% of the data volume, and 70% and 30% of the data volume;
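
Sub-steps (1) to (3) of Step 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `label_shuffle_split`, the dictionary input format, the fixed random seed, and the toy two-value profiles are all introduced here for demonstration only.

```python
import random


def label_shuffle_split(cell_line_data, train_fraction, seed=0):
    """Label, shuffle, and split expression profiles.

    cell_line_data: dict mapping a cell line name to a list of profiles
    (hypothetical input format). Returns (training_group, test_group),
    each a list of (profile, class_label) pairs.
    """
    # (1) Give every profile of the i-th cell line the class label "t<i>".
    labelled = []
    for i, (_, profiles) in enumerate(sorted(cell_line_data.items())):
        labelled.extend((profile, f"t{i}") for profile in profiles)

    # (2) Shuffle so that the cell lines are interleaved before splitting.
    rng = random.Random(seed)
    rng.shuffle(labelled)

    # (3) The first train_fraction of the data forms the training group.
    n_train = round(len(labelled) * train_fraction)
    return labelled[:n_train], labelled[n_train:]


# Toy data: three cell lines with ten identical two-value profiles each.
cell_lines = {
    "line_A": [[1.0, 0.1]] * 10,
    "line_B": [[5.0, 5.2]] * 10,
    "line_C": [[9.0, 0.3]] * 10,
}

# The three training fractions named in sub-step (3).
splits = {f: label_shuffle_split(cell_lines, f) for f in (0.3, 0.5, 0.7)}
```

Each entry of `splits` holds one training/test partition, so the later steps could be repeated once per ratio.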

Step 2: train the clustering algorithms, comprising the following steps:

(1) Form the hybrid clustering method from five clustering algorithms, namely KMeans, MiniBatchKMeans, hierarchical clustering, GMM, and Birch, which avoids the human factor in choosing a single clustering algorithm and allows the clustering result with the highest accuracy to be selected;

(2) Input the training-group data into the hybrid clustering method for training. Either these five single clustering algorithms can be used, or other clustering algorithms can be chosen according to need and data type, so that the method is applicable to any single clustering algorithm;
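
One possible way to assemble and train the five algorithms of Step 2 is sketched below with scikit-learn. The source does not name a library, so the choice of scikit-learn, the helper names `build_hybrid` and `train_hybrid`, the hyperparameters, and the synthetic blobs standing in for expression profiles are all assumptions made for illustration.

```python
from sklearn.cluster import AgglomerativeClustering, Birch, KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture


def build_hybrid(n_clusters, seed=0):
    """Return the five single clustering algorithms, keyed by name."""
    return {
        "KMeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=seed),
        "MiniBatchKMeans": MiniBatchKMeans(
            n_clusters=n_clusters, n_init=3, random_state=seed
        ),
        # Agglomerative clustering is used here as the hierarchical method.
        "Hierarchical clustering": AgglomerativeClustering(n_clusters=n_clusters),
        "GMM": GaussianMixture(n_components=n_clusters, random_state=seed),
        "Birch": Birch(n_clusters=n_clusters),
    }


def train_hybrid(models, X_train):
    """Fit every algorithm on the training group; return its cluster ids."""
    return {name: model.fit_predict(X_train) for name, model in models.items()}


# Synthetic stand-in for training-group profiles: 3 "cell lines", 20 features.
X_train, _ = make_blobs(n_samples=90, centers=3, n_features=20, random_state=0)

models = build_hybrid(n_clusters=3)
train_clusters = train_hybrid(models, X_train)
```

Because the algorithms live in a plain dictionary, swapping in a different single clustering algorithm only requires adding or replacing an entry. Note that scikit-learn's `AgglomerativeClustering` has no `predict` method, so applying it to test-group data would need an additional assignment rule that the source does not specify.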

Step 3: cluster the sample data using the test data, comprising the following steps:

(1) After training on the training-group data is complete, input the test-group data into the trained clustering algorithms to predict the classes of the test-group data;

(2) Compare the class label predicted for each test-group sample with that sample's own class label, and compute each clustering algorithm's prediction accuracy on the class labels of the test-set samples;

(3) After the prediction accuracy of every clustering algorithm has been computed, select the algorithm with the highest prediction accuracy and output its prediction accuracy; if several clustering algorithms share the same highest accuracy, output the names and prediction accuracies of all of them;
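
The accuracy comparison and tie-aware selection of Step 3 can be sketched as follows. The source does not state how arbitrary cluster ids are matched to class labels; the majority-vote mapping learned on the training group is an assumption made here, as are the function names and the hand-written toy cluster assignments.

```python
from collections import Counter, defaultdict


def cluster_to_label_map(train_clusters, train_labels):
    """Map each cluster id to the most common class label inside it."""
    votes = defaultdict(Counter)
    for cluster, label in zip(train_clusters, train_labels):
        votes[cluster][label] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}


def prediction_accuracy(test_clusters, test_labels, mapping):
    """Fraction of test samples whose mapped label equals their own label."""
    hits = sum(mapping.get(c) == y for c, y in zip(test_clusters, test_labels))
    return hits / len(test_labels)


def best_algorithms(accuracies, tol=1e-12):
    """Return every algorithm tied for the highest accuracy."""
    top = max(accuracies.values())
    return {name: acc for name, acc in accuracies.items() if abs(acc - top) <= tol}


# Toy class labels and cluster ids (illustrative values, not real results).
train_labels = ["t0", "t0", "t1", "t1"]
test_labels = ["t0", "t1", "t1", "t0"]
outputs = {
    "KMeans": {"train": [0, 0, 1, 1], "test": [0, 1, 1, 0]},
    "Birch": {"train": [1, 1, 0, 0], "test": [1, 0, 0, 1]},
    "GMM": {"train": [0, 0, 1, 1], "test": [0, 0, 1, 0]},
}

accuracies = {}
for name, out in outputs.items():
    mapping = cluster_to_label_map(out["train"], train_labels)
    accuracies[name] = prediction_accuracy(out["test"], test_labels, mapping)

winners = best_algorithms(accuracies)
for name, acc in winners.items():
    print(f"{name}: prediction accuracy {acc:.2%}")
```

In this toy case KMeans and Birch classify every test sample correctly even though they number their clusters differently, so both are reported, which matches the tie rule in sub-step (3).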

Step 4: display the best clustering prediction result graphically

The clustering result with the best prediction result and the highest prediction accuracy is displayed graphically, with the clustering algorithm and its prediction accuracy marked on the figure. If several clustering algorithms share the same highest accuracy, the clustering results of all of them are displayed, and the name of the clustering algorithm and its prediction accuracy are marked on the figure of each clustering result, so that the final clustering result can be observed easily.
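
A display of this kind could be produced with matplotlib, as in the sketch below. The source does not describe the plot type or how high-dimensional profiles are reduced to two axes; the scatter plot, the use of two coordinates per sample, the function name `plot_best_results`, and the toy points are assumptions for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt


def plot_best_results(points, results, path=None):
    """Draw one panel per best algorithm, titled with name and accuracy.

    points: list of (x, y) coordinates, one per test sample.
    results: dict mapping algorithm name to (cluster_ids, accuracy).
    """
    n = len(results)
    fig, axes = plt.subplots(1, n, figsize=(4 * n, 4), squeeze=False)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    for ax, (name, (clusters, acc)) in zip(axes[0], results.items()):
        # Color each sample by the cluster the algorithm assigned to it.
        ax.scatter(xs, ys, c=clusters, cmap="viridis")
        ax.set_title(f"{name} (accuracy = {acc:.2%})")
    fig.tight_layout()
    if path is not None:
        fig.savefig(path)
    return fig


# Toy test samples and two algorithms tied at the highest accuracy.
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
tied_best = {
    "KMeans": ([0, 0, 1, 1], 1.0),
    "Birch": ([1, 1, 0, 0], 1.0),
}
fig = plot_best_results(points, tied_best)
```

Passing a file name as `path` would also write the figure to disk; with a single best algorithm the same function draws a single panel.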

Brief description of the drawings:

Fig. 1 is the processing flowchart for gene expression profile data in the invention;

Fig. 2 is a model diagram of the novel clustering method of the invention.

Specific embodiments:

The invention is described in further detail below with reference to the accompanying drawings and specific embodiments:

The object of the invention is to solve the poor clustering results caused, in traditional clustering, by the characteristics of the data themselves and by the use of a single clustering algorithm.

As shown in Fig. 1 and Fig. 2, the gene expression profile clustering method based on machine learning provided by the invention comprises the following steps:

Step 1: preprocess the raw data, comprising the following steps:

Step 2: train the clustering algorithms, comprising the following steps:

Step 4: display the best clustering prediction result graphically

Claims

1. A gene expression profile clustering method based on machine learning, characterized by comprising the following steps:

Step 1: preprocess the raw data, comprising the following steps:

(2) Thoroughly mix the gene expression profile data of the different cell lines and shuffle them, so that the training set and test set split from them reflect the data distribution of each cell line: the data of any one cell line are sufficiently dispersed and the data of different cell lines are sufficiently interleaved;

(3) Divide all of the thoroughly mixed data into training-group data and test-group data at three ratios: 30% and 70% of the data volume, 50% and 50% of the data volume, and 70% and 30% of the data volume;

Step 2: train the clustering algorithms, comprising the following steps:

(1) Form the hybrid clustering method from five clustering algorithms, namely KMeans, MiniBatchKMeans, hierarchical clustering, GMM, and Birch, which avoids the human factor in choosing a single clustering algorithm and allows the clustering result with the highest accuracy to be selected;

(2) Input the training-group data into the hybrid clustering method for training. Either these five single clustering algorithms can be used, or other clustering algorithms can be chosen according to need and data type, so that the method is applicable to any single clustering algorithm;

(1) After training on the training-group data is complete, input the test-group data into the trained clustering algorithms to predict the classes of the test-group data;

(2) Compare the class label predicted for each test-group sample with that sample's own class label, and compute each clustering algorithm's prediction accuracy on the class labels of the test-set samples;

(3) After the prediction accuracy of every clustering algorithm has been computed, select the algorithm with the highest prediction accuracy and output its prediction accuracy; if several clustering algorithms share the same highest accuracy, output the names and prediction accuracies of all of them;

Step 4: display the best clustering prediction result graphically

The clustering result with the best prediction result and the highest prediction accuracy is displayed graphically, with the clustering algorithm and its prediction accuracy marked on the figure. If several clustering algorithms share the same highest accuracy, the clustering results of all of them are displayed, and the name of the clustering algorithm and its prediction accuracy are marked on the figure of each clustering result, so that the final clustering result can be observed easily.