CN111612101B - Gene expression data clustering method, device and equipment of nonparametric Watson mixed model - Google Patents

Publication number
CN111612101B
CN111612101B CN202010499785.7A CN202010499785A CN111612101B CN 111612101 B CN111612101 B CN 111612101B CN 202010499785 A CN202010499785 A CN 202010499785A CN 111612101 B CN111612101 B CN 111612101B
Authority
CN
China
Prior art keywords
watson
gene expression
nonparametric
expression data
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010499785.7A
Other languages
Chinese (zh)
Other versions
CN111612101A (en)
Inventor
范文涛
侯文娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaqiao University
Original Assignee
Huaqiao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaqiao University
Priority to CN202010499785.7A
Publication of CN111612101A
Application granted
Publication of CN111612101B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155: Bayesian classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a gene expression data clustering method, device and equipment of a nonparametric Watson mixed model, wherein the method comprises the following steps: S101, acquiring a gene data set to be clustered, the gene data set comprising N gene expression data vectors; S102, modeling the gene expression data vectors by using a nonparametric Watson mixed model; S103, estimating model parameters of the nonparametric Watson mixed model through a variational Bayesian inference algorithm; S104, judging whether the nonparametric Watson mixed model converges according to the estimated model parameters; if not, returning to step S103; if so, executing step S105; and S105, judging the category of each gene expression data vector according to the posterior probability of the indicator factor, and clustering the gene expression data vectors according to the category. Because the model has a discount parameter that can be used to control the generation of new categories, the embodiment may achieve better clustering results than methods based on DP mixture models when dealing with unbalanced data.

Description

Gene expression data clustering method, device and equipment of nonparametric Watson mixed model
Technical Field
The invention relates to the field of data mining, in particular to a gene expression data clustering method, device and equipment of a nonparametric Watson mixed model.
Background
With the development of modern biotechnology, especially the implementation of genome projects, a great deal of gene expression data has been obtained in recent years. However, among the large amount of gene sequence data obtained, the functions of only a few genes are known, and the functions of most genes are unknown. Therefore, it is necessary to group gene sequences having similar functions into the same class by a cluster analysis technique. Since the gene sequences in the same class have similar functions, the known functions of genes in a class can be used to predict the functions of the genes of unknown function in that class. The Watson distribution is suitable for describing axial data, a kind of directional data in which a unit vector x and its opposite -x are equivalent. When clustering axially symmetric data (such as gene expression data after L2 normalization), a Watson mixed model performs markedly better than other common mixed models (such as a Gaussian mixed model).
In the prior art, Wentao Fan et al. proposed a clustering method of a Watson mixed model based on a Dirichlet Process (DP) and applied it to gene expression data clustering analysis. In this method, gene expression data are preprocessed and then normalized using the L2 norm, and each gene data vector is assumed to obey the DP-based Watson mixed model. The model parameters are estimated using a variational Bayesian inference method. This approach has the following disadvantage:
The clustering method proposed in the prior art is based on a Watson mixed model constructed within a Dirichlet Process (DP) framework. It performs poorly in cluster analysis of unbalanced data, because DP mixture models typically cannot identify classes that contain only a small number of data samples.
Disclosure of Invention
In view of the above, the present invention provides a method, an apparatus and a device for clustering gene expression data in a nonparametric Watson mixture model, which have discount parameters for controlling generation of new category numbers, so that the method is more advantageous than the method based on a DP mixture model in processing unbalanced data, and thus better clustering results can be obtained.
The embodiment of the invention provides a gene expression data clustering method of a nonparametric Watson mixed model, which comprises the following steps:
S101, acquiring a gene data set to be clustered; wherein the gene data set comprises N gene expression data vectors;
S102, modeling the gene expression data vectors by using a nonparametric Watson mixed model;
S103, estimating model parameters of the nonparametric Watson mixed model through a variational Bayesian inference algorithm;
S104, judging whether the nonparametric Watson mixed model converges according to the estimated model parameters; if not, returning to step S103; if so, executing step S105;
and S105, judging the category of each gene expression data vector according to the posterior probability of the indicator factor, and clustering the gene expression data vectors according to the category.
Preferably, the modeling of the gene expression data vector by using a non-parametric Watson hybrid model specifically includes:
for D-dimensional vectors obeying Watson probability distribution
Figure BDA0002524363890000031
The probability density function is defined as:
Figure BDA0002524363890000032
wherein,
Figure BDA0002524363890000033
is a data set containing N gene expression data vectors,
Figure BDA0002524363890000034
is a position parameter and satisfies a condition
Figure BDA0002524363890000035
‖·‖ denotes the L2 norm; γ is a scale parameter satisfying the condition γ > 0; Γ(·) is the Gamma function, and M(·) is the Kummer function;
assuming that each gene expression data vector obeys the nonparametric Watson mixed model, the probability density function of each gene expression data vector is expressed as follows:
Figure BDA0002524363890000036
the nonparametric Watson mixed model consists of an infinite number of mixture components, and each mixture component corresponds to one Watson probability distribution
Figure BDA0002524363890000037
is the parameter of the kth mixture component, and π_k > 0 is the corresponding mixing coefficient, which satisfies the condition
Figure BDA0002524363890000038
for each gene expression data vector
Figure BDA0002524363890000039
assigning a binary hidden variable
Figure BDA00025243638900000310
as an indicator factor: Z_nk = 1 indicates that the gene expression data vector
Figure BDA00025243638900000311
belongs to the kth category; otherwise, Z_nk = 0; the hidden variable
Figure BDA00025243638900000312
has the probability distribution
Figure BDA00025243638900000313
assigning prior probability distributions to the parameters
Figure BDA0002524363890000041
and
Figure BDA0002524363890000042
of the nonparametric Watson mixed model, wherein a Watson-Gamma distribution is adopted as the joint prior distribution of the parameters
Figure BDA0002524363890000043
as follows:
Figure BDA0002524363890000044
where p_g(·) is the Gamma distribution;
obtaining a total probability expression of a nonparametric Watson mixed model based on a Pitman-Yor process model:
Figure BDA0002524363890000045
Preferably, the nonparametric Watson mixed model is constructed based on a Pitman-Yor process model using the stick-breaking representation; in the Pitman-Yor process model based on the stick-breaking representation, the mixing coefficient π_k is represented as follows:
Figure BDA0002524363890000046
Figure BDA0002524363890000047
follows the Beta distribution, expressed as follows:
Figure BDA0002524363890000048
where p_b(·) is the Beta distribution, ζ is the discount parameter of the Pitman-Yor process model and satisfies the condition 0 ≤ ζ ≤ 1, and ξ is the density parameter and satisfies the condition ξ > -ζ.
Preferably, estimating the model parameters of the nonparametric Watson mixed model through a variational Bayesian inference algorithm, and judging whether the nonparametric Watson mixed model converges according to the estimated model parameters, specifically comprises:
initializing model parameters, including: initializing the number of truncation layers K = 15; initializing the hyperparameters 0 < a_k < 1, 0 < b_k < 1,
Figure BDA0002524363890000051
β_k = 1, ζ_k = 0.5, ξ_k = 0.5; initializing r_nk using the K-Means algorithm; initializing
Figure BDA0002524363890000052
updating the variational posteriors and the expected values by using the current model parameters;
obtaining updated expected values
Figure BDA0002524363890000053
obtaining the variational lower bound generated by the current iteration;
and comparing the variational lower bound generated by the current iteration with that generated by the previous iteration to judge whether the nonparametric Watson mixed model converges.
Preferably, updating the variational posteriors and the expected values by using the current model parameters specifically comprises:
defining the lower bound of variation as:
L(q)=<lnp(Θ|X)>-<lnq(Θ)>
where <·> denotes the expectation, and
Figure BDA0002524363890000054
is the set of all random variables and hidden variables; q(Θ) is an approximation of the true posterior distribution p(Θ|X), namely the variational posterior; the variational posterior q(Θ) is expressed as follows:
Figure BDA0002524363890000055
truncating the mixture components from an infinite-dimensional space to a K-dimensional space using a truncation technique:
π′_K = 1,
Figure BDA0002524363890000056
π_k = 0 when k > K;
where K is the number of truncation layers, namely the number of categories; the value of K reaches its optimal value at convergence;
all variational posteriors are optimized by maximizing the variational lower bound L(q):
Figure BDA0002524363890000057
Figure BDA0002524363890000058
Figure BDA0002524363890000061
the hyperparameters in the formula are calculated by formulas (4) to (11):
Figure BDA0002524363890000062
Figure BDA0002524363890000063
Figure BDA0002524363890000064
Figure BDA0002524363890000065
Figure BDA0002524363890000066
Figure BDA0002524363890000067
Figure BDA0002524363890000068
Figure BDA0002524363890000069
Figure BDA00025243638900000610
largest eigenvalue of the above expression (12)
Figure BDA00025243638900000611
eigenvector (13)
The expected values in the above equations are calculated by the following equations:
<Z_nk> = r_nk (14)
Figure BDA0002524363890000071
Figure BDA0002524363890000072
<ln π′_k> = Ψ(g_k) - Ψ(g_k + h_k) (17)
<ln(1 - π′_k)> = Ψ(h_k) - Ψ(g_k + h_k) (18)
where Ψ(·) is the digamma function.
Preferably, comparing the lower bound of variation generated by the current iteration with the lower bound of variation generated by the last iteration to determine whether the non-parametric Watson hybrid model converges specifically is:
judging whether the difference between the variational lower bound generated by the current iteration and that generated by the previous iteration is smaller than a preset threshold, the preset threshold being 0.0001;
If yes, judging that the nonparametric Watson mixed model converges;
if not, judging that the nonparametric Watson mixed model does not converge.
Preferably, the class of each gene expression data vector is determined according to the posterior probability of the indicator, so that the gene expression data vectors are clustered according to the class, specifically:
obtaining the posterior probability r_nk of the indicator factor, where r_nk represents the probability that the nth gene expression data vector
Figure BDA0002524363890000074
belongs to the kth class;
selecting, for the gene expression data vector
Figure BDA0002524363890000073
the class with the highest probability as its class.
Preferably, the method further comprises the following steps:
preprocessing the gene expression data vector:
removing data containing empty items in the gene expression data vector;
removing data with characteristic value 'NAN' in the gene expression data vector;
removing data in the gene expression data vector which has small change with time;
the remaining gene expression data vectors were normalized to the L2 norm.
The embodiment of the invention also provides a gene expression data clustering device of the nonparametric Watson mixed model, comprising:
the gene data set acquisition unit is used for acquiring a gene data set to be clustered; wherein the gene data set comprises N gene expression data vectors;
the modeling unit is used for modeling the gene expression data vector by using a nonparametric Watson mixed model;
the model parameter estimation unit is used for estimating the model parameters of the nonparametric Watson mixed model through a variational Bayes inference algorithm;
a convergence judging unit, configured to judge whether the nonparametric Watson hybrid model converges according to the estimated model parameter; if not, informing the model parameter estimation unit, if yes, informing the classification unit;
and the classification unit is used for judging the category of each gene expression data vector according to the posterior probability of the indicator factor so as to cluster the gene expression data vectors according to the category.
The embodiment of the invention also provides gene expression data clustering equipment of the non-parametric Watson mixed model, which comprises a memory and a processor, wherein a gene data set to be clustered and a computer program are stored in the memory, and the computer program can be executed by the processor so as to realize the gene expression data clustering method of the non-parametric Watson mixed model.
In summary, the embodiment constructs a non-parametric hybrid model based on Watson probability distribution by adopting a non-parametric model framework based on the Pitman-Yor process, and performs cluster analysis on the gene expression data vector by using the model. The Watson hybrid model constructed in the embodiment refers to a probability distribution formed by a weighted combination mode of a plurality of Watson probability distributions. In the method, each piece of gene expression data is preprocessed and then normalized by using an L2 norm, and then a Watson mixed model is used for modeling the obtained gene data vector. In order to flexibly and automatically adjust the number of gene data categories according to the size of data, the method uses a nonparametric model framework named as Pitman-Yor process to construct a nonparametric mixed model based on Watson probability distribution. The parameters of the proposed nonparametric Watson hybrid model are estimated by a variational Bayesian inference algorithm. Compared with the prior art, the mixture model based on the Pitman-Yor process provided by the embodiment has the discount parameters which can be used for controlling the generation of the new category number, so that the method has more advantages than a method based on a DP mixture model when processing unbalanced data, and a better clustering result can be obtained.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a gene expression data clustering method of a nonparametric Watson mixture model according to a first embodiment of the present invention.
Fig. 2 is another schematic flow chart of the gene expression data clustering method of the nonparametric Watson mixture model according to the first embodiment of the present invention.
Fig. 3 is a schematic diagram of program modules of a gene expression data clustering device of a nonparametric Watson mixture model according to a second embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to Fig. 1, a first embodiment of the present invention provides a gene expression data clustering method of a nonparametric Watson mixture model, which is performed by a gene expression data clustering device of the nonparametric Watson mixture model (hereinafter referred to as the clustering device) and at least includes:
S101, acquiring a gene data set to be clustered; wherein the gene data set comprises N gene expression data vectors.
In this embodiment, the clustering device may be a computer device with a data processing function, such as a laptop, a desktop, or a server, and the computer device may implement the gene expression data clustering method of the nonparametric Watson mixture model by executing a predetermined program.
In this embodiment, after the gene data set is obtained, the gene data set may be further preprocessed to remove some interference data, where the preprocessing includes at least one of:
data containing empty entries in the gene expression data vector are removed.
Removing the data with characteristic value 'NAN' in the gene expression data vector.
Data in the gene expression data vector that has less variation over time is removed.
The remaining gene expression data vectors were normalized to the L2 norm.
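The preprocessing steps above can be sketched as follows. This is only an illustrative Python sketch, not the program of the embodiment: it assumes that empty items have been read in as NaN, and the name preprocess_expression and the cutoff min_std for "less variation over time" are hypothetical, since the embodiment does not state a cutoff.

```python
import numpy as np

def preprocess_expression(X, min_std=0.1):
    """Drop incomplete or nearly flat profiles, then L2-normalize each row.

    X: (N, D) array, one gene expression data vector per row.
    min_std: hypothetical cutoff for "small change over time" (an assumption).
    """
    X = np.asarray(X, dtype=float)
    # Remove vectors with empty / NaN entries (empty items assumed read as NaN).
    X = X[~np.isnan(X).any(axis=1)]
    # Remove vectors whose expression barely changes over time.
    X = X[X.std(axis=1) >= min_std]
    # Guard against zero vectors, then scale every row to unit L2 norm.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    keep = norms[:, 0] > 0
    return X[keep] / norms[keep]
```

After this step every remaining vector lies on the unit sphere, which is the setting in which the Watson distribution is defined.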
And S102, modeling the gene expression data vector by using a nonparametric Watson mixed model.
In this embodiment, step S102 specifically includes:
S1021, for a D-dimensional vector obeying the Watson probability distribution
Figure BDA0002524363890000111
its probability density function is defined as:
Figure BDA0002524363890000112
where
Figure BDA0002524363890000113
is a data set containing N gene expression data vectors,
Figure BDA0002524363890000114
is a position parameter and satisfies the condition
Figure BDA0002524363890000115
‖·‖ denotes the L2 norm; γ is a scale parameter satisfying the condition γ > 0; Γ(·) is the Gamma function, and M(·) is the Kummer function.
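For illustration, the density of step S1021 can be evaluated numerically as sketched below. The formula itself is not reproduced in this text, so the sketch assumes the standard form of the Watson density, W(x | μ, γ) = Γ(D/2) / (2 π^(D/2) M(1/2, D/2, γ)) · exp(γ (μᵀx)²), which is consistent with the Gamma function and Kummer function named above. The function name watson_logpdf is hypothetical, and the use of SciPy is a choice of this sketch.

```python
import numpy as np
from scipy.special import gammaln, hyp1f1

def watson_logpdf(x, mu, gamma):
    """Log-density of a Watson distribution on the unit sphere in R^D.

    Assumes the standard form (an assumption of this sketch):
        W(x | mu, gamma) = Gamma(D/2) / (2 * pi^(D/2) * M(1/2, D/2, gamma))
                           * exp(gamma * (mu^T x)^2),
    with ||x|| = ||mu|| = 1 and M the Kummer function (hyp1f1).
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    d = mu.shape[-1]
    log_norm = (gammaln(d / 2.0) - np.log(2.0) - (d / 2.0) * np.log(np.pi)
                - np.log(hyp1f1(0.5, d / 2.0, gamma)))
    # The squared inner product makes the density identical for x and -x.
    return log_norm + gamma * np.dot(x, mu) ** 2
```

Because the data enter only through (μᵀx)², a vector and its opposite receive the same density, which is the axial symmetry discussed in the Background.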
S1022, assuming that each gene expression data vector obeys the nonparametric Watson mixed model, the probability density function of each gene expression data vector is expressed as follows:
Figure BDA0002524363890000116
where the nonparametric Watson mixture model is composed of an infinite number of mixture components, each corresponding to a Watson probability distribution
Figure BDA0002524363890000117
where
Figure BDA0002524363890000118
is the parameter of the kth mixture component, and π_k > 0 is the corresponding mixing coefficient, which satisfies the condition
Figure BDA0002524363890000119
In this embodiment, the nonparametric Watson mixed model is constructed based on a Pitman-Yor process model using the stick-breaking representation. In the Pitman-Yor process model based on the stick-breaking representation, the mixing coefficient π_k is represented as follows:
Figure BDA0002524363890000121
Figure BDA0002524363890000122
follows the Beta distribution, expressed as follows:
Figure BDA0002524363890000123
where p_b(·) is the Beta distribution, ζ is the discount parameter of the Pitman-Yor process model and satisfies the condition 0 ≤ ζ ≤ 1, and ξ is the density parameter and satisfies the condition ξ > -ζ.
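The stick-breaking construction can be illustrated with the following Python sketch. Since the formulas are not reproduced in this text, it assumes the usual Pitman-Yor form, π′_k ~ Beta(1 - ζ, ξ + kζ) and π_k = π′_k ∏_{j<k} (1 - π′_j), and it truncates at K components by setting π′_K = 1. The function name and its arguments are hypothetical, and the sketch requires ζ < 1 so that the first Beta parameter stays positive.

```python
import numpy as np

def pitman_yor_stick_breaking(K, discount, concentration, rng=None):
    """Draw K mixing coefficients from a truncated Pitman-Yor stick-breaking prior.

    Assumed form (not confirmed by the source formulas):
        pi'_k ~ Beta(1 - discount, concentration + k * discount), k = 1..K
        pi_k  = pi'_k * prod_{j<k} (1 - pi'_j), with pi'_K forced to 1.
    """
    if not (0.0 <= discount < 1.0 and concentration > -discount):
        raise ValueError("need 0 <= discount < 1 and concentration > -discount")
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, K + 1)
    sticks = rng.beta(1.0 - discount, concentration + k * discount)
    sticks[-1] = 1.0  # truncation: the last stick takes all remaining mass
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
    return sticks * remaining
```

With discount = 0 the construction reduces to the Dirichlet process; a positive discount makes the later sticks smaller on average more slowly, which is one way to see why small classes are easier to retain.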
S1023, for each gene expression data vector
Figure BDA0002524363890000124
assigning a binary hidden variable
Figure BDA0002524363890000125
as an indicator factor: Z_nk = 1 indicates that the gene expression data vector
Figure BDA0002524363890000126
belongs to the kth category; otherwise, Z_nk = 0; the hidden variable
Figure BDA0002524363890000127
has the probability distribution
Figure BDA0002524363890000128
S1024, assigning prior probability distributions to the parameters
Figure BDA0002524363890000129
and
Figure BDA00025243638900001210
of the nonparametric Watson mixed model, wherein a Watson-Gamma distribution is adopted as the joint prior distribution of the parameters
Figure BDA00025243638900001211
as follows:
Figure BDA00025243638900001212
where p_g(·) is the Gamma distribution.
S1025, obtaining a total probability expression of a nonparametric Watson mixed model based on a Pitman-Yor process model:
Figure BDA00025243638900001213
Thus, a nonparametric Watson mixture model based on the Pitman-Yor process model is obtained.
S103, estimating model parameters of the nonparametric Watson mixed model through a variational Bayesian inference algorithm.
S104, judging whether the nonparametric Watson mixed model converges according to the estimated model parameters; if not, returning to step S103; if so, executing step S105.
Specifically, the method comprises the following steps:
Firstly, the model parameters are initialized, including: initializing the number of truncation layers K = 15; initializing the hyperparameters 0 < a_k < 1, 0 < b_k < 1,
Figure BDA0002524363890000131
β_k = 1, ζ_k = 0.5, ξ_k = 0.5; initializing r_nk using the K-Means algorithm; initializing
Figure BDA0002524363890000132
The variational posteriors and the expected values are then updated with the current model parameters.
Wherein, the lower bound of the variation is defined as follows:
L(q)=<lnp(Θ|X)>-<lnq(Θ)>
Here, <·> denotes the expectation, and
Figure BDA0002524363890000133
is the set of all random variables and hidden variables; q(Θ) is an approximation of the true posterior distribution p(Θ|X), namely the variational posterior; the variational posterior q(Θ) is expressed as follows:
Figure BDA0002524363890000138
The mixture components are then truncated from an infinite-dimensional space to a K-dimensional space using a truncation technique:
π′_K = 1,
Figure BDA0002524363890000134
π_k = 0 when k > K;
where K is the number of truncation layers, namely the number of categories; the value of K reaches its optimal value at convergence.
Then, all the variational posteriors are optimized by maximizing the variational lower bound L(q):
Figure BDA0002524363890000135
Figure BDA0002524363890000136
Figure BDA0002524363890000137
the hyperparameters in the formula are calculated from formulas (4) to (11):
Figure BDA0002524363890000141
Figure BDA0002524363890000142
Figure BDA0002524363890000143
Figure BDA0002524363890000144
Figure BDA0002524363890000145
Figure BDA0002524363890000146
Figure BDA0002524363890000147
Figure BDA0002524363890000148
Figure BDA0002524363890000149
largest eigenvalue of the above expression (12)
Figure BDA00025243638900001410
eigenvector (13)
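Equations (12) and (13) refer to a largest eigenvalue and an eigenvector. As a hedged illustration only, the following Python sketch shows how such a pair can be extracted from a symmetric matrix. The matrix used in the patent is not reproduced in this text, so weighted_scatter is a hypothetical stand-in (a responsibility-weighted scatter matrix for one component), and both function names are assumptions of this sketch.

```python
import numpy as np

def dominant_eigenpair(S):
    """Return the largest eigenvalue of symmetric S and its unit eigenvector."""
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues come back in ascending order
    return eigvals[-1], eigvecs[:, -1]

def weighted_scatter(X, r_k):
    """Hypothetical input matrix: sum_n r_nk * x_n x_n^T for one component k."""
    return (X * r_k[:, None]).T @ X
```

The sign of the returned eigenvector is arbitrary, which is harmless for axial data because a direction and its opposite are treated as equivalent.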
The expected values in the above equations are calculated by the following equations:
<Z_nk> = r_nk (14)
Figure BDA00025243638900001411
Figure BDA0002524363890000151
<ln π′_k> = Ψ(g_k) - Ψ(g_k + h_k) (17)
<ln(1 - π′_k)> = Ψ(h_k) - Ψ(g_k + h_k) (18)
where Ψ(·) is the digamma function.
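Equations (17) and (18) can be computed directly with a digamma routine, as in the Python sketch below. The second function additionally assumes the stick-breaking relation π_k = π′_k ∏_{j<k} (1 - π′_j), under which <ln π_k> is the sum of <ln π′_k> and the earlier <ln(1 - π′_j)> terms. The function names are hypothetical, and g and h stand for the hyperparameters g_k and h_k.

```python
import numpy as np
from scipy.special import digamma

def expected_log_sticks(g, h):
    """Equations (17) and (18): <ln pi'_k> and <ln(1 - pi'_k)>."""
    g = np.asarray(g, dtype=float)
    h = np.asarray(h, dtype=float)
    e_log_stick = digamma(g) - digamma(g + h)
    e_log_rest = digamma(h) - digamma(g + h)
    return e_log_stick, e_log_rest

def expected_log_weights(g, h):
    """<ln pi_k> = <ln pi'_k> + sum_{j<k} <ln(1 - pi'_j)> (assumes stick-breaking)."""
    e_log_stick, e_log_rest = expected_log_sticks(g, h)
    earlier = np.concatenate(([0.0], np.cumsum(e_log_rest[:-1])))
    return e_log_stick + earlier
```

For example, with g_k = h_k = 1 both expectations equal Ψ(1) - Ψ(2) = -1, since Ψ(2) = Ψ(1) + 1.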
Then, the updated expected value is obtained
Figure BDA0002524363890000152
Obtaining the updated expected value by equations (14) to (18)
Figure BDA0002524363890000153
And then, obtaining a variation lower bound generated by the current iteration.
After obtaining the updated model parameters and the expected values, the lower bounds of variation generated by the current iteration can be obtained by equations (1) - (18).
And finally, comparing the variation lower bound generated by the current iteration with the variation lower bound generated by the last iteration to judge whether the nonparametric Watson mixed model converges.
Specifically, it is judged whether the difference between the variational lower bound generated by the current iteration and that generated by the previous iteration is smaller than a preset threshold; if yes, it is judged that the nonparametric Watson mixed model has converged and that the number of truncation layers has reached its optimal value; if not, it is judged that the nonparametric Watson mixed model has not converged, and the next iteration is performed.
In a preferred embodiment of the present invention, the preset threshold may be 0.0001, but it should be noted that the preset threshold may also be other values, and the smaller the preset threshold is, the higher the iteration precision is, and the present invention is not limited specifically.
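The convergence test of steps S103 and S104 can be sketched as the loop below. This is an illustrative Python sketch under stated assumptions: update_step is a hypothetical callable that performs one round of variational updates and returns the current lower bound, and max_iter is a safeguard added here that the embodiment does not mention.

```python
def iterate_until_converged(update_step, tol=1e-4, max_iter=1000):
    """Repeat update_step() until the lower bound changes by less than tol.

    update_step: hypothetical callable returning the current lower bound L(q)
    after one round of variational updates.
    Returns (final_bound, number_of_iterations).
    """
    previous = update_step()
    for iteration in range(2, max_iter + 1):
        current = update_step()
        if abs(current - previous) < tol:
            return current, iteration  # converged: proceed to step S105
        previous = current
    return previous, max_iter  # stopped by the safeguard, not by convergence
```

A smaller tol corresponds to a stricter stopping rule and typically to more iterations.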
And S105, judging the category of each gene expression data vector according to the posterior probability of the indicator factor, and clustering the gene expression data vectors according to the category.
In this embodiment, after the model converges, the posterior probability r_nk of the indicator factor is obtained from the converged model parameters. The posterior probability r_nk represents the probability that the nth gene expression data vector
Figure BDA0002524363890000162
belongs to the kth class. According to r_nk, the class with the highest probability is selected as the class of the gene expression data vector
Figure BDA0002524363890000163
Based on the classes assigned in this way, the different gene expression data vectors in the gene data set can be clustered.
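Step S105 amounts to a row-wise argmax over the posterior probabilities. A minimal Python sketch follows; the function name assign_clusters is hypothetical, and counting the classes that actually receive vectors is an addition of this sketch for illustration.

```python
import numpy as np

def assign_clusters(r):
    """Hard-assign each vector n to the class k with the largest r_nk.

    r: (N, K) array of posterior probabilities of the indicator factor.
    Returns the label of each vector and the number of classes actually used.
    """
    r = np.asarray(r, dtype=float)
    labels = np.argmax(r, axis=1)
    # Classes that receive no vectors are effectively unused after truncation.
    n_used = np.unique(labels).size
    return labels, n_used
```

For example, for the rows (0.7, 0.2, 0.1), (0.1, 0.1, 0.8) and (0.6, 0.3, 0.1) the labels are 0, 2 and 0, so two of the three classes are used.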
In order to facilitate understanding of the present invention, the following description will be given of an application of the present embodiment as a practical example.
In this example, verification will be performed on two public gene datasets (Diauxic Shift dataset and Yeast Cell Cycle dataset).
In this embodiment, Windows 10 is used as the experimental platform and MATLAB as the programming language; the parameter settings are those described in the embodiments of the present invention. Clustering results are measured by Normalized Mutual Information (NMI).
In this embodiment, the proposed gene data clustering method (PYP-WMM) is compared with the classic clustering algorithm K-means and with a method based on the DP process and a Watson mixed model (DP-WMM). Each method was run 10 times and the average was taken as the comparison index. The experimental results are shown in Table 1. The comparison shows that the invention obtains better gene expression data clustering results (higher NMI values) than both the classic clustering algorithm K-means and the prior similar technique.
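For readers who wish to reproduce the evaluation metric, NMI can be computed as sketched below. This Python sketch is not the MATLAB code of the experiment; the embodiment does not state which normalization it uses, so the arithmetic mean of the two entropies is assumed here, and the function name is hypothetical.

```python
import numpy as np

def normalized_mutual_information(labels_true, labels_pred):
    """NMI = I(U; V) / ((H(U) + H(V)) / 2); returns 1.0 if both entropies are 0.

    The arithmetic-mean normalization is an assumption of this sketch.
    """
    _, u = np.unique(labels_true, return_inverse=True)
    _, v = np.unique(labels_pred, return_inverse=True)
    u, v = u.ravel(), v.ravel()
    # Joint distribution of the two labelings from the contingency table.
    contingency = np.zeros((u.max() + 1, v.max() + 1))
    np.add.at(contingency, (u, v), 1.0)
    p_uv = contingency / u.size
    p_u = p_uv.sum(axis=1)
    p_v = p_uv.sum(axis=0)
    nz = p_uv > 0
    mutual_info = np.sum(p_uv[nz] * np.log(p_uv[nz] / np.outer(p_u, p_v)[nz]))
    h_u = -np.sum(p_u * np.log(p_u))
    h_v = -np.sum(p_v * np.log(p_v))
    if h_u + h_v == 0.0:
        return 1.0
    return float(2.0 * mutual_info / (h_u + h_v))
```

NMI does not depend on how the classes are numbered, so a clustering that matches the reference labels up to a renaming of classes scores 1, while an independent labeling scores 0.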
TABLE 1
Figure BDA0002524363890000161
In summary, the embodiment constructs a non-parametric hybrid model based on Watson probability distribution by using a non-parametric model framework based on the Pitman-Yor process, and performs cluster analysis on the gene expression data vector by using the model. The Watson hybrid model constructed in the embodiment refers to a probability distribution formed by a weighted combination mode of a plurality of Watson probability distributions. In the method, each piece of gene expression data is preprocessed and then normalized by using an L2 norm, and then a Watson mixed model is used for modeling the obtained gene data vector. In order to flexibly and automatically adjust the number of gene data categories according to the size of data, the method uses a nonparametric model framework named as Pitman-Yor process to construct a nonparametric mixed model based on Watson probability distribution. The parameters of the proposed nonparametric Watson hybrid model are estimated by a variational Bayesian inference algorithm. Compared with the prior art, the mixture model based on the Pitman-Yor process provided by the embodiment has discount parameters which can be used for controlling the generation of new category quantities, so that the method is more advantageous than a method based on a DP mixture model when processing unbalanced data, and a better clustering result can be obtained.
The second embodiment of the present invention further provides a gene expression data clustering device of a nonparametric Watson mixture model, including:
a gene data set obtaining unit 210, configured to obtain a gene data set to be clustered; wherein the gene data set comprises N gene expression data vectors;
a modeling unit 220, configured to model the gene expression data vectors using a nonparametric Watson mixture model;
a model parameter estimation unit 230, configured to estimate model parameters of the nonparametric Watson mixture model by a variational Bayesian inference algorithm;
a convergence judging unit 240, configured to judge whether the nonparametric Watson mixture model converges according to the estimated model parameters; if not, notifying the model parameter estimation unit, and if yes, notifying the classification unit;
and the classification unit 250 is configured to judge a category to which each gene expression data vector belongs according to the posterior probability of the indicator, so as to cluster the gene expression data vectors according to the category to which the gene expression data vectors belong.
An embodiment of the invention also provides gene expression data clustering equipment of the nonparametric Watson mixture model, comprising a memory and a processor, wherein the memory stores a gene data set to be clustered and a computer program, and the computer program is executable by the processor to implement the gene expression data clustering method of the nonparametric Watson mixture model.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus and method embodiments described above are illustrative only, as the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part.
If the functions are implemented in the form of software functional modules and sold or used as separate products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A gene expression data clustering method of a nonparametric Watson mixture model, characterized by comprising the following steps:
S101, acquiring a gene data set to be clustered; wherein the gene data set comprises N gene expression data vectors;
S102, modeling the gene expression data vectors using a nonparametric Watson mixture model, which specifically includes: for a D-dimensional vector obeying the Watson probability distribution
Figure FDA0003973212420000011
defining its probability density function as:
Figure FDA0003973212420000012
wherein
Figure FDA0003973212420000013
is a data set containing N gene expression data vectors,
Figure FDA0003973212420000014
is a position parameter satisfying the condition
Figure FDA0003973212420000015
‖·‖ denotes the L2 norm; γ is a scale parameter satisfying the condition γ > 0; Γ(·) is the Gamma function; and M(·) is the Kummer function;
each gene expression data vector is assumed to obey the nonparametric Watson mixture model, whose probability density function is expressed as:
Figure FDA0003973212420000016
the nonparametric Watson mixture model consists of infinitely many mixture components, each mixture component corresponding to one Watson probability distribution
Figure FDA0003973212420000017
Figure FDA0003973212420000018
is the parameter of the k-th mixture component, and π_k > 0 is the corresponding mixing coefficient, satisfying the condition
Figure FDA0003973212420000019
assigning to each gene expression data vector
Figure FDA00039732124200000110
a binary hidden variable
Figure FDA00039732124200000111
as an indicator factor: Z_nk = 1 indicates that the gene expression data vector
Figure FDA00039732124200000112
belongs to the k-th category; otherwise, Z_nk = 0; wherein the hidden variable
Figure FDA00039732124200000113
has the probability distribution
Figure FDA00039732124200000114
for the parameters of the nonparametric Watson mixture model
Figure FDA00039732124200000115
and
Figure FDA00039732124200000116
assigning prior probability distributions; wherein the Watson-Gamma distribution is adopted for the parameters
Figure FDA00039732124200000117
as their joint prior distribution:
Figure FDA00039732124200000118
wherein p_g(·) is the Gamma distribution;
obtaining the total probability expression of the nonparametric Watson mixture model based on the Pitman-Yor process model:
Figure FDA0003973212420000021
S103, estimating model parameters of the nonparametric Watson mixture model by a variational Bayesian inference algorithm;
S104, judging whether the nonparametric Watson mixture model converges according to the estimated model parameters; if not, returning to step S103, and if yes, executing step S105;
and S105, judging the category to which each gene expression data vector belongs according to the posterior probability of the indicator factor, and clustering the gene expression data vectors according to the category.
2. The method for clustering gene expression data of a non-parametric Watson mixture model according to claim 1,
the nonparametric Watson mixture model is constructed on the basis of a Pitman-Yor process model using the stick-breaking representation; in the Pitman-Yor process model based on the stick-breaking representation,
the mixing coefficient π_k is expressed as follows:
Figure FDA0003973212420000031
Figure FDA0003973212420000032
follows the Beta distribution, expressed as follows:
Figure FDA0003973212420000033
wherein p_b(·) is the Beta distribution, ζ is the discount parameter of the Pitman-Yor process model and satisfies the condition 0 ≤ ζ ≤ 1, and ξ is the density parameter and satisfies the condition ξ > -ζ.
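For illustration only, the stick-breaking construction can be sketched as below. The formula images are not reproduced in this text, so the sketch assumes the standard Pitman-Yor construction: each stick fraction π′_k is drawn from Beta(1 - ζ, ξ + kζ), and π_k equals π′_k times the product of (1 - π′_j) over j < k. The truncation π′_K = 1 follows claim 4; the function name, argument names, and the use of Python's `random` module are hypothetical.

```python
import random


def pitman_yor_stick_breaking(discount, density, truncation, rng=None):
    """Draw truncated stick-breaking mixing coefficients for a Pitman-Yor process.

    discount   : zeta; sampling requires zeta < 1 so that 1 - zeta > 0
    density    : xi, with xi > -zeta
    truncation : K, the truncation level (pi'_K is fixed to 1)
    """
    if not (0.0 <= discount < 1.0) or density <= -discount:
        raise ValueError("need 0 <= discount < 1 and density > -discount")
    rng = rng or random.Random()
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for k in range(1, truncation + 1):
        if k == truncation:
            fraction = 1.0  # truncation: the last component takes what is left
        else:
            fraction = rng.betavariate(1.0 - discount, density + k * discount)
        weights.append(fraction * remaining)
        remaining *= 1.0 - fraction
    return weights
```

For example, `pitman_yor_stick_breaking(0.5, 0.5, 15, random.Random(0))` returns 15 non-negative coefficients that sum to 1, matching the initial values ζ_k = 0.5, ξ_k = 0.5 and K = 15 given in claim 3. A larger discount pushes more mass toward later components, which is the mechanism the description credits for handling unbalanced data.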
3. The method for clustering gene expression data of a nonparametric Watson mixture model according to claim 2,
wherein estimating the model parameters of the nonparametric Watson mixture model by the variational Bayesian inference algorithm, and
judging whether the nonparametric Watson mixture model converges according to the estimated model parameters,
specifically comprise:
initializing the model parameters, including: initializing the truncation level K = 15; initializing the hyperparameters 0 < a_k < 1, 0 < b_k < 1,
Figure FDA0003973212420000034
β_k = 1, ζ_k = 0.5, ξ_k = 0.5; initializing r_nk using the K-Means algorithm; and initializing
Figure FDA0003973212420000035
Figure FDA0003973212420000036
updating the variational posteriors and the expected values using the current model parameters;
obtaining, from the updated expected values, the updated values of
Figure FDA0003973212420000037
obtaining the variational lower bound generated by the current iteration;
and comparing the variational lower bound generated by the current iteration with the variational lower bound generated by the previous iteration to judge whether the nonparametric Watson mixture model converges.
4. The method for clustering gene expression data of a nonparametric Watson mixture model according to claim 3, wherein updating the variational posteriors and the expected values using the current model parameters specifically comprises:
defining the variational lower bound as:
L(q) = <ln p(Θ|X)> - <ln q(Θ)>
wherein <·> denotes the expected value,
Figure FDA0003973212420000041
is the set of all random variables and hidden variables; q(Θ) is an approximation of the true posterior distribution p(Θ|X), namely the variational posterior; the variational posterior q(Θ) is expressed as follows:
Figure FDA0003973212420000042
truncating the mixture components from an infinite-dimensional space to a K-dimensional space using a truncation technique:
π′_K = 1,
Figure FDA0003973212420000043
π_k = 0 when k > K;
wherein K is the truncation level, that is, the number of categories; the value of K reaches an optimal value at convergence;
optimizing all variational posteriors by maximizing the variational lower bound L(q):
Figure FDA0003973212420000044
Figure FDA0003973212420000045
Figure FDA0003973212420000046
wherein the hyperparameters in the above formulas are calculated by formulas (4) to (11):
Figure FDA0003973212420000047
Figure FDA0003973212420000048
Figure FDA0003973212420000049
Figure FDA00039732124200000410
Figure FDA00039732124200000411
Figure FDA0003973212420000051
Figure FDA0003973212420000052
Figure FDA0003973212420000053
Figure FDA0003973212420000054
largest eigenvalue (12)
Figure FDA0003973212420000055
eigenvector (13)
The expected values in the above formulas are calculated as follows:
<Z_nk> = r_nk (14)
Figure FDA0003973212420000056
Figure FDA0003973212420000057
<ln π′_k> = Ψ(g_k) - Ψ(g_k + h_k) (17)
<ln(1 - π′_k)> = Ψ(h_k) - Ψ(g_k + h_k) (18)
where Ψ(·) is the digamma function.
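As a hedged sketch of how formulas (17) and (18) can be used, the code below computes the expected log mixing coefficients from the hyperparameters g_k and h_k. It assumes that each π′_k has a Beta(g_k, h_k) variational posterior, which is what formulas (17) and (18) imply, and that π_k follows the stick-breaking product, so that <ln π_k> = <ln π′_k> + Σ_{j<k} <ln(1 - π′_j)>. The digamma approximation (recurrence plus asymptotic expansion) and the function names are illustrative and not taken from the patent.

```python
import math


def digamma(x):
    """Digamma function for x > 0, via recurrence and an asymptotic expansion."""
    result = 0.0
    # Shift x upward with psi(x) = psi(x + 1) - 1/x until the expansion is accurate.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    result += (math.log(x) - 0.5 * inv
               - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))
    return result


def expected_log_mixing_coefficients(g, h):
    """Return <ln pi_k> for k = 1..K given Beta hyperparameters g_k, h_k."""
    expected = []
    running = 0.0  # accumulates <ln(1 - pi'_j)> for all j < k, formula (18)
    for g_k, h_k in zip(g, h):
        psi_sum = digamma(g_k + h_k)
        expected.append(digamma(g_k) - psi_sum + running)  # formula (17) plus prefix
        running += digamma(h_k) - psi_sum
    return expected
```

These expected values are the quantities that enter the update of r_nk in a variational scheme of this kind; the remaining terms of that update depend on formulas that are not reproduced in this text.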
5. The method for clustering gene expression data of a nonparametric Watson mixture model according to claim 3,
wherein comparing the variational lower bound generated by the current iteration with the variational lower bound generated by the previous iteration to judge whether the nonparametric Watson mixture model converges specifically comprises:
judging whether the difference between the variational lower bound generated by the current iteration and the variational lower bound generated by the previous iteration is smaller than a preset threshold, the preset threshold being 0.0001;
if yes, judging that the nonparametric Watson mixture model converges;
and if not, judging that the nonparametric Watson mixture model does not converge.
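The convergence test above can be sketched as the following loop. The callable `update_step`, which is assumed to perform one round of variational updates and return the resulting variational lower bound, and the iteration cap `max_iters` are hypothetical additions for illustration; only the threshold 0.0001 comes from the claim.

```python
def run_until_converged(update_step, threshold=1e-4, max_iters=1000):
    """Iterate until the variational lower bound changes by less than threshold.

    Returns (last_lower_bound, iterations_run, converged).
    """
    previous = None
    for iteration in range(1, max_iters + 1):
        current = update_step()  # one update round, returns the lower bound
        if previous is not None and abs(current - previous) < threshold:
            return current, iteration, True
        previous = current
    return previous, max_iters, False
```

The cap on iterations is a practical safeguard so that the loop terminates even if the bound keeps changing by more than the threshold.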
6. The method for clustering gene expression data of a nonparametric Watson mixture model according to claim 4, wherein judging the category to which each gene expression data vector belongs according to the posterior probability of the indicator factor, so as to cluster the gene expression data vectors according to the category, specifically comprises:
obtaining the posterior probability r_nk of the indicator factor, r_nk representing the probability that the n-th gene expression data vector
Figure FDA0003973212420000061
belongs to the k-th category;
selecting, for the gene expression data vector
Figure FDA0003973212420000062
the category with the highest probability as its category.
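A minimal sketch of this assignment step follows, assuming the posterior probabilities r_nk are available as a list of N rows with K entries each; the function names are hypothetical.

```python
def assign_categories(responsibilities):
    """For each row of r_nk, return the index k with the highest probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in responsibilities]


def group_by_category(responsibilities):
    """Map each selected category index to the list of vector indices n in it."""
    clusters = {}
    for n, k in enumerate(assign_categories(responsibilities)):
        clusters.setdefault(k, []).append(n)
    return clusters
```

Categories that no vector selects simply do not appear in the result, so the number of occupied categories can be smaller than the truncation level K.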
7. The method for clustering gene expression data of a nonparametric Watson mixture model according to any one of claims 1 to 6, further comprising:
preprocessing the gene expression data vector:
removing data containing empty items in the gene expression data vector;
removing data whose feature values are NaN from the gene expression data vectors;
removing data in the gene expression data vectors that changes little over time;
normalizing the remaining gene expression data vectors using the L2 norm.
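The preprocessing steps above can be sketched as follows. The claim does not say how "changes little over time" is measured; a variance threshold `min_variance` is assumed here purely for illustration, and empty items are assumed to be represented as `None`.

```python
import math


def preprocess(vectors, min_variance=1e-3):
    """Filter gene expression vectors and scale the survivors to unit L2 norm."""
    cleaned = []
    for vec in vectors:
        # Drop vectors containing empty items or NaN values.
        if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in vec):
            continue
        # Drop vectors that barely change (assumed criterion: low variance).
        mean = sum(vec) / len(vec)
        variance = sum((v - mean) ** 2 for v in vec) / len(vec)
        if variance < min_variance:
            continue
        norm = math.sqrt(sum(v * v for v in vec))
        if norm == 0.0:
            continue  # an all-zero vector cannot be normalized
        cleaned.append([v / norm for v in vec])
    return cleaned
```

After this step every remaining vector lies on the unit hypersphere, which is the domain on which the Watson distribution is defined.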
8. A gene expression data clustering device of a nonparametric Watson mixture model, characterized by comprising:
a gene data set acquisition unit, configured to acquire a gene data set to be clustered; wherein the gene data set comprises N gene expression data vectors;
a modeling unit, configured to model the gene expression data vectors using a nonparametric Watson mixture model; wherein the modeling unit is specifically configured to: for a D-dimensional vector obeying the Watson probability distribution
Figure FDA0003973212420000063
define its probability density function as:
Figure FDA0003973212420000064
wherein
Figure FDA0003973212420000065
is a data set containing N gene expression data vectors,
Figure FDA0003973212420000066
is a position parameter satisfying the condition
Figure FDA0003973212420000067
‖·‖ denotes the L2 norm; γ is a scale parameter satisfying the condition γ > 0; Γ(·) is the Gamma function; and M(·) is the Kummer function;
each gene expression data vector is assumed to obey the nonparametric Watson mixture model, whose probability density function is expressed as:
Figure FDA0003973212420000071
the nonparametric Watson mixture model consists of infinitely many mixture components, each mixture component corresponding to one Watson probability distribution
Figure FDA0003973212420000072
Figure FDA0003973212420000073
is the parameter of the k-th mixture component, and π_k > 0 is the corresponding mixing coefficient, satisfying the condition
Figure FDA0003973212420000074
assign to each gene expression data vector
Figure FDA0003973212420000075
a binary hidden variable
Figure FDA0003973212420000076
as an indicator factor: Z_nk = 1 indicates that the gene expression data vector
Figure FDA0003973212420000077
belongs to the k-th category; otherwise, Z_nk = 0; wherein the hidden variable
Figure FDA0003973212420000078
has the probability distribution
Figure FDA0003973212420000079
for the parameters of the nonparametric Watson mixture model
Figure FDA00039732124200000710
and
Figure FDA00039732124200000711
assign prior probability distributions; wherein the Watson-Gamma distribution is adopted for the parameters
Figure FDA00039732124200000712
as their joint prior distribution:
Figure FDA00039732124200000713
wherein p_g(·) is the Gamma distribution;
obtain the total probability expression of the nonparametric Watson mixture model based on the Pitman-Yor process model:
Figure FDA00039732124200000714
a model parameter estimation unit, configured to estimate the model parameters of the nonparametric Watson mixture model by a variational Bayesian inference algorithm;
a convergence judging unit, configured to judge whether the nonparametric Watson mixture model converges according to the estimated model parameters; if not, notifying the model parameter estimation unit, and if yes, notifying the classification unit;
and a classification unit, configured to judge the category to which each gene expression data vector belongs according to the posterior probability of the indicator factor, so as to cluster the gene expression data vectors according to the category.
9. Gene expression data clustering equipment of a nonparametric Watson mixture model, comprising a memory and a processor, wherein the memory stores a gene data set to be clustered and a computer program, and the computer program is executable by the processor to implement the gene expression data clustering method of the nonparametric Watson mixture model according to any one of claims 1 to 7.
CN202010499785.7A 2020-06-04 2020-06-04 Gene expression data clustering method, device and equipment of nonparametric Watson mixed model Active CN111612101B (en)

Publications (2)

Publication Number Publication Date
CN111612101A CN111612101A (en) 2020-09-01
CN111612101B true CN111612101B (en) 2023-02-07

Family

ID=72199012

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant