CN108875305A - A kind of leukaemia cancer cell detector of colony intelligence optimizing - Google Patents

A kind of leukaemia cancer cell detector of colony intelligence optimizing Download PDF

Info

Publication number
CN108875305A
CN108875305A (application number CN201810458511.6A)
Authority
CN
China
Prior art keywords
gene
value
module
sample
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810458511.6A
Other languages
Chinese (zh)
Inventor
刘兴高
高信腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201810458511.6A priority Critical patent/CN108875305A/en
Publication of CN108875305A publication Critical patent/CN108875305A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/12Computing arrangements based on biological models using genetic models
    • G06N3/126Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Abstract

The invention discloses a leukaemia cancer cell detector based on swarm intelligence optimization. The detector consists of a gene microarray reading module, a data preprocessing and feature ranking module, a parameter optimization module, and a model output module. The system first preprocesses the input gene microarray data and then ranks the remaining genes by importance: correlation is measured with a statistical score, a classifier criterion function is used to compute each gene's contribution, and all genes are ranked accordingly. The improved optimization method adds fitness monitoring and population perturbation to the original swarm intelligence optimization algorithm, which prevents the loss of population diversity and keeps the optimization process from falling into local optima. The optimal parameters found by the search are then used as classifier parameters to complete model construction and output the result. The system is fast and suitable for online detection.

Description

A leukaemia cancer cell detector based on swarm intelligence optimization
Technical field
The present invention relates to the technical field of gene microarray data applications, and in particular to a leukaemia cancer cell detector based on swarm intelligence optimization.
Background technique
Biochip technology uses micromachining and photolithography to integrate the discontinuous analytical processes of the life sciences onto the surface of a silicon or glass chip, forming a miniature biochemical analysis system based on specific molecular interactions. This enables accurate, rapid, high-throughput detection of cells, proteins, genes and other biological components. According to the biological material immobilized on the chip, biochips can be divided into gene chips, protein chips, polysaccharide chips and neuron chips. At present, the most successful biochip format is the "microarray", which takes gene sequences as its analysis object and is also called a gene chip or DNA chip. Leukaemia can be divided into acute and chronic leukaemia according to the urgency of onset. In acute leukaemia, cell differentiation is arrested at an early stage and blast and promyelocytic cells dominate; the disease progresses rapidly, with a course of only months. In chronic leukaemia, cell differentiation is better and immature or mature cells dominate; it develops slowly, with a course of several years. By the lineage of the diseased cells, leukaemia is classified into the myeloid, monocytic, erythroid and megakaryocytic lineages and the T- and B-cell lineages of the lymphatic system. Clinically, leukaemia is often divided into lymphocytic leukaemia, myelocytic leukaemia, mixed-cell leukaemia and so on. Acute leukaemia in particular has a severe impact: patients usually die within three months of onset. Historically, the detection of acute leukaemia has been very difficult and hard to make accurate. With DNA microarray technology, scientists now expect breakthroughs in this field. However, such data are high-dimensional with few samples, a typical "curse of dimensionality" problem that is difficult for ordinary classifiers and parameter tests. How to overcome it is currently a major research hotspot.
Summary of the invention
In order to overcome the current difficulty of finding the optimal feature subset of gene microarray data and the optimal classification parameters, the purpose of the present invention is to provide a leukaemia cancer cell detector based on swarm intelligence optimization.
The technical solution adopted by the present invention to solve the technical problem is as follows. A leukaemia cancer cell detector based on swarm intelligence optimization: the system consists of a gene microarray reading module, a data preprocessing and feature ranking module, a parameter optimization module, and a model output module, wherein:
The gene microarray reading module reads the class labels of all gene microarrays, Y = [y_1, y_2, ..., y_m], where y_i = k with k ∈ {-1, 1}, together with the gene expression values of all samples, arranged as an m × n matrix X.
Each row x_i of X holds the expression values of all genes for one sample, and each column x^j holds the expression values of one gene across all samples; the subscript i indexes the m samples and the index j the n genes.
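The patent gives no code; as an illustration only, reading such an m × n expression matrix together with its label column might look like the following sketch. The CSV layout (label first, then n expression values per row) is an assumption, not something the patent specifies.

```python
import numpy as np

def read_microarray(path):
    """Load a gene microarray CSV: the first column is the class label
    y_i in {-1, +1}, the remaining n columns are gene expression values."""
    data = np.loadtxt(path, delimiter=",")
    y = data[:, 0].astype(int)   # labels Y = [y_1, ..., y_m]
    X = data[:, 1:]              # m x n expression matrix
    return X, y
```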
The data preprocessing and feature ranking module normalizes the raw microarray data that has been read in and then ranks the features. The normalization maps every expression value into [0, 1]:

x' = (x - Min) / (Max - Min),

where Min and Max are respectively the minimum and maximum of the sample's gene expression values. Feature ranking is realized by scoring the contribution of each gene to classification accuracy; in support vector machine (SVM) theory, a contribution degree function is defined:

J = (1/2) α^T H α - α^T I
where H_ij = y_i y_j K(x_i, x_j), α = [α_1, ..., α_n] are the coefficients corresponding to the normal vector, H is an intermediate matrix, J is the cost function, I is the all-ones (unit) vector, K is the kernel function, y is the label value, x is the sample feature vector, the superscript T denotes matrix transpose, and the subscripts i and j index the samples. In fact, this expression represents the squared size of the classification margin, and the objective of the SVM is to minimize it. When a linear kernel is used as the SVM kernel function,

||w*||^2 = α*^T H α*,

where w* is the optimal normal vector and α* are the coefficients corresponding to the optimal support vectors. Observing this expression, the importance of each feature is determined by its contribution to the cost function; that is, the contribution of the j-th feature is δ_j = (w*_j)^2, where δ denotes the contribution degree.
When a nonlinear kernel is used as the kernel function, the contribution can generally be approximated as

DJ(i) = (1/2) α^T H α - (1/2) α^T H(-i) α,

where, on the reasonable assumption that α is unchanged after a feature is eliminated, H(-i) denotes the H matrix recomputed with that feature removed; with this assumption the results differ little from those of the linear kernel. This formula can be applied in a loop to compute the contribution of each feature and so rank the genes by importance.
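For the linear-kernel case, the ranking criterion δ_j = w_j^2 can be sketched as follows. The patent does not name a solver; the subgradient-descent linear SVM below (hinge loss plus L2 penalty) is a stand-in for illustration, and the learning-rate and epoch values are assumptions.

```python
import numpy as np

def rank_genes(X, y, lr=0.1, epochs=200, lam=0.01):
    """Rank genes by contribution degree delta_j = w_j**2, the squared
    weight of each feature in a linear SVM trained by subgradient
    descent on the regularized hinge loss (illustrative solver)."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        margins = y * (X @ w)
        mask = margins < 1                       # margin-violating samples
        if mask.any():
            grad = lam * w - (y[mask, None] * X[mask]).mean(axis=0)
        else:
            grad = lam * w
        w -= lr * grad
    delta = w ** 2                               # contribution of each gene
    return np.argsort(delta)[::-1]               # most important gene first
```

On a toy set where only the first gene separates the classes, the first gene should come out on top of the ranking.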
For a set of training samples {x_n, t_n} (n = 1, 2, ..., N, x ∈ R^d, t ∈ {0, 1}), where x denotes a training sample and t its target label, the classification function of the relevance vector machine (RVM) is defined as

y(x; wr) = Σ_{i=1}^{N} wr_i K(x, x_i) + wr_0,

where K(x, x_i) is the kernel function, wr_i are the weights and wr_0 is the bias weight. Applying the logistic sigmoid link function σ(a) = 1/(1 + e^(-a)) to this, the likelihood p(t | wr) of the data set is estimated as

p(t | wr) = Π_{n=1}^{N} σ{y(x_n; wr)}^{t_n} [1 - σ{y(x_n; wr)}]^{1-t_n},

where σ denotes the logistic sigmoid. To avoid overfitting, the RVM places a Gaussian prior probability constraint on each weight wr_i:

p(wr | α) = Π_i N(wr_i | 0, α_i^{-1}).

According to the Laplace approximation, an approximation of the posterior is obtained as follows. First α is fixed and the weights wr_MP of maximum posterior probability are sought; since p(wr | t, α) is proportional to p(t | wr) p(wr | α), wr_MP is found with a second-order Newton method from the gradient

∇ log p(wr | t, α) = Φ^T (t - y) - A·wr,

where y_n = σ{y(x_n; wr)} and A = diag(α_0, α_1, ..., α_N). Then, by the Laplace method, differentiating this expression a second time gives

∇∇ log p(wr | t, α) = -(Φ^T B Φ + A),

where Φ is the N × (N+1) design matrix, Φ = [φ(x_1), φ(x_2), ..., φ(x_N)]^T with φ(x_n) = [1, K(x_n, x_1), K(x_n, x_2), ..., K(x_n, x_N)]^T, and B = diag(β_1, β_2, ..., β_N) is a diagonal matrix with β_n = σ{y(x_n)}[1 - σ{y(x_n)}]. Negating the right-hand side and inverting gives the covariance matrix Σ = (Φ^T B Φ + A)^{-1}. Using Σ and wr_MP, the hyperparameters α are then updated:

α_i^{new} = γ_i / wr_{MP,i}^2, with γ_i = 1 - α_i Σ_ii, and
wr_MP = Σ Φ^T B t.

During iteration most of the α_i approach infinity, so their corresponding wr_i become essentially zero; their basis functions can then be pruned away, which yields sparsity.
The parameter optimization module is designed with an improved swarm intelligence optimization algorithm that increases the diversity of the population. The specific design is as follows:
1) Initialize the population of the differential evolution (DE) algorithm. In the population, individuals are generated randomly:

x_{j,i}(0) = x_j^L + rand(0, 1) · (x_j^U - x_j^L),

where x_i(0) is the i-th individual of the initial population and x_{j,i}(0) the value of the j-th chromosome gene of the i-th initial individual, rand(0, 1) is a uniform random number in the interval (0, 1), NP is the population size, and the superscripts L and U denote the lower and upper bounds.
2) Mutation: the DE algorithm is distinguished from the genetic algorithm (GA) by its difference strategy: the difference of two randomly selected individuals is scaled and summed with a target individual, that is,

v_i(g+1) = x_{r1}(g) + F · (x_{r2}(g) - x_{r3}(g)), i ≠ r1 ≠ r2 ≠ r3,

where g denotes the g-th generation, F is the scaling factor applied to the difference of the two random vectors, v_i(g+1) is the mutation intermediate variable, and x_{r1}(g), x_{r2}(g), x_{r3}(g) are the r1-th, r2-th and r3-th individuals of generation g.
3) Crossover: the generation-g population x_i(g) is crossed with the intermediate variable v_i(g+1) generated in step 2) to produce

u_{j,i}(g+1) = v_{j,i}(g+1) if rand(0, 1) ≤ CR, otherwise x_{j,i}(g),

where CR is the preset crossover rate and u_{j,i}(g+1) is the crossover intermediate variable.
4) Selection: differential evolution retains individuals for the next generation with the usual greedy rule: if the fitness f(u_i(g+1)) of the individual produced by crossover is greater than the fitness f(x_i(g)) of the previous generation it is retained; otherwise the population is unchanged, i.e.

x_i(g+1) = u_i(g+1) if f(u_i(g+1)) > f(x_i(g)), otherwise x_i(g).
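Steps 1) to 4) form one generation of a standard DE/rand/1/bin loop; a minimal sketch follows. The fitness function (the sphere function, minimized) and all parameter values are illustrative assumptions, not the patent's.

```python
import numpy as np

def de_step(pop, fitness, F=0.5, CR=0.9, rng=None):
    """One generation of DE/rand/1/bin: mutation, crossover, greedy selection
    (here written for minimization)."""
    rng = rng or np.random.default_rng()
    NP, n = pop.shape
    new_pop = pop.copy()
    for i in range(NP):
        r1, r2, r3 = rng.choice([k for k in range(NP) if k != i], 3, replace=False)
        v = pop[r1] + F * (pop[r2] - pop[r3])          # mutation
        cross = rng.random(n) <= CR
        cross[rng.integers(n)] = True                  # force at least one gene over
        u = np.where(cross, v, pop[i])                 # crossover
        if fitness(u) < fitness(pop[i]):               # greedy selection
            new_pop[i] = u
    return new_pop

# usage: minimize the sphere function on a small population
rng = np.random.default_rng(0)
pop = rng.uniform(-5, 5, (20, 3))
sphere = lambda x: float(np.sum(x ** 2))
for _ in range(100):
    pop = de_step(pop, sphere, rng=rng)
best = min(pop, key=sphere)
```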
To avoid premature convergence, an adaptive operator λ is designed that scales the mutation factor over the run. Here G_max denotes the maximum number of iterations, G the current iteration number, and F_0 the base mutation operator; the resulting factor is large in the early stage, guaranteeing population diversity, and gradually decreases later, with the intent of preserving the good information accumulated during evolution. In the differential evolution algorithm, if the fitness fails to exceed the historical best for a certain number of iterations, the search is considered to have fallen into a local optimum; at that point the swarm intelligence (ant colony) algorithm is used to jump out of the differential evolution algorithm:
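The patent's exact expression for λ is not reproduced in the text. One commonly used adaptive-DE schedule with the described behaviour (mutation factor large early, decaying toward F_0 late) is sketched below; the formula itself is an assumption.

```python
import math

def adaptive_F(G, G_max, F0=0.5):
    """Adaptive mutation factor: starts near 2*F0 at G = 1 and decays
    toward F0 as G approaches G_max. This is one common schedule with
    the behaviour the patent describes; the exact patent formula is
    not reproduced here."""
    lam = math.exp(1.0 - G_max / (G_max + 1.0 - G))
    return F0 * (2.0 ** lam)
```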
5) The current position information is passed to the ant colony algorithm for initialization, where the number of ant individuals is M and the pheromone concentration is τ_ij = c (c > 0).
6) All ants 1, 2, ..., M are simulated moving toward the goal; the probability p_ij^k that ant k moves from its current position i to the next position j is

p_ij^k = τ_ij / Σ_{s ∈ allowed_k} τ_is,

where allowed_k is the set of positions still available to ant k.
7) When one iteration is complete, i.e. when all ants have covered their paths, the current pheromone concentration is updated:

τ_ij ← (1 - ρ) τ_ij + Σ_{k=1}^{M} Δτ_ij^k,

where ρ is the pheromone volatilization coefficient and Δτ_ij^k is the pheromone concentration left by ant k on path ij. Since pheromone concentration is inversely proportional to path length, it can be defined as

Δτ_ij^k = C / L_k,

where C is a proportionality constant and L_k is the length of ant k's path.
8) After new candidate solutions are obtained, they are compared with the historical best and the historical best is updated.
9) The above procedure is iterated until the maximum number of generations is reached. The historical best parameters are then fed into the model output module as the final result of the parameter optimization.
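Steps 5) to 9) amount to the classical ant-system update; a minimal sketch on a toy path graph follows. The graph, the parameter values and the open-tour fitness are illustrative assumptions.

```python
import numpy as np

def ant_colony(dist, n_ants=10, n_iter=50, rho=0.5, C=1.0, seed=0):
    """Classical ant system on a small graph: each ant moves with
    probability proportional to the pheromone tau_ij, pheromone
    evaporates at rate rho, and each ant deposits delta_tau = C / L_k
    on the edges of its path."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    tau = np.ones((n, n))                      # tau_ij = c > 0
    best_tour, best_len = None, np.inf
    for _ in range(n_iter):
        delta = np.zeros((n, n))
        for _k in range(n_ants):
            tour = [0]
            while len(tour) < n:               # visit every node once
                i = tour[-1]
                allowed = [j for j in range(n) if j not in tour]
                p = tau[i, allowed] / tau[i, allowed].sum()
                tour.append(rng.choice(allowed, p=p))
            L = sum(dist[tour[s], tour[s + 1]] for s in range(n - 1))
            if L < best_len:
                best_len, best_tour = L, tour
            for s in range(n - 1):             # deposit C / L_k on each edge used
                delta[tour[s], tour[s + 1]] += C / L
        tau = (1 - rho) * tau + delta          # evaporation + deposit
    return best_tour, best_len
```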
The model output module applies the model obtained by the above procedure: patient data are input directly, and the result is read off from the label value.
The beneficial effects of the present invention are mainly that the monitored-parameter design in the intelligent search process increases population diversity, thereby increasing the probability of finding the optimal parameters; the system is fast and suitable for online detection.
Detailed description of the invention
Fig. 1 is structural schematic diagram of the invention;
Fig. 2 is flow chart of the invention.
Specific embodiment
The present invention is illustrated below with reference to the accompanying drawings.
Referring to Fig. 1, a leukaemia cancer cell detector based on swarm intelligence optimization consists of a gene microarray reading module 1, a data preprocessing and feature ranking module 2, a parameter optimization module 3, and a model output module 4, wherein:
The gene microarray reading module 1 reads the class labels of all gene microarrays, Y = [y_1, y_2, ..., y_m], where y_i = k with k ∈ {-1, 1}, together with the gene expression values of all samples, arranged as an m × n matrix X. Each row x_i of X holds the expression values of all genes for one sample, and each column x^j holds the expression values of one gene across all samples; the subscript i indexes the m samples and the index j the n genes.
The data preprocessing and feature ranking module 2 normalizes the raw microarray data that has been read in and then ranks the features. The normalization maps every expression value into [0, 1]:

x' = (x - Min) / (Max - Min),

where Min and Max are respectively the minimum and maximum of the sample's gene expression values. Feature ranking is realized by scoring the contribution of each gene to classification accuracy; in support vector machine (SVM) theory, a contribution degree function is defined:

J = (1/2) α^T H α - α^T I,

where H_ij = y_i y_j K(x_i, x_j), α = [α_1, ..., α_n] are the coefficients corresponding to the normal vector, H is an intermediate matrix, J is the cost function, I is the all-ones (unit) vector, K is the kernel function, y is the label value, x is the sample feature vector, the superscript T denotes matrix transpose, and the subscripts i and j index the samples. In fact, this expression represents the squared size of the classification margin, and the objective of the SVM is to minimize it. When a linear kernel is used as the SVM kernel function,

||w*||^2 = α*^T H α*,

where w* is the optimal normal vector and α* are the coefficients corresponding to the optimal support vectors. Observing this expression, the importance of each feature is determined by its contribution to the cost function; that is, the contribution of the j-th feature is δ_j = (w*_j)^2, where δ denotes the contribution degree.
When a nonlinear kernel is used as the kernel function, the contribution can generally be approximated as

DJ(i) = (1/2) α^T H α - (1/2) α^T H(-i) α,

where, on the reasonable assumption that α is unchanged after a feature is eliminated, H(-i) denotes the H matrix recomputed with that feature removed; with this assumption the results differ little from those of the linear kernel. This formula can be applied in a loop to compute the contribution of each feature and so rank the genes by importance.
For a set of training samples {x_n, t_n} (n = 1, 2, ..., N, x ∈ R^d, t ∈ {0, 1}), where x denotes a training sample and t its target label, the classification function of the relevance vector machine (RVM) is defined as

y(x; wr) = Σ_{i=1}^{N} wr_i K(x, x_i) + wr_0,

where K(x, x_i) is the kernel function, wr_i are the weights and wr_0 is the bias weight. Applying the logistic sigmoid link function σ(a) = 1/(1 + e^(-a)) to this, the likelihood p(t | wr) of the data set is estimated as

p(t | wr) = Π_{n=1}^{N} σ{y(x_n; wr)}^{t_n} [1 - σ{y(x_n; wr)}]^{1-t_n},

where σ denotes the logistic sigmoid. To avoid overfitting, the RVM places a Gaussian prior probability constraint on each weight wr_i:

p(wr | α) = Π_i N(wr_i | 0, α_i^{-1}).

According to the Laplace approximation, an approximation of the posterior is obtained as follows. First α is fixed and the weights wr_MP of maximum posterior probability are sought; since p(wr | t, α) is proportional to p(t | wr) p(wr | α), wr_MP is found with a second-order Newton method from the gradient

∇ log p(wr | t, α) = Φ^T (t - y) - A·wr,

where y_n = σ{y(x_n; wr)} and A = diag(α_0, α_1, ..., α_N). Then, by the Laplace method, differentiating this expression a second time gives

∇∇ log p(wr | t, α) = -(Φ^T B Φ + A),

where Φ is the N × (N+1) design matrix, Φ = [φ(x_1), φ(x_2), ..., φ(x_N)]^T with φ(x_n) = [1, K(x_n, x_1), K(x_n, x_2), ..., K(x_n, x_N)]^T, and B = diag(β_1, β_2, ..., β_N) is a diagonal matrix with β_n = σ{y(x_n)}[1 - σ{y(x_n)}]. Negating the right-hand side and inverting gives the covariance matrix Σ = (Φ^T B Φ + A)^{-1}. Using Σ and wr_MP, the hyperparameters α are then updated:

α_i^{new} = γ_i / wr_{MP,i}^2, with γ_i = 1 - α_i Σ_ii, and
wr_MP = Σ Φ^T B t.

During iteration most of the α_i approach infinity, so their corresponding wr_i become essentially zero; their basis functions can then be pruned away, which yields sparsity.
The parameter optimization module 3 is designed with an improved swarm intelligence optimization algorithm that increases the diversity of the population. The specific design is as follows:
1) Initialize the population of the differential evolution (DE) algorithm. In the population, individuals are generated randomly:

x_{j,i}(0) = x_j^L + rand(0, 1) · (x_j^U - x_j^L),

where x_i(0) is the i-th individual of the initial population and x_{j,i}(0) the value of the j-th chromosome gene of the i-th initial individual, rand(0, 1) is a uniform random number in the interval (0, 1), NP is the population size, and the superscripts L and U denote the lower and upper bounds.
2) Mutation: the DE algorithm is distinguished from the genetic algorithm (GA) by its difference strategy: the difference of two randomly selected individuals is scaled and summed with a target individual, that is,

v_i(g+1) = x_{r1}(g) + F · (x_{r2}(g) - x_{r3}(g)), i ≠ r1 ≠ r2 ≠ r3,

where g denotes the g-th generation, F is the scaling factor applied to the difference of the two random vectors, v_i(g+1) is the mutation intermediate variable, and x_{r1}(g), x_{r2}(g), x_{r3}(g) are the r1-th, r2-th and r3-th individuals of generation g.
3) Crossover: the generation-g population x_i(g) is crossed with the intermediate variable v_i(g+1) generated in step 2) to produce

u_{j,i}(g+1) = v_{j,i}(g+1) if rand(0, 1) ≤ CR, otherwise x_{j,i}(g),

where CR is the preset crossover rate and u_{j,i}(g+1) is the crossover intermediate variable.
4) Selection: differential evolution retains individuals for the next generation with the usual greedy rule: if the fitness f(u_i(g+1)) of the individual produced by crossover is greater than the fitness f(x_i(g)) of the previous generation it is retained; otherwise the population is unchanged, i.e.

x_i(g+1) = u_i(g+1) if f(u_i(g+1)) > f(x_i(g)), otherwise x_i(g).

To avoid premature convergence, an adaptive operator λ is designed that scales the mutation factor over the run. Here G_max denotes the maximum number of iterations, G the current iteration number, and F_0 the base mutation operator; the resulting factor is large in the early stage, guaranteeing population diversity, and gradually decreases later, with the intent of preserving the good information accumulated during evolution. In the differential evolution algorithm, if the fitness fails to exceed the historical best for a certain number of iterations, the search is considered to have fallen into a local optimum; at that point the swarm intelligence (ant colony) algorithm is used to jump out of the differential evolution algorithm:
5) The current position information is passed to the ant colony algorithm for initialization, where the number of ant individuals is M and the pheromone concentration is τ_ij = c (c > 0).
6) All ants 1, 2, ..., M are simulated moving toward the goal; the probability p_ij^k that ant k moves from its current position i to the next position j is

p_ij^k = τ_ij / Σ_{s ∈ allowed_k} τ_is,

where allowed_k is the set of positions still available to ant k.
7) When one iteration is complete, i.e. when all ants have covered their paths, the current pheromone concentration is updated:

τ_ij ← (1 - ρ) τ_ij + Σ_{k=1}^{M} Δτ_ij^k,

where ρ is the pheromone volatilization coefficient and Δτ_ij^k is the pheromone concentration left by ant k on path ij. Since pheromone concentration is inversely proportional to path length, it can be defined as

Δτ_ij^k = C / L_k,

where C is a proportionality constant and L_k is the length of ant k's path.
8) After new candidate solutions are obtained, they are compared with the historical best and the historical best is updated.
9) The above procedure is iterated until the maximum number of generations is reached. The historical best parameters are then fed into the model output module as the final result of the parameter optimization.
The parameters output by the parameter optimization module 3 enter the model output module 4 and are used as the parameters of the detector; the model output module then analyses the leukaemia patient gene microarray data subsequently input.
The above embodiment is used to illustrate the present invention rather than to limit it; any modifications and changes made to the present invention within the spirit of the invention and the scope of protection of the claims fall within the scope of protection of the present invention.

Claims (5)

1. A leukaemia cancer cell detector based on swarm intelligence optimization, characterized in that the system consists of a gene microarray reading module, a data preprocessing and feature ranking module, a parameter optimization module, and a model output module.
2. The leukaemia cancer cell detector of swarm intelligence optimization according to claim 1, characterized in that the gene microarray reading module reads the class labels of all gene microarrays, Y = [y_1, y_2, ..., y_m], where y_i = k with k ∈ {-1, 1}, together with the gene expression values of all samples, arranged as an m × n matrix X; each row x_i holds the expression values of all genes for the i-th sample and each column x^j the expression values of the j-th gene across all samples, with i running over the m samples and j over the n genes.
3. The leukaemia cancer cell detector of swarm intelligence optimization according to claim 1, characterized in that the data preprocessing and feature ranking module normalizes the raw microarray data that has been read in and then ranks the features. The normalization maps every expression value into [0, 1]:

x' = (x - Min) / (Max - Min),

where Min and Max are respectively the minimum and maximum of the sample's gene expression values. Feature ranking is realized by scoring the contribution of each gene to classification accuracy; in support vector machine (SVM) theory, a contribution degree function is defined:

J = (1/2) α^T H α - α^T I,

where H_ij = y_i y_j K(x_i, x_j), α = [α_1, ..., α_n] are the coefficients corresponding to the normal vector, H is an intermediate matrix, J is the cost function, I is the all-ones (unit) vector, K is the kernel function, y is the label value, x is the sample feature vector, the superscript T denotes matrix transpose, and the subscripts i and j index the samples. In fact, this expression represents the squared size of the classification margin, and the objective of the SVM is to minimize it. When a linear kernel is used as the SVM kernel function,

||w*||^2 = α*^T H α*,

where w* is the optimal normal vector and α* are the coefficients corresponding to the optimal support vectors. Observing this expression, the importance of each feature is determined by its contribution to the cost function; that is, the contribution of the j-th feature is δ_j = (w*_j)^2, where δ denotes the contribution degree.
When a nonlinear kernel is used as the kernel function, the contribution can generally be approximated as

DJ(i) = (1/2) α^T H α - (1/2) α^T H(-i) α,

where, on the reasonable assumption that α is unchanged after a feature is eliminated, H(-i) denotes the H matrix recomputed with that feature removed; with this assumption the results differ little from those of the linear kernel. This formula can be applied in a loop to compute the contribution of each feature and so rank the genes by importance.
For a set of training samples {x_n, t_n} (n = 1, 2, ..., N, x ∈ R^d, t ∈ {0, 1}), where x denotes a training sample and t its target label, the classification function of the relevance vector machine (RVM) is defined as

y(x; wr) = Σ_{i=1}^{N} wr_i K(x, x_i) + wr_0,

where K(x, x_i) is the kernel function, wr_i are the weights and wr_0 is the bias weight. Applying the logistic sigmoid link function σ(a) = 1/(1 + e^(-a)) to this, the likelihood p(t | wr) of the data set is estimated as

p(t | wr) = Π_{n=1}^{N} σ{y(x_n; wr)}^{t_n} [1 - σ{y(x_n; wr)}]^{1-t_n},

where σ denotes the logistic sigmoid. To avoid overfitting, the RVM places a Gaussian prior probability constraint on each weight wr_i:

p(wr | α) = Π_i N(wr_i | 0, α_i^{-1}).

According to the Laplace approximation, an approximation of the posterior is obtained as follows. First α is fixed and the weights wr_MP of maximum posterior probability are sought; since p(wr | t, α) is proportional to p(t | wr) p(wr | α), wr_MP is found with a second-order Newton method from the gradient

∇ log p(wr | t, α) = Φ^T (t - y) - A·wr,

where y_n = σ{y(x_n; wr)} and A = diag(α_0, α_1, ..., α_N). Then, by the Laplace method, differentiating this expression a second time gives

∇∇ log p(wr | t, α) = -(Φ^T B Φ + A),

where Φ is the N × (N+1) design matrix, Φ = [φ(x_1), φ(x_2), ..., φ(x_N)]^T with φ(x_n) = [1, K(x_n, x_1), K(x_n, x_2), ..., K(x_n, x_N)]^T, and B = diag(β_1, β_2, ..., β_N) is a diagonal matrix with β_n = σ{y(x_n)}[1 - σ{y(x_n)}]. Negating the right-hand side and inverting gives the covariance matrix Σ = (Φ^T B Φ + A)^{-1}. Using Σ and wr_MP, the hyperparameters α are then updated:

α_i^{new} = γ_i / wr_{MP,i}^2, with γ_i = 1 - α_i Σ_ii, and
wr_MP = Σ Φ^T B t.

During iteration most of the α_i approach infinity, so their corresponding wr_i become essentially zero; their basis functions can then be pruned away, which yields sparsity.
4. The leukaemia cancer cell detector of swarm intelligence optimization according to claim 1, characterized in that the parameter optimization module uses an improved swarm intelligence optimization algorithm to increase the diversity of the population, specifically as follows:
1) Initialize the population of the differential evolution (DE) algorithm. In the population, individuals are generated randomly:

x_{j,i}(0) = x_j^L + rand(0, 1) · (x_j^U - x_j^L),

where x_i(0) is the i-th individual of the initial population and x_{j,i}(0) the value of the j-th chromosome gene of the i-th initial individual, rand(0, 1) is a uniform random number in the interval (0, 1), NP is the population size, and the superscripts L and U denote the lower and upper bounds.
2) Mutation: the DE algorithm is distinguished from the genetic algorithm (GA) by its difference strategy: the difference of two randomly selected individuals is scaled and summed with a target individual, that is,

v_i(g+1) = x_{r1}(g) + F · (x_{r2}(g) - x_{r3}(g)), i ≠ r1 ≠ r2 ≠ r3,

where g denotes the g-th generation, F is the scaling factor applied to the difference of the two random vectors, v_i(g+1) is the mutation intermediate variable, and x_{r1}(g), x_{r2}(g), x_{r3}(g) are the r1-th, r2-th and r3-th individuals of generation g.
3) Crossover: the generation-g population x_i(g) is crossed with the intermediate variable v_i(g+1) generated in step 2) to produce

u_{j,i}(g+1) = v_{j,i}(g+1) if rand(0, 1) ≤ CR, otherwise x_{j,i}(g),

where CR is the preset crossover rate and u_{j,i}(g+1) is the crossover intermediate variable.
4) Selection: differential evolution retains individuals for the next generation with the usual greedy rule: if the fitness f(u_i(g+1)) of the individual produced by crossover is greater than the fitness f(x_i(g)) of the previous generation it is retained; otherwise the population is unchanged, i.e.

x_i(g+1) = u_i(g+1) if f(u_i(g+1)) > f(x_i(g)), otherwise x_i(g).

To avoid premature convergence, an adaptive operator λ is designed that scales the mutation factor over the run. Here G_max denotes the maximum number of iterations, G the current iteration number, and F_0 the base mutation operator; the resulting factor is large in the early stage, guaranteeing population diversity, and gradually decreases later, with the intent of preserving the good information accumulated during evolution. In the differential evolution algorithm, if the fitness fails to exceed the historical best for a certain number of iterations, the search is considered to have fallen into a local optimum; at that point the swarm intelligence (ant colony) algorithm is used to jump out of the differential evolution algorithm:
5) The current position information is passed to the ant colony algorithm for initialization, where the number of ant individuals is M and the pheromone concentration is τ_ij = c (c > 0).
6) All ants 1, 2, ..., M are simulated moving toward the goal; the probability p_ij^k that ant k moves from its current position i to the next position j is

p_ij^k = τ_ij / Σ_{s ∈ allowed_k} τ_is,

where allowed_k is the set of positions still available to ant k.
7) When one iteration is complete, i.e. when all ants have covered their paths, the current pheromone concentration is updated:

τ_ij ← (1 - ρ) τ_ij + Σ_{k=1}^{M} Δτ_ij^k,

where ρ is the pheromone volatilization coefficient and Δτ_ij^k is the pheromone concentration left by ant k on path ij. Since pheromone concentration is inversely proportional to path length, it can be defined as

Δτ_ij^k = C / L_k,

where C is a proportionality constant and L_k is the length of ant k's path.
8) After new candidate solutions are obtained, they are compared with the historical best and the historical best is updated.
9) The above procedure is iterated until the maximum number of generations is reached. The historical best parameters are then fed into the model output module as the final result of the parameter optimization.
5. The leukaemia cancer cell detector of swarm intelligence optimization according to claim 1, characterized in that the model output module applies the model obtained by the parameter optimization module: patient data are input directly, and the result is obtained from the label value.
CN201810458511.6A 2018-05-14 2018-05-14 A kind of leukaemia cancer cell detector of colony intelligence optimizing Pending CN108875305A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810458511.6A CN108875305A (en) 2018-05-14 2018-05-14 A kind of leukaemia cancer cell detector of colony intelligence optimizing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810458511.6A CN108875305A (en) 2018-05-14 2018-05-14 A kind of leukaemia cancer cell detector of colony intelligence optimizing

Publications (1)

Publication Number Publication Date
CN108875305A true CN108875305A (en) 2018-11-23

Family

ID=64333928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810458511.6A Pending CN108875305A (en) 2018-05-14 2018-05-14 A kind of leukaemia cancer cell detector of colony intelligence optimizing

Country Status (1)

Country Link
CN (1) CN108875305A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101145171A (en) * 2007-09-15 2008-03-19 中国科学院合肥物质科学研究院 Gene microarray data predication method based on independent component integrated study
CN106971091A (en) * 2017-03-03 2017-07-21 江苏大学 A kind of tumour recognition methods based on certainty particle group optimizing and SVMs
CN107103332A (en) * 2017-04-07 2017-08-29 武汉理工大学 A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset
CN107292128A (en) * 2017-06-27 2017-10-24 湖南农业大学 One kind pairing interacting genes detection method and forecast model


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XINTENG GAO et al.: "A novel effective diagnosis model based on optimized least squares support machine for gene microarray", Applied Soft Computing *
LIU Zhibin et al.: "Optimization Methods and Application Cases", Petroleum Industry Press, 30 November 2013 *
CAO Tao: "Soft-sensor modeling of wastewater treatment based on relevance vector machine", China Master's Theses Full-text Database *
XIONG Weili et al.: "Parameter identification of nonlinear system models based on an improved differential evolution algorithm", Application Research of Computers *

Similar Documents

Publication Publication Date Title
Örkcü et al. Estimating the parameters of 3-p Weibull distribution using particle swarm optimization: A comprehensive experimental comparison
Hassanien et al. Computational intelligence techniques in bioinformatics
CN116982113A (en) Machine learning driven plant gene discovery and gene editing
Maulik Analysis of gene microarray data in a soft computing framework
Alok et al. Semi-supervised clustering for gene-expression data in multiobjective optimization framework
CN104966106B A stepwise biological age prediction method based on support vector machines
Arowolo et al. A survey of dimension reduction and classification methods for RNA-Seq data on malaria vector
Fonseca et al. Phylogeographic model selection using convolutional neural networks
CN105808976A A miRNA target gene prediction method based on a recommendation model
CN108595909A A TA targeting protein prediction method based on an ensemble classifier
Babu et al. A comparative study of gene selection methods for cancer classification using microarray data
US20220245188A1 (en) A system and method for processing biology-related data, a system and method for controlling a microscope and a microscope
CN112837743B (en) Drug repositioning method based on machine learning
Saha et al. Aggregation of multi-objective fuzzy symmetry-based clustering techniques for improving gene and cancer classification
Salem et al. Mgs-cm: a multiple scoring gene selection technique for cancer classification using microarrays
CN108664763A A parameter-optimized lung cancer cell detector
Ramakrishna et al. Evolutionary Optimization Algorithm for Classification of Microarray Datasets with Mayfly and Whale Survival.
CN108875305A (en) A kind of leukaemia cancer cell detector of colony intelligence optimizing
JP2004355174A (en) Data analysis method and system
CN110739028B (en) Cell line drug response prediction method based on K-nearest neighbor constraint matrix decomposition
CN109308934A A gene regulatory network construction method based on integrated feature importance and the chicken swarm algorithm
CN108897988A A colon cancer cell detector of colony intelligence optimizing
Shukla et al. Supervised learning of Plasmodium falciparum life cycle stages using single-cell transcriptomes identifies crucial proteins
CN108695002A An intelligent colon cancer cell detector
Muhammad et al. Gvdeepnet: Unsupervised deep learning techniques for effective genetic variant classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181123