CN108664763A

CN108664763A - Lung cancer cell detector with optimal parameters

Info

Publication number: CN108664763A
Application number: CN201810458000.4A
Authority: CN
Inventors: 刘兴高; 高信腾; 孙元萌
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2018-05-14
Filing date: 2018-05-14
Publication date: 2018-10-16

Abstract

The invention discloses a lung cancer cell detector with optimal parameters, consisting of a gene microarray reading module, a data preprocessing and feature sorting module, a parameter optimization module, and a model output module. The system first preprocesses the input gene microarray data and then ranks the remaining genes by importance: correlation is computed with a statistical score, the classifier criterion function is then used to compute each gene's contribution, and all genes are ranked by importance. The improved optimization method adds fitness monitoring and population perturbation to the original intelligent optimization algorithm, which prevents the loss of population diversity and keeps the optimization from becoming trapped in a local optimum. The optimal parameters found are then used as classifier parameters to complete model construction and output the result. The system is fast and suitable for online detection.

Description

Optimal parameter lung cancer cell detector

Technical Field

The invention relates to the technical field of gene microarray data application, in particular to a lung cancer cell detector with optimal parameters.

Background

The 21st century is the century of life science. Data from DNA microarrays (also called gene microarrays) have great research value. In basic medicine, a microarray can rapidly measure the expression values of a very large number of genes, so that expression differences between typical samples can be compared and disease-causing genes can be discovered and detected. Clinically, the early stage of a tumor already shows changes in the gene expression profile of the cell, so research on microarray data can enable early detection and early treatment and thereby guide clinical practice.

Lung cancer is one of the malignant tumors whose incidence and mortality are growing fastest and that most threaten human health and life. Many countries have reported marked increases in lung cancer incidence and mortality over the past 50 years. In men, lung cancer incidence and mortality rank first among all malignant tumors; in women, incidence and mortality both rank second. The etiology of lung cancer is still not fully understood. According to a 1985 US estimate, 80% of lung cancers in men and 79% in women are attributable to smoking. Nicotine, benzopyrene, nitrosamines, and the small amount of the radioactive element polonium in tobacco smoke are carcinogenic and are particularly likely to cause squamous cell carcinoma and small cell carcinoma. Lung tumors have many risk factors, and the medical community regards them as a multifactorial disease: low immunity, endocrine disorders, depressed mood, and family inheritance may all contribute. Research shows that lung cancer incidence and mortality have risen year by year in recent years, so finding the pathogenic genes of lung cancer is significant work.

Gene microarray data are typically high-dimensional with few samples: the number of genes observed per sample is usually in the thousands or even tens of thousands, while only tens of samples are observed in one experiment. In pattern recognition, too high a dimension leads to the curse of dimensionality. On the one hand, algorithm running time grows exponentially with the dimension; on the other hand, traditional informatics algorithms that rely on probability density estimation cannot be applied when there are too few samples. The most urgent need is therefore a feasible method that reduces the feature dimension and selects the optimal feature gene subset, so that the data are most separable in the feature subspace. Given the optimal feature subset, how to select the classifier parameters while avoiding the inefficiency and randomness of manual tuning is also a major research topic.

Disclosure of Invention

To overcome the current difficulty of searching for the optimal feature subset of gene microarray data and the optimal classification parameters, the invention aims to provide a lung cancer cell detector with optimal parameters.

The technical scheme adopted by the invention for solving the technical problems is as follows: the lung cancer cell detector with optimal parameters consists of a gene microarray reading module, a data preprocessing and feature sorting module, a parameter optimizing module and a model output module; wherein:

The gene microarray reading module reads in the class labels Y = [y₁, y₂, ..., y_m] of all gene microarrays, where y_i = k, k ∈ {-1, 1}, and the gene microarray expression values of all samples:

X = (x_ij), i = 1, ..., m; j = 1, ..., n

wherein each row x_i represents the expression values of all genes in one sample, and each column x_j represents the expression values of one gene in all samples; the index i denotes the i-th sample (m in total) and the index j denotes the j-th gene (n in total).
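As an illustration of this data layout, the following Python sketch stores the labels and the expression matrix as plain lists and checks their shapes. The function names `read_microarray` and `gene_column` are hypothetical and are not taken from the patent.

```python
def read_microarray(labels, rows):
    """Validate and return (Y, X): m class labels and an m x n expression matrix."""
    if len(labels) != len(rows):
        raise ValueError("need exactly one label per sample")
    if any(y not in (-1, 1) for y in labels):
        raise ValueError("labels must be -1 or 1")
    n = len(rows[0])
    if any(len(r) != n for r in rows):
        raise ValueError("every sample needs the same number of genes")
    Y = list(labels)
    X = [[float(v) for v in r] for r in rows]  # row i = sample i, column j = gene j
    return Y, X


def gene_column(X, j):
    """Expression values of gene j across all samples (one column of X)."""
    return [row[j] for row in X]


# Three samples (m = 3) with two genes each (n = 2); the values are made up.
Y, X = read_microarray([1, -1, 1], [[2.0, 5.0], [3.0, 1.0], [4.0, 0.5]])
print(len(X), len(X[0]))
print(gene_column(X, 1))
```

Keeping samples as rows matches the description above, so a single patient's profile is `X[i]` and a single gene's profile is one column.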

The data preprocessing and feature sorting module normalizes the raw microarray data that has been read in and ranks its features. The normalization operation is:

x' = (x - Min) / (Max - Min)

where x is a raw expression value, x' is its normalized value, and Min and Max are respectively the minimum and maximum gene expression values of the sample. Feature ranking and selection are achieved by scoring the contribution of each gene to the classification accuracy, through a contribution function defined as:

where α = [α₁, ..., α_n] and H_ij = y_i y_j K(x_i, x_j), with K the kernel function. In fact, this formula represents the squared size of the classification margin, from which it follows that:

Here w is the normal vector of the classification hyperplane, w^* is the optimal normal vector, α is the coefficient vector corresponding to the normal vector, and α^* is the coefficient vector corresponding to the optimal normal vector. From the above formula it follows that the importance of each feature is determined by its contribution to the cost function; that is, the contribution value of each feature is:

where δ represents the degree of contribution.
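The normalization step that this module applies before feature ranking can be sketched in Python as follows. The sketch assumes that Min and Max are taken over each sample row, which is one reading of the text, and it maps a constant row to zeros to avoid division by zero. Both choices, and the function names, are assumptions.

```python
def min_max_normalize(sample):
    """Scale one sample's expression values to [0, 1] with (x - Min) / (Max - Min)."""
    lo, hi = min(sample), max(sample)
    if hi == lo:
        # Assumption: a constant sample carries no contrast, so map it to zeros.
        return [0.0 for _ in sample]
    return [(v - lo) / (hi - lo) for v in sample]


def normalize_matrix(X):
    """Apply min-max normalization to every sample (row) of the matrix."""
    return [min_max_normalize(row) for row in X]


print(normalize_matrix([[2.0, 4.0, 6.0], [10.0, 10.0, 10.0]]))
```

If Min and Max are meant per gene rather than per sample, the same helper can be applied to the columns instead.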

When a nonlinear kernel is used as the kernel function, the contribution can generally be approximated as:

In this case, it is reasonable to assume that the α values are unchanged after a feature is eliminated; H(-i) denotes the H matrix after that feature is eliminated.

where x_i represents an n × 1 input feature vector and t_i represents an m × 1 target vector. Given an activation function g(x) and the number of hidden-layer nodes Ñ, the ELM gene detection system is:

Σ_{i=1}^{Ñ} β_i g(ω_i·x_j + b_i) = o_j, j = 1, ..., N

where ω_i represents the weight vector between the i-th hidden-layer node and the input layer, b_i represents the bias of the i-th hidden-layer node, β_i represents the weight vector between the i-th hidden-layer node and the output layer, and o_j represents the output corresponding to the j-th input. In addition, ω_i·x_j denotes the inner product of ω_i and x_j.

The output of the network can approximate the N input samples with zero error, i.e.:

Σ_{j=1}^{N} ||o_j - t_j|| = 0

from which the following can be obtained:

Σ_{i=1}^{Ñ} β_i g(ω_i·x_j + b_i) = t_j, j = 1, ..., N

The above formula can be expressed in matrix form as Hβ = T

where H is the output matrix of the hidden layer, and the i-th column of H is the output of the i-th hidden-layer node for the N inputs x₁, x₂, …, x_N. The input weights and hidden-layer biases of a single-hidden-layer feedforward neural network (SLFN) need not be adjusted during training and can be assigned arbitrarily, which gives:

The solution of this equation can be obtained quickly with a linear method, as shown in the formula:

β̂ = H† T

where H† denotes the Moore-Penrose generalized inverse of H, and β̂ denotes the minimum-norm least-squares solution, that is, the least-squares solution with the smallest norm. Compared with many existing gene detection systems, the extreme learning machine achieves a good training result at very high speed by solving with the Moore-Penrose generalized inverse.
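The ELM training procedure described above can be sketched in plain Python. This is a minimal sketch under stated assumptions: a sigmoid activation, hidden weights and biases drawn uniformly from [-1, 1], and a small ridge term added to the normal equations as a numerically simple stand-in for the Moore-Penrose generalized inverse. All function names and default values are illustrative and are not the patent's.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_output(X, W, b):
    """Hidden-layer output matrix H with H[j][i] = g(w_i . x_j + b_i)."""
    return [[sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + bi)
             for w, bi in zip(W, b)] for x in X]


def solve_linear(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [A[r][:] + [y[r]] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def train_elm(X, t, n_hidden=10, ridge=1e-3, seed=0):
    """Random hidden layer, then output weights from (H^T H + ridge*I) beta = H^T t."""
    rng = random.Random(seed)
    n_in = len(X[0])
    W = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    H = hidden_output(X, W, b)
    N = len(X)
    A = [[sum(H[j][p] * H[j][q] for j in range(N)) + (ridge if p == q else 0.0)
          for q in range(n_hidden)] for p in range(n_hidden)]
    rhs = [sum(H[j][p] * t[j] for j in range(N)) for p in range(n_hidden)]
    beta = solve_linear(A, rhs)
    return W, b, beta


def predict_elm(model, X):
    """Return a label in {-1, 1} for each sample from the sign of H beta."""
    W, b, beta = model
    H = hidden_output(X, W, b)
    return [1 if sum(h * bt for h, bt in zip(row, beta)) >= 0 else -1 for row in H]


# Toy data: two made-up "genes" per sample, labels in {-1, 1}.
X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.9, 0.8], [0.8, 0.9], [0.85, 0.7]]
t = [-1, -1, -1, 1, 1, 1]
model = train_elm(X, t)
print(predict_elm(model, X))
```

Only the output weights are solved for; the hidden layer stays at its random initial values, which is what makes training a single linear solve.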

The parameter optimization module uses an improved parameter optimization algorithm that increases the diversity of the population. The specific design is as follows:

1) Initialize the population information of the DE algorithm:

Individuals in the population are generated randomly:

x_j,i(0) = x_j,i^L + rand(0,1)·(x_j,i^U - x_j,i^L), i = 1, ..., NP

In the above formula, x_i(0) denotes the i-th individual (chromosome) of the initial generation, x_j,i(0) denotes the j-th gene of the i-th individual of the initial generation, rand(0,1) is a uniform random number in the interval (0,1), NP is the population size, and the superscripts L and U denote the lower and upper bounds, respectively.

2) Mutation operation (Mutation): the DE algorithm differs from the genetic algorithm (GA) in that it performs mutation with a differential strategy: the difference between two randomly chosen individuals is scaled and then added, as a vector, to the individual to be mutated, i.e.

v_i(g+1) = x_r1(g) + F·(x_r2(g) - x_r3(g)), i ≠ r1 ≠ r2 ≠ r3

In the above formula, g denotes the g-th generation, F is the scaling factor applied to the difference of the two random vectors, v_i(g+1) is the mutation intermediate variable, and x_r1(g), x_r2(g), x_r3(g) are the individuals with indices r1, r2 and r3 in the g-th generation.

3) Crossover operation (Crossover): the g-th generation population x_i(g) and the intermediate variable v_i(g+1) generated in step 2) are crossed to give

In the above formula, CR is the preset crossover rate and u_j,i(g+1) is the crossover intermediate variable.

4) Selection operation (Selection): the differential evolution algorithm uses the usual greedy rule to decide which individuals survive into the next generation. If the fitness f(u_i(g+1)) of the individual produced by crossover is greater than the fitness f(x_i(g)) of the previous generation, the new individual is kept; otherwise the population is unchanged, i.e.

x_i(g+1) = u_i(g+1) if f(u_i(g+1)) > f(x_i(g)); otherwise x_i(g+1) = x_i(g)

To avoid premature convergence, an adaptive operator λ is designed:

In the above formula, G_max denotes the maximum number of iterations, G denotes the current iteration number, and F₀ is the mutation operator. The resulting mutation factor is large in the early stage, which ensures sample diversity, and gradually decreases in the later stage, which protects the good information accumulated during evolution. In the differential evolution algorithm, if the fitness still fails to exceed the historical optimum after a certain number of iterations, the algorithm is considered to be trapped in a local optimum; a swarm intelligence algorithm is then used to jump out of the differential evolution algorithm:

5) Initialize the ant colony algorithm with the current position information, where the number of ants is M and the pheromone concentration is τ_ij = c (c > 0).

6) Simulate all ants 1, 2, ..., M moving to the end point; the probability that each ant moves from the current position i to the next position j is:

7) When one iteration is finished, that is, when all ants have completed their paths, update the current pheromone concentration:

In the above formula, ρ is the pheromone evaporation coefficient and Δτ_ij^k denotes the pheromone concentration left by ant k on path ij. Because pheromone concentration is inversely proportional to path length, it can be defined as:

Δτ_ij^k = C / L

In the above formula, C is a proportional constant and L is the path length.

8) After a new candidate solution is obtained, it is compared with the historical best, and the historical best is updated if the candidate is better.

9) The above process is run iteratively until the maximum number of generations is reached. The historical optimal parameters are then passed to the model output module as the final result of parameter optimization.
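Steps 1) to 4), together with the stagnation check, can be sketched in Python as follows. This is a sketch under several assumptions, not the patent's implementation: the crossover is the standard binomial form, the adaptive schedule is the commonly used F = F₀·2^λ with λ = exp(1 - G_max / (G_max + 1 - G)), and a random re-initialization of the worse half of the population stands in for the ant colony step that is triggered on stagnation. The names `optimize`, `adaptive_factor` and `patience` are hypothetical.

```python
import math
import random


def adaptive_factor(F0, G, G_max):
    """Assumed schedule: lam = exp(1 - G_max / (G_max + 1 - G)), F = F0 * 2**lam."""
    lam = math.exp(1.0 - G_max / (G_max + 1.0 - G))
    return F0 * 2.0 ** lam


def optimize(fitness, lower, upper, NP=20, F0=0.5, CR=0.9, G_max=100,
             patience=15, seed=0):
    """Maximize fitness over the box [lower, upper] with DE plus a stagnation monitor."""
    rng = random.Random(seed)
    D = len(lower)

    def random_individual():
        # Step 1: x_j = x_j^L + rand(0,1) * (x_j^U - x_j^L)
        return [lower[j] + rng.random() * (upper[j] - lower[j]) for j in range(D)]

    pop = [random_individual() for _ in range(NP)]
    fit = [fitness(x) for x in pop]
    best_fit = max(fit)
    best = list(pop[fit.index(best_fit)])
    stall = 0
    for G in range(1, G_max + 1):
        F = adaptive_factor(F0, G, G_max)
        new_pop, new_fit = [], []
        for i in range(NP):
            # Step 2: mutation with three distinct indices different from i.
            r1, r2, r3 = rng.sample([r for r in range(NP) if r != i], 3)
            v = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j]) for j in range(D)]
            v = [min(max(v[j], lower[j]), upper[j]) for j in range(D)]
            # Step 3: binomial crossover (assumed form), one gene always from v.
            j_rand = rng.randrange(D)
            u = [v[j] if (rng.random() <= CR or j == j_rand) else pop[i][j]
                 for j in range(D)]
            # Step 4: greedy selection, keep u only if it is strictly fitter.
            fu = fitness(u)
            if fu > fit[i]:
                new_pop.append(u)
                new_fit.append(fu)
            else:
                new_pop.append(pop[i])
                new_fit.append(fit[i])
        pop, fit = new_pop, new_fit
        if max(fit) > best_fit:
            best_fit = max(fit)
            best = list(pop[fit.index(best_fit)])
            stall = 0
        else:
            stall += 1
        if stall >= patience:
            # Stand-in perturbation: re-randomize the worse half of the population.
            worst_first = sorted(range(NP), key=lambda k: fit[k])
            for k in worst_first[:NP // 2]:
                pop[k] = random_individual()
                fit[k] = fitness(pop[k])
            stall = 0
    if max(fit) > best_fit:
        best_fit = max(fit)
        best = list(pop[fit.index(best_fit)])
    return best, best_fit


# Toy fitness with its maximum (value 0) at the point (1, -2).
f = lambda x: -((x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2)
best, best_fit = optimize(f, [-5.0, -5.0], [5.0, 5.0], G_max=200)
print(best, best_fit)
```

In the detector, the fitness would be a measure of classifier quality for a candidate parameter vector; the quadratic function here is only a placeholder that makes the sketch runnable.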

The model output module feeds patient data directly into the model obtained by the above process, and the result is read from the predicted label value.

The invention has the following beneficial effects: during intelligent optimization, the invention sets monitoring variables to increase the diversity of the population, thereby increasing the probability of finding the optimal parameters; the system is fast and suitable for online detection.

Drawings

FIG. 1 is a schematic structural view of the present invention;

FIG. 2 is a flow chart of the present invention.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings.

Referring to FIG. 1, a lung cancer cell detector with optimal parameters comprises a gene microarray reading module 1, a data preprocessing and feature sorting module 2, a parameter optimizing module 3 and a model output module 4; wherein:

The gene microarray reading module 1 reads in the class labels Y = [y₁, y₂, ..., y_m] of all gene microarrays, where y_i = k, k ∈ {-1, 1}, and the gene microarray expression values of all samples:

X = (x_ij), i = 1, ..., m; j = 1, ..., n

wherein each row x_i represents the expression values of all genes in one sample, and each column x_j represents the expression values of one gene in all samples; the index i denotes the i-th sample (m in total) and the index j denotes the j-th gene (n in total).

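The data preprocessing and feature sorting module 2 normalizes the raw microarray data that has been read in and ranks its features. The normalization operation is:

x' = (x - Min) / (Max - Min)

where x is a raw expression value, x' is its normalized value, and Min and Max are respectively the minimum and maximum gene expression values of the sample.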

Here w is the normal vector of the classification hyperplane, w^* is the optimal normal vector, α is the coefficient vector corresponding to the normal vector, and α^* is the coefficient vector corresponding to the optimal normal vector. From the above formula it follows that the importance of each feature is determined by its contribution to the cost function; that is, the contribution value of each feature is:

where δ represents the degree of contribution.

the following can be obtained:

The above formula can be expressed in matrix form as Hβ = T

The parameter optimizing module 3 uses an improved parameter optimization algorithm that increases the diversity of the population. The specific design is as follows:

1) Initialize the population information of the DE algorithm:

Individuals in the population are generated randomly:

2) Mutation operation (Mutation): the DE algorithm differs from the genetic algorithm (GA) in that it performs mutation with a differential strategy: the difference between two randomly chosen individuals is scaled and then added, as a vector, to the individual to be mutated, i.e.

v_i(g+1) = x_r1(g) + F·(x_r2(g) - x_r3(g)), i ≠ r1 ≠ r2 ≠ r3

In the above formula, G_max denotes the maximum number of iterations, G denotes the current iteration number, and F₀ is the mutation operator. The resulting mutation factor is large in the early stage, which ensures sample diversity, and gradually decreases in the later stage, which protects the good information accumulated during evolution. In the differential evolution algorithm, if the fitness still fails to exceed the historical optimum after a certain number of iterations, the algorithm is considered to be trapped in a local optimum; a swarm intelligence algorithm is then used to jump out of the differential evolution algorithm:

In the above formula, C is a proportional constant and L is the path length.

9) The above process is run iteratively until the maximum number of generations is reached. The historical optimal parameters are then passed to the model output module 4 as the final result of parameter optimization.
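The ant colony steps 5) to 7) can be sketched in Python as follows. The pheromone update uses the evaporation coefficient ρ and a deposit of C / L, as described in the disclosure. The transition rule is the standard Ant System form with a heuristic term η and exponents α and β, which is an assumption, since the text does not spell the rule out. All function and parameter names are hypothetical.

```python
def transition_probabilities(tau_row, eta_row, allowed, alpha=1.0, beta=2.0):
    """Probability of moving from the current position to each allowed position j.

    Assumed standard form: p_j is proportional to tau_j**alpha * eta_j**beta.
    """
    weights = {j: (tau_row[j] ** alpha) * (eta_row[j] ** beta) for j in allowed}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def update_pheromone(tau, paths, lengths, rho=0.5, C=1.0):
    """Evaporate all pheromone by (1 - rho), then deposit C / L on each edge used.

    paths[k] is the list of positions visited by ant k; lengths[k] is its path length L.
    """
    n = len(tau)
    new_tau = [[(1.0 - rho) * tau[i][j] for j in range(n)] for i in range(n)]
    for path, L in zip(paths, lengths):
        for i, j in zip(path, path[1:]):
            new_tau[i][j] += C / L
    return new_tau


# Step 5: uniform initial pheromone tau_ij = c with c = 1 on a 3-position toy problem.
tau = [[1.0] * 3 for _ in range(3)]
eta = [[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
print(transition_probabilities(tau[0], eta[0], allowed=[1, 2]))
print(update_pheromone(tau, paths=[[0, 1, 2]], lengths=[2.0]))
```

Shorter paths receive a larger deposit C / L, so edges on good solutions become more likely to be chosen in the next iteration.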

The parameters output by the parameter optimizing module 3 enter the model output module 4 and serve as the parameters of the model. The model output module 4 then analyzes subsequently input gene microarray data of actual lung cancer patients.

The above-described embodiments are intended to illustrate rather than to limit the invention, and any modifications and variations made within the spirit of the invention and the scope of the appended claims fall within the protection scope of the invention.

Claims

1. A lung cancer cell detector with optimal parameters, characterized in that: the system consists of a gene microarray reading module, a data preprocessing and feature sorting module, a parameter optimizing module and a model output module.

2. The parameter optimized lung cancer cell detector of claim 1, wherein: the gene microarray reading module reads in the class labels Y = [y₁, y₂, ..., y_m] of all gene microarrays, where y_i = k, k ∈ {-1, 1}, and the gene microarray expression values of all samples:

X = (x_ij), i = 1, ..., m; j = 1, ..., n

3. The parameter optimized lung cancer cell detector of claim 1, wherein: the data preprocessing and feature sorting module is used for normalizing and feature sorting the original microarray data read in by the gene microarray reading module. Wherein the normalization operation is:

Here w is the normal vector of the classification hyperplane, w^* is the optimal normal vector, α is the coefficient vector corresponding to the normal vector, and α^* is the coefficient vector corresponding to the optimal normal vector. From the above formula it follows that the importance of each feature is determined by its contribution to the cost function; that is, the contribution value of each feature is:

where δ represents the degree of contribution.

Here it is reasonable to assume that the α values are unchanged after a feature is eliminated, and H(-i) denotes the H matrix after that feature is eliminated; under this assumption, the result differs little from that of the linear kernel.

where x_i represents an n × 1 input feature vector and t_i represents an m × 1 target vector. Given an activation function g(x) and the number of hidden-layer nodes Ñ, the ELM gene detection system is:

Σ_{i=1}^{Ñ} β_i g(ω_i·x_j + b_i) = o_j, j = 1, ..., N

where ω_i represents the weight vector between the i-th hidden-layer node and the input layer, b_i represents the bias of the i-th hidden-layer node, β_i represents the weight vector between the i-th hidden-layer node and the output layer, and o_j represents the output corresponding to the j-th input. In addition, ω_i·x_j denotes the inner product of ω_i and x_j.

the following can be obtained:

The above formula can be expressed in matrix form as Hβ = T

4. The parameter optimized lung cancer cell detector of claim 1, wherein: the parameter optimizing module increases the diversity of the population by using an improved parameter optimizing algorithm, which is as follows:

1) Initialize the population information of the DE algorithm:

Individuals in the population are generated randomly:

v_i(g+1) = x_r1(g) + F·(x_r2(g) - x_r3(g)), i ≠ r1 ≠ r2 ≠ r3

In the above formula, G_max denotes the maximum number of iterations, G denotes the current iteration number, and F₀ is the mutation operator. The resulting mutation factor is large in the early stage, which ensures sample diversity, and gradually decreases in the later stage, which protects the good information accumulated during evolution. In the differential evolution algorithm, if the fitness still fails to exceed the historical optimum after a certain number of iterations, the algorithm is considered to be trapped in a local optimum; a swarm intelligence algorithm is then used to jump out of the differential evolution algorithm:

In the above formula, C is a proportional constant and L is the path length.

5. The parameter optimized lung cancer cell detector of claim 1, wherein: the model output module feeds patient data directly into the model obtained from the parameter optimizing module, and the result is read from the predicted label value.