CN108763873A

CN108763873A - Gene classification method and related device

Info

Publication number: CN108763873A
Application number: CN201810522807.XA
Authority: CN
Inventors: 张莉; 黄晓娟; 王邦军; 张召; 李凡长
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2018-05-28
Filing date: 2018-05-28
Publication date: 2018-11-06

Abstract

The invention discloses a gene classification method and related devices. The method includes: standardizing the input gene data used as training samples, the gene data comprising several attributes; building a logistic regression optimization function that represents the expected margin difference of all training samples, from the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples together with the attribute weight vector, establishing a minimization model with the attribute weight vector as a norm constraint term, and computing the attribute weight vector by iteration; and ranking the attribute features by the values of the obtained weight vector and performing classification training on the training gene data according to the ranked attribute features to obtain the optimal feature set serving as the basis for classification, so as to classify gene data to be classified. The gene classification method and related devices provided by the invention can achieve higher classification accuracy.

Description

Gene classification method and related device

Technical field

The present invention relates to the technical field of data processing, and in particular to a gene classification method and system. The invention also relates to a gene classification device and a computer-readable storage medium.

Background

In recent years, research on cancer has attracted growing attention from researchers in medicine and biology. DNA microarray technology, which allows cancer to be studied at the gene level, is one of the research tools currently in use. A DNA microarray, also called a gene chip, is a special substrate carrying a DNA microarray coating; a single experiment can measure the expression data of thousands of genes, providing a data basis for disease research.

After a large amount of gene expression data has been obtained, the genes carrying important disease information must be selected from it, that is, the genes are classified according to the information they carry. In the prior art this is done with machine learning methods. However, the cancer gene expression data sets studied contain very few samples while each sample has thousands of gene dimensions, so these methods require substantial computing resources. Moreover, only some of the genes play a decisive role in the gene classification problem. These issues make conventional machine learning methods very difficult to apply and prevent them from producing satisfactory classification results.

Summary of the invention

In view of this, the present invention provides a gene classification method and related devices that can achieve higher classification accuracy.

To achieve the above object, the present invention provides the following technical solutions:

A gene classification method, including:

standardizing the input gene data used as training samples, the gene data comprising several attributes;

building, from the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples together with the attribute weight vector, a logistic regression optimization function representing the expected margin difference of all training samples, establishing a minimization model with the attribute weight vector as a norm constraint term, and computing the attribute weight vector by iteration;

ranking the attribute features according to the values of the obtained weight vector, and performing classification training on the training gene data according to the ranked attribute features to obtain the optimal feature set serving as the basis for classification, so as to classify gene data to be classified.

Optionally, standardizing the input gene data used as training samples includes applying min-max (range) normalization or z-score (standard deviation) standardization to that data.

Optionally, min-max normalization is applied to the input gene data used as training samples, with the following transformation:

$$x'_{ij} = \frac{x_{ij} - \min_{i} x_{ij}}{\max_{i} x_{ij} - \min_{i} x_{ij}}$$

where the input training sample set is denoted $\{(x_i, y_i)\}_{i=1}^{N}$, $x_i \in R^I$, and $y_i \in \{1, 2, \ldots, C\}$ is the label of sample $x_i$, indicating its class; N is the total number of training samples and I is the dimension of the gene data; $x_{ij}$ is the value of attribute j of the i-th sample, $\max_{i} x_{ij}$ is the maximum of attribute j over all training samples, and $\min_{i} x_{ij}$ is the minimum of attribute j over all training samples.

Optionally, the established minimization model is expressed as follows:

where $w^{t+1}$ denotes the attribute weight vector at iteration t+1; $z_i^{t+1} = |x_i - \alpha_i H_i^{NM}| - |x_i - \beta_i H_i^{NH}|$; $H_i^{NM} \in R^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the different-class samples; $H_i^{NH} \in R^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the same-class samples; $\alpha_i$ is the coefficient vector of the different-class samples with respect to $x_i$; $\beta_i$ is the coefficient vector of the same-class samples with respect to $x_i$; k is the number of neighbors set a priori; and T is the set number of iterations. The input training sample set is denoted $\{(x_i, y_i)\}_{i=1}^{N}$, $x_i \in R^I$, and $y_i \in \{1, 2, \ldots, C\}$ is the label of sample $x_i$, indicating its class; N is the total number of training samples and I is the dimension of the gene data.

Optionally, $\alpha_i$ and $\beta_i$ are obtained by solving the following minimization models, respectively:

where $w^t$ denotes the attribute weight vector at iteration t.

Optionally, performing classification training on the training gene data according to the ranked attribute features to obtain the optimal feature set serving as the basis for classification includes:

according to the ranked attribute features, performing ten-fold cross-validation on the training gene data and selecting the attribute subset that gives the best classification performance to form the optimal feature set.

Optionally, the method further includes: performing feature selection on the training gene data according to the optimal feature set to obtain a new training sample set.

The method for classifying gene data to be classified includes:

standardizing the gene data to be classified;

performing feature selection on the standardized gene data to be classified according to the optimal feature set;

for the gene data to be classified after feature selection, finding its nearest-neighbor sample in the new training sample set, and predicting the class of the gene data to be classified from the class of that nearest-neighbor sample.

A gene classification system, including:

a processing submodule, configured to standardize the input gene data used as training samples, the gene data comprising several attributes;

a computation submodule, configured to build, from the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples together with the attribute weight vector, a logistic regression optimization function representing the expected margin difference of all training samples, to establish a minimization model with the attribute weight vector as a norm constraint term, and to compute the attribute weight vector by iteration;

a training submodule, configured to rank the attribute features according to the values of the obtained weight vector and to perform classification training on the training gene data according to the ranked attribute features, obtaining the optimal feature set serving as the basis for classification, so as to classify gene data to be classified.

A gene classification device, including:

a memory, configured to store a computer program;

a processor, configured to implement the steps of the gene classification method described above when executing the computer program.

A computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of the gene classification method described above when executed by a processor.

As can be seen from the above technical solutions, the gene classification method provided by the present invention first standardizes the input gene data used as training samples. It then establishes a minimization model for the attribute weight vector of the gene data and computes that weight vector by iteration. Specifically, from the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples, together with the attribute weight vector, it builds a logistic regression optimization function representing the expected margin difference of all training samples, and establishes the minimization model with the attribute weight vector as a norm constraint term. The attribute features are then ranked according to the values of the obtained weight vector, and classification training is performed on the training gene data according to the ranked attribute features to obtain the optimal feature set serving as the basis for classification, so as to classify gene data to be classified.

The gene classification method provided by the present invention is based on the margin maximization principle. Building the logistic regression optimization function from the margin difference and the attribute weight vector, and solving the minimization model with the attribute weight vector as a norm constraint term, yields an attribute weight vector that is sparser, so that gene classification can achieve higher classification accuracy.

The gene classification system provided by the present invention can achieve the above beneficial effects.

The gene classification device provided by the present invention can achieve the above beneficial effects.

The computer-readable storage medium provided by the present invention can achieve the above beneficial effects.

Description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of a gene classification method provided in an embodiment of the present invention;

Fig. 2 is a flow chart of the method for classifying gene data to be classified in a gene classification method provided in an embodiment of the present invention;

Fig. 3 is a schematic diagram of a gene classification system provided in an embodiment of the present invention.

Detailed description of the embodiments

To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.

Referring to FIG. 1, a gene classification method provided in an embodiment of the present invention includes the following steps:

S10: Standardize the input gene data used as training samples, the gene data comprising several attributes.

The training sample set of input gene expression data is denoted $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in R^I$ and $y_i \in \{1, 2, \ldots, C\}$ is the label of sample $x_i$, indicating its class; N is the total number of training samples and I is the dimension of the gene data, meaning that the gene data comprise I attributes. Each class represents one disease.

Standardizing gene data means removing the unit restrictions of the attribute data and converting them into dimensionless pure values.

Because the features in each dimension of the gene data have different units, this step standardizes the gene data in the training sample set, removing the unit restriction of each attribute and converting the data into dimensionless pure values, so that indicators with different units or magnitudes can be compared and weighted.

In a specific implementation, the methods for standardizing the training gene data include min-max (range) normalization and z-score (standard deviation) standardization, but are not limited to these; other standardization methods may also be used and also fall within the scope of protection of the present invention.
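As an illustration of the z-score alternative mentioned above, the following is a minimal sketch rather than the patented implementation. The function name `zscore_columns`, the use of the population standard deviation, and the handling of constant attributes are assumptions made for this example.

```python
import math

def zscore_columns(samples):
    """Standardize each attribute (column) to zero mean and unit standard deviation."""
    n = len(samples)
    n_attrs = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(n_attrs)]
    # Assumption: population standard deviation (divide by n, not n - 1).
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in samples) / n)
        for j in range(n_attrs)
    ]
    # Assumption: a constant attribute (std == 0) is mapped to 0.0 to avoid dividing by zero.
    return [
        [(row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(n_attrs)]
        for row in samples
    ]

# Three hypothetical samples with two attributes; the second attribute is constant.
data = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
scaled = zscore_columns(data)
```

After this transformation every non-constant attribute has mean 0 and unit variance, so attributes measured in different units become comparable.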

In one specific implementation of this method, this step standardizes the input gene data used as training samples by min-max normalization, with the following transformation:

$$x'_{ij} = \frac{x_{ij} - \min_{i} x_{ij}}{\max_{i} x_{ij} - \min_{i} x_{ij}}$$

where $x_{ij}$ is the value of attribute j of the i-th sample, $\max_{i} x_{ij}$ is the maximum of attribute j over all training samples, and $\min_{i} x_{ij}$ is the minimum of attribute j over all training samples.
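The min-max transformation of step S10 can be sketched as follows. This is a minimal example under stated assumptions: the helper names `fit_min_max` and `apply_min_max` are hypothetical, and mapping a constant attribute to 0.0 is a choice made here, since the text does not cover that case.

```python
def fit_min_max(samples):
    """Per-attribute minimum and maximum over all training samples."""
    n_attrs = len(samples[0])
    mins = [min(row[j] for row in samples) for j in range(n_attrs)]
    maxs = [max(row[j] for row in samples) for j in range(n_attrs)]
    return mins, maxs

def apply_min_max(sample, mins, maxs):
    """Scale one sample with the training-set statistics."""
    # Assumption: a constant attribute (max == min) is mapped to 0.0.
    return [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, lo, hi in zip(sample, mins, maxs)
    ]

# Three hypothetical training samples with two attributes on very different scales.
train = [[2.0, 100.0], [4.0, 300.0], [6.0, 200.0]]
mins, maxs = fit_min_max(train)
train_scaled = [apply_min_max(row, mins, maxs) for row in train]
outside = apply_min_max([8.0, 50.0], mins, maxs)
```

Keeping the fitting and the application separate lets the same training-set minima and maxima be reused for data to be classified; values of such data can fall outside [0, 1], as `outside` shows.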

S11: From the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples, together with the attribute weight vector, build a logistic regression optimization function representing the expected margin difference of all training samples, establish a minimization model with the attribute weight vector as a norm constraint term, and compute the attribute weight vector by iteration.

In this step, the weight vector over the attributes of the gene data is computed from the training samples. Specifically, a minimization model is established for the attribute weight vector of the gene data, and the weight vector is obtained by solving that model.

Here, a different-class neighbor sample is a sample in the training sample set that does not belong to the same class as the current training sample and is a neighbor of it; a same-class neighbor sample is a sample in the training sample set that belongs to the same class as the current training sample and is a neighbor of it. In the minimization model of this method, the logistic regression optimization function representing the expected margin difference of all training samples is built from the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples, together with the attribute weight vector.
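The two kinds of neighbor samples defined above can be collected as in the sketch below. It is illustrative only: the text does not state which distance is used to find neighbors, so the weighted Euclidean distance under the current weight vector `w` is an assumption, as are the function names.

```python
import math

def weighted_distance(a, b, w):
    """Assumed distance: Euclidean distance with each attribute scaled by its weight."""
    return math.sqrt(sum(wj * (aj - bj) ** 2 for aj, bj, wj in zip(a, b, w)))

def hit_and_miss_neighbors(samples, labels, i, k, w):
    """Return the k nearest same-class and k nearest different-class samples of sample i."""
    order = sorted(
        (idx for idx in range(len(samples)) if idx != i),
        key=lambda idx: weighted_distance(samples[i], samples[idx], w),
    )
    hits = [samples[idx] for idx in order if labels[idx] == labels[i]][:k]
    misses = [samples[idx] for idx in order if labels[idx] != labels[i]][:k]
    return hits, misses

# Hypothetical two-class data with two attributes.
X = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.0], [0.8, 0.9]]
y = [1, 1, 1, 2, 2, 2]
w = [0.5, 0.5]
hits, misses = hit_and_miss_neighbors(X, y, 0, 2, w)
```

Each returned list holds the neighbors as rows; stacking them as columns gives an I×k matrix of the kind the description calls a neighbor sample matrix.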

In one embodiment, this step specifically includes the following process:

S110: At t = 0, initialize $w^t = [w_1, w_2, \ldots, w_I]^T = [1/I, 1/I, \ldots, 1/I]^T$, where $w_j$ denotes the weight of attribute j. Set the number of iterations to T.

S111: Solve the established minimization model, which is expressed as follows:

where $w^{t+1}$ denotes the attribute weight vector at iteration t+1; $z_i^{t+1} = |x_i - \alpha_i H_i^{NM}| - |x_i - \beta_i H_i^{NH}|$; $H_i^{NM} \in R^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the different-class samples; $H_i^{NH} \in R^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the same-class samples; $\alpha_i$ is the coefficient vector of the different-class samples with respect to $x_i$; $\beta_i$ is the coefficient vector of the same-class samples with respect to $x_i$; and k is the number of neighbors set a priori.
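The margin-difference vector $z_i$ defined above can be sketched as follows. The sketch reads the absolute value as element-wise, so that $z_i$ has one entry per attribute, and takes the coefficient vectors as given inputs; the uniform coefficients in the example are placeholders, not the solution of the patent's minimization model. Function names are hypothetical.

```python
def combine(neighbors, coeffs):
    """Linear combination of neighbor samples (one coefficient per neighbor)."""
    n_attrs = len(neighbors[0])
    return [sum(c * nb[j] for c, nb in zip(coeffs, neighbors)) for j in range(n_attrs)]

def margin_vector(x, misses, hits, alpha, beta):
    """Per-attribute margin difference: distance to the different-class point
    minus distance to the same-class point (element-wise absolute values)."""
    miss_point = combine(misses, alpha)
    hit_point = combine(hits, beta)
    return [abs(xj - mj) - abs(xj - hj) for xj, mj, hj in zip(x, miss_point, hit_point)]

# Hypothetical sample with k = 2 neighbors of each kind.
x = [0.0, 0.0]
hits = [[0.1, 0.0], [0.2, 0.1]]
misses = [[0.8, 0.9], [0.9, 1.0]]
alpha = [0.5, 0.5]  # placeholder coefficients (assumption)
beta = [0.5, 0.5]   # placeholder coefficients (assumption)
z = margin_vector(x, misses, hits, alpha, beta)
```

A large positive entry of `z` means the attribute separates the sample from the other class more than from its own class, which is the property the weight vector is meant to reward.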

The method of this embodiment is based on the margin maximization principle. From the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples, together with the attribute weight vector, it builds a logistic regression optimization function representing the expected margin difference of all training samples, with the attribute weight vector as a norm constraint term. In this embodiment the attribute weight vector specifically serves as an L1-norm constraint term in the minimization model that is solved for the weight vector. The weight vector obtained in this way is sparser.
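To illustrate why an L1-norm term yields a sparse weight vector, the sketch below applies proximal-gradient steps with soft-thresholding. The objective $\sum_i \log(1 + e^{-w^T z_i}) + \lambda \|w\|_1$ is an assumed form of an L1-regularized logistic loss over the margin vectors and is not taken from the patent; the solver, the parameters `lam` and `step`, and the function names are likewise assumptions.

```python
import math

def soft_threshold(v, tau):
    """Shrink v toward zero by tau; values within [-tau, tau] become exactly zero."""
    return math.copysign(max(abs(v) - tau, 0.0), v)

def l1_logistic_step(w, Z, lam, step):
    """One proximal-gradient step on sum_i log(1 + exp(-w.z_i)) + lam * ||w||_1."""
    grad = [0.0] * len(w)
    for z in Z:
        s = sum(wj * zj for wj, zj in zip(w, z))
        coeff = -1.0 / (1.0 + math.exp(s))  # derivative of log(1 + exp(-s)) w.r.t. s
        for j, zj in enumerate(z):
            grad[j] += coeff * zj
    return [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]

# Hypothetical margin vectors: attribute 0 separates the classes, attribute 1 carries no margin.
Z = [[0.7, 0.0], [0.9, 0.0]]
w = [0.5, 0.5]
for _ in range(10):
    w = l1_logistic_step(w, Z, lam=0.1, step=1.0)
```

In this toy run the weight of the uninformative attribute is driven to exactly zero while the informative attribute keeps a positive weight, which is the sparsity effect the paragraph above refers to.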

In this embodiment, $\alpha_i$ and $\beta_i$ may be obtained by solving the following minimization models, respectively:

S112: If $\|w^{t+1} - w^t\| \le \theta$, output the weight vector and set $w = w^{t+1}$; otherwise set t = t+1 and return to step S111.
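The control flow of steps S110 to S112 can be sketched as a loop with a convergence test. The solver for the minimization model is passed in as the callable `update`, which is a hypothetical stand-in; the toy update used in the example is not the patent's model. The default values of `theta` and `max_iter` are the ones given in the worked example of this description (θ = 0.01, T = 38).

```python
import math

def iterate_weights(update, n_attrs, theta=0.01, max_iter=38):
    """Iterate w^{t+1} = update(w^t) until the change is at most theta."""
    w = [1.0 / n_attrs] * n_attrs          # S110: uniform initial weights 1/I
    for t in range(max_iter):
        w_next = update(w)                 # S111: solve the model for w^{t+1}
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_next, w)))
        if change <= theta:                # S112: stopping test
            return w_next, t + 1
        w = w_next
    return w, max_iter

# Toy update (assumption): move halfway toward a fixed target weight vector.
target = [1.0, 0.0]
w_final, n_iter = iterate_weights(
    lambda w: [0.5 * wj + 0.5 * tj for wj, tj in zip(w, target)],
    n_attrs=2,
)
```

The loop stops either when successive weight vectors differ by at most θ or when the iteration budget is used up, whichever comes first.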

S12: Rank the attribute features according to the values of the obtained weight vector, and perform classification training on the training gene data according to the ranked attribute features to obtain the optimal feature set serving as the basis for classification, so as to classify gene data to be classified.

In this step, the attribute features are ranked by the values of the weight vector obtained from the iteration in the previous step.

In a specific implementation, ten-fold cross-validation may be performed on the training gene data according to the ranked attribute features, and the attribute subset that gives the best classification performance is selected to form the optimal feature set. Specifically, a KNN classifier may be used for classification training on the training samples.
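Step S12 can be sketched as below. The description specifies ranking by weight, ten-fold cross-validation, and a KNN classifier; the remaining details are assumptions of this example: candidate subsets are the top-m prefixes of the ranking, the classifier is 1-nearest-neighbor, folds are assigned by index modulo the fold count, and ties go to the smaller subset.

```python
def rank_attributes(weights):
    """Attribute indices sorted by decreasing weight."""
    return sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)

def nearest_label(x, train_X, train_y, feats):
    """1-NN prediction using only the attributes listed in feats."""
    dists = [sum((x[j] - row[j]) ** 2 for j in feats) for row in train_X]
    return train_y[dists.index(min(dists))]

def cv_accuracy(X, y, feats, n_folds=10):
    """Cross-validated accuracy; sample i is assigned to fold i % n_folds."""
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct += sum(nearest_label(X[i], train_X, train_y, feats) == y[i] for i in test_idx)
    return correct / len(X)

def select_optimal_features(X, y, weights, n_folds=10):
    """Try the top-m ranked attributes for every m and keep the best-scoring prefix."""
    ranked = rank_attributes(weights)
    best_feats, best_acc = ranked[:1], -1.0
    for m in range(1, len(ranked) + 1):
        acc = cv_accuracy(X, y, ranked[:m], n_folds)
        if acc > best_acc:  # strict '>' keeps the smallest subset on ties
            best_feats, best_acc = ranked[:m], acc
    return best_feats, best_acc

# Hypothetical data: attribute 0 separates the two classes, attributes 1 and 2 are noise.
X = [[(i % 2) + 0.01 * i, (i * 7 % 10) / 10, (i * 3 % 10) / 10] for i in range(20)]
y = [i % 2 for i in range(20)]
weights = [0.9, 0.05, 0.05]
feats, acc = select_optimal_features(X, y, weights)
```

On real microarray data with thousands of attributes, evaluating every prefix would be costly, so an implementation would likely cap m or evaluate a coarser grid of subset sizes.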

The model training process performed on the gene data training samples in the gene classification method provided by the present invention has been described above. The process of classifying input gene data to be classified is described below.

In the gene classification method of this embodiment, after the optimal feature set is obtained, the method further includes step S13: performing feature selection on the training gene data according to the optimal feature set to obtain a new training sample set.

The obtained optimal feature set is denoted F. In this step, feature selection is performed on the gene data in the training sample set using the obtained optimal feature set, yielding a new training sample set restricted to the features in F.

Referring to FIG. 2, classifying a gene data sample to be classified specifically includes the following process:

S20: Standardize the gene data to be classified.

The input gene data to be classified are denoted x, $x \in R^I$.

In a specific implementation, the methods for standardizing the gene data to be classified include min-max (range) normalization and z-score (standard deviation) standardization, but are not limited to these; other standardization methods may also be used and all fall within the scope of protection of the present invention.

In one specific implementation of this embodiment, the input gene data to be classified are standardized by min-max normalization, with the following transformation:

$$x'_{j} = \frac{x_{j} - \min_{i} x_{ij}}{\max_{i} x_{ij} - \min_{i} x_{ij}}$$

where $x_j$ is the value of attribute j of the gene data to be classified, $\max_{i} x_{ij}$ is the maximum of attribute j over all training samples, and $\min_{i} x_{ij}$ is the minimum of attribute j over all training samples.

S21: Perform feature selection on the standardized gene data to be classified according to the optimal feature set.

Feature selection is performed on the gene data x to be classified according to the obtained optimal feature set, yielding the gene data x′.

S22: For the gene data to be classified after feature selection, find its nearest-neighbor sample in the new training sample set, and predict the class of the gene data to be classified from the class of that nearest-neighbor sample.

For the gene data x′ to be classified after feature selection, its nearest neighbor is sought in the new training sample set, and the class of the gene data to be classified is then predicted from the class of the nearest-neighbor sample found.
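Steps S20 to S22 can be combined into a single prediction routine, sketched below under stated assumptions. The function name `classify` and the example values are hypothetical; the training set passed in is assumed to be already min-max scaled and reduced to the selected features, and squared Euclidean distance is assumed for the nearest-neighbor search.

```python
def classify(sample, train_X, train_y, mins, maxs, feats):
    """Predict the class of one raw sample from its nearest training neighbor."""
    # S20: min-max scaling with the training-set minima and maxima
    scaled = [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, lo, hi in zip(sample, mins, maxs)
    ]
    # S21: keep only the attributes in the optimal feature set
    reduced = [scaled[j] for j in feats]
    # S22: nearest neighbor in the reduced training set (squared Euclidean distance)
    dists = [sum((a - b) ** 2 for a, b in zip(reduced, row)) for row in train_X]
    return train_y[dists.index(min(dists))]

# Hypothetical setup: three raw attributes, of which attributes 0 and 2 were selected.
mins = [0.0, 0.0, 10.0]
maxs = [10.0, 1.0, 20.0]
feats = [0, 2]
train_reduced = [[0.1, 0.2], [0.9, 0.8]]  # already scaled and reduced
train_labels = [1, 2]
pred_a = classify([8.0, 0.5, 19.0], train_reduced, train_labels, mins, maxs, feats)
pred_b = classify([1.0, 0.5, 12.0], train_reduced, train_labels, mins, maxs, feats)
```

The new sample must be scaled with the training-set statistics and reduced with the same feature indices, otherwise its distances to the training samples are not comparable.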

The gene classification method of the present invention is described in detail below with a specific example.

The purpose of this specific example is to distinguish two different leukemias, namely acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The data set provided is divided into two subsets: 38 training samples (27 ALL, 11 AML), used to select genes and adjust the classifier weights; and 34 test samples (20 ALL, 14 AML), used to evaluate the performance of the results obtained by the system. Each sample has 7129 features, corresponding to normalized gene expression values extracted from microarray images. ALL is treated as class 1 and AML as class 2. The specific implementation process is as follows:

The model training process includes the following steps:

(a) The training sample set of input gene expression data is denoted $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in R^I$ and $y_i \in \{1, 2, \ldots, C\}$ is the label of sample $x_i$, indicating its class; N is the total number of training samples and I is the dimension of the gene data, meaning that the gene data comprise I attributes. Each class represents one disease; this specific example includes the two classes ALL and AML, with N = 38 and I = 7129.

(b) Min-max normalization is applied to the gene data in the training sample set, with the following transformation:

$$x'_{ij} = \frac{x_{ij} - \min_{i} x_{ij}}{\max_{i} x_{ij} - \min_{i} x_{ij}}$$

where $x_{ij}$ is the value of attribute j of the i-th sample, $\max_{i} x_{ij}$ is the maximum of attribute j over all training samples, and $\min_{i} x_{ij}$ is the minimum of attribute j over all training samples.

(c) Compute the weight vector corresponding to the attributes.

(1) At t = 0, initialize $w^t = [w_1, w_2, \ldots, w_I]^T = [1/I, 1/I, \ldots, 1/I]^T$, where $w_j$ denotes the weight of attribute j. Set the number of iterations to T = 38 and the allowable error to θ = 0.01.

(2) Solve the established minimization model, which is expressed as follows:

where $w^{t+1}$ denotes the attribute weight vector at iteration t+1; $z_i^{t+1} = |x_i - \alpha_i H_i^{NM}| - |x_i - \beta_i H_i^{NH}|$; $H_i^{NM} \in R^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the different-class samples; $H_i^{NH} \in R^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the same-class samples; $\alpha_i$ is the coefficient vector of the different-class samples with respect to $x_i$; $\beta_i$ is the coefficient vector of the same-class samples with respect to $x_i$; and k is the number of neighbors set a priori.

In this specific example, k may be chosen from the set {2, 4, ..., 10} by leave-one-out cross-validation.

(3) If $\|w^{t+1} - w^t\| \le \theta$, output the weight vector and set $w = w^{t+1}$; otherwise set t = t+1 and return to step (2).

(d) Obtain the optimal feature set. The attribute features are ranked by the values of the weight vector computed in the above steps. According to the ranked attribute features, ten-fold cross-validation is performed on the training sample set with a KNN classifier, and the attribute subset that gives the best classification performance is selected to form the optimal feature set F.

(e) Feature selection is performed on the training sample set according to the optimal feature set F, yielding the training sample set after feature selection.

The evaluation process includes the following steps:

(a) Input the gene data x to be classified, $x \in R^{7129}$.

(b) Min-max normalization is applied to the gene data to be classified, with the following transformation:

$$x'_{j} = \frac{x_{j} - \min_{i} x_{ij}}{\max_{i} x_{ij} - \min_{i} x_{ij}}$$

(c) Feature selection is performed on the gene data x to be classified according to the optimal feature set F; the resulting gene data are denoted x′.

(d) The nearest-neighbor sample of x′ is sought in the new training sample set, and the class of the gene data x to be classified is predicted from the class of that nearest-neighbor sample.

The 34 test samples of dimension 7129 were classified with this method, and the results were compared with those obtained by the conventional Relief algorithm and the LH-Relief algorithm on the same data set, as shown in Table 1.

Table 1

Method	Recognition rate (%)	Precision (%)	Recall (%)	F-measure (%)
This method	99.13	99.14	98.75	98.97
Relief algorithm	74.35	72.19	71.88	71.73
LH-Relief algorithm	76.09	75.53	72.04	72.69

As these results show, compared with the conventional methods, this method achieves a higher recognition rate and higher classification accuracy on gene data.
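For reference, the four measures reported in Table 1 can be computed from predicted and true labels as sketched below. These are the standard definitions with one class treated as positive; the description does not say how its figures were averaged over classes or runs, so this sketch does not attempt to reproduce Table 1, and the labels in the example are made up.

```python
def binary_metrics(y_true, y_pred, positive):
    """Recognition rate (accuracy), precision, recall and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return accuracy, precision, recall, f_measure

# Hypothetical labels: class 1 is treated as the positive class.
y_true = [1, 1, 1, 2, 2]
y_pred = [1, 1, 2, 2, 2]
accuracy, precision, recall, f_measure = binary_metrics(y_true, y_pred, positive=1)
```

With more than two classes, the per-class values would typically be averaged; which averaging the description used is not stated.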

Correspondingly, referring to FIG. 3, an embodiment of the present invention also provides a gene classification system, including:

a processing submodule 30, configured to standardize the input gene data used as training samples, the gene data comprising several attributes;

a computation submodule 31, configured to build, from the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples together with the attribute weight vector, a logistic regression optimization function representing the expected margin difference of all training samples, to establish a minimization model with the attribute weight vector as a norm constraint term, and to compute the attribute weight vector by iteration;

a training submodule 32, configured to rank the attribute features according to the values of the obtained weight vector and to perform classification training on the training gene data according to the ranked attribute features, obtaining the optimal feature set serving as the basis for classification, so as to classify gene data to be classified.

As can be seen from the above, the gene classification system provided in this embodiment is based on the margin maximization principle. It builds the logistic regression optimization function representing the expected margin difference of all training samples from the margin difference and the attribute weight vector, and solves the minimization model with the attribute weight vector as a norm constraint term, so that the resulting attribute weight vector is sparser and gene classification can achieve higher classification accuracy.

In a specific implementation, the way each module of the gene classification system of this embodiment processes gene data can be found in the detailed description of the gene classification method embodiments above and is not repeated here.

Correspondingly, an embodiment of the present invention also provides a gene classification device, including:

a memory, configured to store a computer program;

a processor, configured to implement the steps of the gene classification method described above when executing the computer program.

The gene classification device provided in this embodiment is based on the margin maximization principle. It builds the logistic regression optimization function representing the expected margin difference of all training samples from the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples, together with the attribute weight vector, and solves the minimization model with the attribute weight vector as a norm constraint term, so that the resulting attribute weight vector is sparser and gene classification can achieve higher classification accuracy.

Correspondingly, an embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of the gene classification method described above when executed by a processor.

When the computer program stored on the computer-readable storage medium provided in this embodiment is executed by a processor, it implements the following: based on the margin maximization principle, building the logistic regression optimization function representing the expected margin difference of all training samples from the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples, together with the attribute weight vector, and solving the minimization model with the attribute weight vector as a norm constraint term, so that the resulting attribute weight vector is sparser and gene classification can achieve higher classification accuracy.

The gene classification method and related devices provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. It should be noted that a person of ordinary skill in the art can make several improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of protection of the claims of the present invention.

Claims

1. A gene classification method, characterized by comprising:

standardizing input gene data serving as training samples, the gene data comprising several attributes;

building, from the difference between the margin from a training sample to its different-class neighbor samples and the margin from the training sample to its same-class neighbor samples, together with the weight vector of the corresponding attributes, a logistic regression optimization function expressing the expected margin difference of all training samples, building a minimization model with the weight vector of the corresponding attributes as the norm constraint term, and computing the weight vector of the corresponding attributes by iterative operation;

ranking the attribute features according to the values of the obtained weight vector, performing classification training on the gene data serving as training samples according to the ranked attribute features, and obtaining an optimal feature set serving as the classification basis, so as to classify gene data to be classified.
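The iterative computation of the weight vector can be illustrated with a minimal sketch. The claim does not state the exact loss, the norm, or the solver, so everything specific here is an assumption: the loss is taken as the sum of log(1 + exp(-w·z_i)) over the margin vectors z_i, the norm term as an L1 penalty with a hypothetical coefficient `lam`, and the solver as proximal gradient descent with soft-thresholding. The margin vectors are also held fixed across iterations, which the claimed method may not do.

```python
import math

def logistic_tail(score):
    """Return 1 / (1 + exp(score)), computed without overflow."""
    if score >= 0:
        e = math.exp(-score)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(score))

def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink a value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_weights(margins, lam=0.1, step=0.1, iterations=200):
    """Assumed model: minimize sum_i log(1 + exp(-w.z_i)) + lam * ||w||_1."""
    dims = len(margins[0])
    w = [0.0] * dims
    for _ in range(iterations):
        grad = [0.0] * dims
        for z in margins:
            score = sum(wj * zj for wj, zj in zip(w, z))
            # Gradient of log(1 + exp(-w.z)) with respect to w is -z / (1 + exp(w.z)).
            coeff = -logistic_tail(score)
            for j in range(dims):
                grad[j] += coeff * z[j]
        # Gradient step on the smooth loss, then shrinkage for the L1 term.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return w
```

Under these assumptions, an attribute whose margin component is always zero keeps a weight of exactly zero, which is the kind of sparsity the description refers to.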

2. The gene classification method according to claim 1, characterized in that standardizing the input gene data serving as training samples comprises: applying deviation standardization or standard deviation standardization to the input gene data serving as training samples.
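Standard deviation standardization is usually understood as z-score scaling: each attribute is shifted to zero mean and scaled to unit standard deviation. The sketch below follows that common definition. The function name `zscore_standardize`, the use of the population standard deviation, and the handling of constant attributes are assumptions, since the claim gives no formula for this variant.

```python
import math

def zscore_standardize(samples):
    """Scale each attribute (column) to zero mean and unit standard deviation."""
    n = len(samples)
    dims = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(dims)]
    # Population standard deviation (divide by n); an assumption, not stated in the text.
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in samples) / n)
        for j in range(dims)
    ]
    # A constant attribute has zero deviation; map it to 0.0 rather than divide by zero.
    return [
        [(row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(dims)]
        for row in samples
    ]
```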

3. The gene classification method according to claim 1, characterized in that deviation standardization is applied to the input gene data serving as training samples, with the following transfer function:

$$x_{ij} \leftarrow \frac{x_{ij} - \min_i x_{ij}}{\max_i x_{ij} - \min_i x_{ij}}$$

where the input training sample set is expressed as $\{x_i, y_i\}_{i=1}^{N}$, $x_i \in R^I$, $y_i \in \{1, 2, \ldots, C\}$ is the label of sample $x_i$ and indicates its class, $N$ is the total number of training samples, and $I$ is the dimension of the gene data; $x_{ij}$ is the value of attribute $j$ of the $i$-th sample, $\max_i x_{ij}$ is the maximum value of attribute $j$ over all training samples, and $\min_i x_{ij}$ is the minimum value of attribute $j$ over all training samples.
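Deviation standardization based on the per-attribute maximum and minimum is commonly implemented as min-max scaling to the range [0, 1]. A minimal sketch follows. The function name `min_max_normalize` and the choice to map a constant attribute to 0.0 are assumptions. The per-attribute minima and maxima are returned so that the same scaling can later be applied to new samples.

```python
def min_max_normalize(samples):
    """Scale each attribute (column) of the training samples to [0, 1]."""
    dims = len(samples[0])
    lows = [min(row[j] for row in samples) for j in range(dims)]
    highs = [max(row[j] for row in samples) for j in range(dims)]
    normalized = []
    for row in samples:
        normalized.append([
            # A constant attribute has max == min; map it to 0.0 to avoid dividing by zero.
            (row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(dims)
        ])
    return normalized, lows, highs
```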

4. The gene classification method according to claim 1, characterized in that the minimization model that is built is expressed as follows:

where $w^{t+1}$ is the weight vector of the corresponding attributes at the $(t+1)$-th iteration, $z_i^{t+1} = |x_i - \alpha_i H_i^{NM}| - |x_i - \beta_i H_i^{NH}|$, $H_i^{NM} \in R^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the different-class samples, $H_i^{NH} \in R^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the same-class samples, $\alpha_i$ is the coefficient vector of the different-class samples with respect to $x_i$, $\beta_i$ is the coefficient vector of the same-class samples with respect to $x_i$, $k$ is the number of neighbors set a priori, and $T$ is the set number of iterations; the input training sample set is expressed as $\{x_i, y_i\}_{i=1}^{N}$, $x_i \in R^I$, $y_i \in \{1, 2, \ldots, C\}$ is the label of sample $x_i$ and indicates its class, $N$ is the total number of training samples, and $I$ is the dimension of the gene data.
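The margin vector of a single sample can be computed as in the sketch below. It assumes the coefficient vectors are already known, reads each product as the I×k neighbor matrix times a length-k coefficient vector so that the dimensions agree, and takes the absolute value element by element. The function names `mat_vec` and `margin_vector` are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply an I x k matrix (given as a list of I rows) by a length-k vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def margin_vector(x, h_miss, alpha, h_hit, beta):
    """Return z = |x - H_NM * alpha| - |x - H_NH * beta|, element by element."""
    miss_approx = mat_vec(h_miss, alpha)  # x approximated from different-class neighbors
    hit_approx = mat_vec(h_hit, beta)     # x approximated from same-class neighbors
    return [
        abs(xi - m) - abs(xi - h)
        for xi, m, h in zip(x, miss_approx, hit_approx)
    ]
```

A positive component of the result means that, on that attribute, the sample lies farther from its different-class neighbors than from its same-class neighbors.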

5. The gene classification method according to claim 4, characterized in that $\alpha_i$ and $\beta_i$ are obtained by solving the following minimization models, respectively:

6. The gene classification method according to claim 1, characterized in that performing classification training on the gene data serving as training samples according to the ranked attribute features and obtaining the optimal feature set serving as the classification basis comprises:

performing ten-fold cross-validation on the gene data serving as training samples according to the ranked attribute features, and selecting the attribute subset that gives the best classification performance to form the optimal feature set.

7. The gene classification method according to claim 1, characterized by further comprising: performing feature selection on the gene data serving as training samples according to the optimal feature set, to obtain a new training sample set;

wherein classifying the gene data to be classified comprises:

standardizing the gene data to be classified;

for the gene data to be classified after feature selection, finding its nearest-neighbor sample in the new training sample set, and predicting the class of the gene data to be classified according to the class of the nearest-neighbor sample.
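The prediction step can be sketched as follows. The sketch assumes min-max scaling with the minima and maxima taken from the training data, and Euclidean distance for the nearest-neighbor search. Neither choice is fixed by the claim, and the function name `classify` and its parameters are hypothetical. `train_x` is expected to hold training samples that have already been standardized and reduced to the optimal feature set.

```python
def classify(query, train_x, train_y, features, lows, highs):
    """Predict the class of one raw sample with a 1-nearest-neighbor rule."""
    # Standardize the query with the training minima and maxima (assumed scaling).
    scaled = [
        (query[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
        for j in range(len(query))
    ]
    # Keep only the attributes in the optimal feature set.
    reduced = [scaled[j] for j in features]
    # Nearest training sample by squared Euclidean distance.
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], reduced)),
    )
    return train_y[best]
```

For example, with one selected attribute and two training samples labeled with the hypothetical classes "normal" and "tumor", a query that scales to a value near 1.0 is assigned the label of the training sample at 1.0.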

8. A gene classification system, characterized by comprising:

a processing submodule, configured to standardize input gene data serving as training samples, the gene data comprising several attributes;

a computation submodule, configured to build, from the difference between the margin from a training sample to its different-class neighbor samples and the margin from the training sample to its same-class neighbor samples, together with the weight vector of the corresponding attributes, a logistic regression optimization function expressing the expected margin difference of all training samples, to build a minimization model with the weight vector of the corresponding attributes as the norm constraint term, and to compute the weight vector of the corresponding attributes by iterative operation;

a training submodule, configured to rank the attribute features according to the values of the obtained weight vector, to perform classification training on the gene data serving as training samples according to the ranked attribute features, and to obtain an optimal feature set serving as the classification basis, so as to classify gene data to be classified.

9. A gene classification device, characterized by comprising:

a memory, configured to store a computer program;

a processor, configured to implement the steps of the gene classification method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the gene classification method according to any one of claims 1 to 7.