CN108763873A - Gene classification method and related device - Google Patents

Info

Publication number
CN108763873A
Authority
CN
China
Prior art keywords
sample
gene
training
training sample
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810522807.XA
Other languages
Chinese (zh)
Inventor
张莉
黄晓娟
王邦军
张召
李凡长
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201810522807.XA
Publication of CN108763873A
Legal status: Pending

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a gene classification method and related devices. The method includes: standardizing the input gene data used as training samples, where the gene data comprises several attributes; building a logistic regression objective function that represents the expected margin difference over all training samples, using the difference between each training sample's margin to its different-class (heterogeneous) neighbor samples and its margin to its same-class neighbor samples together with the attribute weight vector, forming a minimization model by adding a norm constraint term on the attribute weight vector, and computing the attribute weight vector iteratively; ranking the attribute features by the values of the resulting weight vector, and performing classification training on the training gene data with the ranked features to obtain the optimal feature set used as the basis for classification, so that gene data to be classified can be classified. The gene classification method and related devices provided by the invention can achieve higher classification accuracy.

Description

Gene classification method and related device
Technical Field
The present invention relates to the technical field of data processing, and in particular to a gene classification method and system. The invention also relates to a gene classification device and a computer-readable storage medium.
Background
In recent years, cancer research has attracted growing attention from researchers in medicine and biology. DNA microarray technology, which allows cancer to be studied at the gene level, is one of the research tools currently in use. A DNA microarray, also called a gene chip, is a coating of DNA spots arrayed on a special substrate; a single experiment can measure the expression levels of thousands of genes, providing the data basis for disease research.
Once a large amount of gene expression data has been obtained, the genes that carry important disease information must be selected from it, that is, the genes must be classified according to the information they carry. In the prior art this is done with machine learning methods. However, the cancer gene expression data sets under study contain very few samples, while each sample has thousands of gene dimensions, so these methods require substantial computing resources. Moreover, only some of the genes play a decisive role in the gene classification problem. As a result, traditional machine learning methods face great difficulty and cannot produce satisfactory classification results.
Summary of the Invention
In view of this, the present invention provides a gene classification method and related devices that can achieve higher classification accuracy.
To achieve the above object, the present invention provides the following technical solutions:
A gene classification method, including:
standardizing the input gene data used as training samples, where the gene data comprises several attributes;
building a logistic regression objective function that represents the expected margin difference over all training samples, using the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples together with the attribute weight vector, forming a minimization model by adding a norm constraint term on the attribute weight vector, and computing the attribute weight vector iteratively;
ranking the attribute features by the values of the resulting weight vector, and performing classification training on the training gene data with the ranked features to obtain the optimal feature set used as the basis for classification, so as to classify gene data to be classified.
Optionally, standardizing the input gene data used as training samples includes: applying min-max (deviation) standardization or standard-deviation (z-score) standardization to the input gene data used as training samples.
Optionally, min-max standardization is applied to the input gene data used as training samples, with the following transfer function:
$$x'_{ij} = \frac{x_{ij} - \min_{1 \le i \le N} x_{ij}}{\max_{1 \le i \le N} x_{ij} - \min_{1 \le i \le N} x_{ij}}$$
where the input training sample set is denoted $\{(x_i, y_i)\}_{i=1}^{N}$, $x_i \in \mathbb{R}^I$, and $y_i \in \{1, 2, \ldots, C\}$ is the label of sample $x_i$, indicating its class; $N$ is the total number of training samples and $I$ is the dimension of the gene data; $x_{ij}$ is the value of attribute $j$ of the $i$-th sample, $\max_{1 \le i \le N} x_{ij}$ is the maximum of attribute $j$ over all training samples, and $\min_{1 \le i \le N} x_{ij}$ is the minimum of attribute $j$ over all training samples.
Optionally, the established minimization model is expressed as follows:
where $w^{t+1}$ denotes the attribute weight vector at iteration $t+1$; $z_i^{t+1} = |x_i - H_i^{NM}\alpha_i| - |x_i - H_i^{NH}\beta_i|$; $H_i^{NM} \in \mathbb{R}^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the different-class samples; $H_i^{NH} \in \mathbb{R}^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the same-class samples; $\alpha_i$ is the coefficient vector of the different-class samples with respect to $x_i$; $\beta_i$ is the coefficient vector of the same-class samples with respect to $x_i$; $k$ is the number of neighbors, set a priori; and $T$ is the set number of iterations. The input training sample set is denoted $\{(x_i, y_i)\}_{i=1}^{N}$, $x_i \in \mathbb{R}^I$, and $y_i \in \{1, 2, \ldots, C\}$ is the label of sample $x_i$, indicating its class; $N$ is the total number of training samples and $I$ is the dimension of the gene data.
Optionally, $\alpha_i$ and $\beta_i$ are obtained by solving the following minimization models, respectively:
where $w^t$ denotes the attribute weight vector at iteration $t$.
Optionally, performing classification training on the training gene data with the ranked attribute features to obtain the optimal feature set used as the basis for classification includes:
performing ten-fold cross-validation on the training gene data according to the ranked attribute features, and selecting the attribute subset that gives the best classification performance as the optimal feature set.
Optionally, the method further includes: performing feature selection on the training gene data according to the optimal feature set to obtain a new training sample set;
the method of classifying gene data to be classified includes:
standardizing the gene data to be classified;
performing feature selection on the standardized gene data to be classified according to the optimal feature set;
for the gene data to be classified after feature selection, finding its nearest-neighbor sample in the new training sample set, and predicting the class of the gene data to be classified from the class of that nearest-neighbor sample.
A gene classification system, including:
a processing submodule, configured to standardize the input gene data used as training samples, where the gene data comprises several attributes;
a computation submodule, configured to build a logistic regression objective function that represents the expected margin difference over all training samples, using the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples together with the attribute weight vector, to form a minimization model by adding a norm constraint term on the attribute weight vector, and to compute the attribute weight vector iteratively;
a training submodule, configured to rank the attribute features by the values of the resulting weight vector and to perform classification training on the training gene data with the ranked features to obtain the optimal feature set used as the basis for classification, so as to classify gene data to be classified.
A gene classification device, including:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the gene classification method described above when executing the computer program.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the gene classification method described above.
As the above technical solution shows, the gene classification method provided by the invention first standardizes the input gene data used as training samples. It then establishes a minimization model for the attribute weight vector of the gene data and computes that weight vector iteratively. Specifically, a logistic regression objective function representing the expected margin difference over all training samples is built from the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples, together with the attribute weight vector, and the minimization model is formed by adding a norm constraint term on the attribute weight vector. The attribute features are then ranked by the values of the resulting weight vector, and classification training is performed on the training gene data with the ranked features to obtain the optimal feature set used as the basis for classification, so as to classify gene data to be classified.
The gene classification method provided by the invention is based on the margin-maximization principle. Because the attribute weight vector is solved from a logistic regression objective on the expected margin difference with a norm constraint term on the weight vector, the resulting weight vector is sparser, so gene classification can achieve higher accuracy.
The gene classification system provided by the invention achieves the same beneficial effects.
The gene classification device provided by the invention achieves the same beneficial effects.
The computer-readable storage medium provided by the invention achieves the same beneficial effects.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a gene classification method provided by an embodiment of the invention;
Fig. 2 is a flow chart of the method of classifying gene data to be classified in a gene classification method provided by an embodiment of the invention;
Fig. 3 is a schematic diagram of a gene classification system provided by an embodiment of the invention.
Detailed Description
To help those skilled in the art better understand the technical solutions of the invention, the technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Referring to Fig. 1, a gene classification method provided by an embodiment of the invention includes the following steps:
S10: Standardize the input gene data used as training samples, where the gene data comprises several attributes.
The input training sample set of gene expression data is denoted $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^I$ and $y_i \in \{1, 2, \ldots, C\}$ is the label of sample $x_i$, indicating its class; $N$ is the total number of training samples, and $I$ is the dimension of the gene data, meaning that the gene data contains $I$ attributes. Each class represents one disease.
Standardizing the gene data means removing the unit restrictions of the attribute data and converting the data into dimensionless pure values.
Because each dimension of the gene data has a different unit, this step standardizes the gene data in the training sample set, removing the unit restriction of each attribute and converting it into a dimensionless pure value, so that indicators with different units or magnitudes can be compared and weighted.
In a specific implementation, the standardization of the training gene data may be min-max (deviation) standardization or standard-deviation (z-score) standardization, but is not limited to these; other standardization methods may also be used and likewise fall within the scope of protection of the invention.
In one specific implementation of the method, this step applies min-max standardization to the input gene data used as training samples, with the following transfer function:
$$x'_{ij} = \frac{x_{ij} - \min_{1 \le i \le N} x_{ij}}{\max_{1 \le i \le N} x_{ij} - \min_{1 \le i \le N} x_{ij}}$$
where $x_{ij}$ is the value of attribute $j$ of the $i$-th sample, $\max_{1 \le i \le N} x_{ij}$ is the maximum of attribute $j$ over all training samples, and $\min_{1 \le i \le N} x_{ij}$ is the minimum of attribute $j$ over all training samples.
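The min-max transfer function of step S10 can be sketched as follows. This is a minimal illustration, assuming samples are stored as plain Python lists; the function names `min_max_fit` and `min_max_transform`, the toy data, and the choice to map constant attributes to 0.0 are assumptions made here, not details given in the source.

```python
def min_max_fit(samples):
    """Return the per-attribute minima and maxima over the training samples."""
    columns = list(zip(*samples))
    return [min(c) for c in columns], [max(c) for c in columns]


def min_max_transform(sample, mins, maxs):
    """Scale one sample attribute by attribute: (x - min) / (max - min).

    Assumption: an attribute that is constant in the training set maps to 0.0
    to avoid division by zero.
    """
    return [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, lo, hi in zip(sample, mins, maxs)
    ]


# Toy training set: 3 samples, 3 attributes (the last one is constant).
train = [[2.0, 10.0, 5.0], [4.0, 30.0, 5.0], [6.0, 20.0, 5.0]]
mins, maxs = min_max_fit(train)
scaled = [min_max_transform(x, mins, maxs) for x in train]
print(scaled)
```

A sample to be classified would reuse the training minima and maxima, so its scaled values can fall outside [0, 1] when it lies outside the training range.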
S11: Build a logistic regression objective function that represents the expected margin difference over all training samples, using the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples together with the attribute weight vector; form a minimization model by adding a norm constraint term on the attribute weight vector; and compute the attribute weight vector iteratively.
In this step, the weight vector over the gene data attributes is computed from the training samples. Specifically, a minimization model is established for the attribute weight vector of the gene data, and the weight vector is obtained by solving the model.
Here, a different-class (heterogeneous) neighbor sample is a sample in the training set that does not belong to the same class as the current training sample and is a neighbor of it; a same-class neighbor sample is a sample in the training set that belongs to the same class as the current training sample and is a neighbor of it. In the minimization model of this method, the logistic regression objective function representing the expected margin difference over all training samples is built from the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples, together with the attribute weight vector.
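To make the two kinds of neighbors concrete, the sketch below picks the k nearest same-class and different-class samples for one training sample. It assumes unweighted Euclidean distance; the source does not state which distance is used to pick neighbors, and the function name `nearest_neighbors` and the toy data are hypothetical.

```python
import math


def nearest_neighbors(index, samples, labels, k, same_class):
    """Indices of the k samples closest to samples[index], restricted to the
    same class (same_class=True) or to the other classes (same_class=False).

    Assumption: plain Euclidean distance over all attributes.
    """
    x, y = samples[index], labels[index]
    candidates = [
        j for j in range(len(samples))
        if j != index and (labels[j] == y) == same_class
    ]
    candidates.sort(key=lambda j: math.dist(x, samples[j]))
    return candidates[:k]


samples = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.3],
           [1.0, 1.0], [0.9, 1.0], [0.4, 0.4]]
labels = [1, 1, 1, 2, 2, 2]

hits = nearest_neighbors(0, samples, labels, k=2, same_class=True)
misses = nearest_neighbors(0, samples, labels, k=2, same_class=False)
print(hits, misses)
```

The columns of the neighbor matrices used later in the method would be the samples at these indices.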
In one embodiment, this step specifically includes the following process:
S110: At $t = 0$, initialize $w^t = [w_1, w_2, \ldots, w_I]^T = [1/I, 1/I, \ldots, 1/I]^T$, where $w_j$ is the weight of attribute $j$. Set the number of iterations to $T$.
S111: Solve the established minimization model, which is expressed as follows:
where $w^{t+1}$ denotes the attribute weight vector at iteration $t+1$; $z_i^{t+1} = |x_i - H_i^{NM}\alpha_i| - |x_i - H_i^{NH}\beta_i|$; $H_i^{NM} \in \mathbb{R}^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the different-class samples; $H_i^{NH} \in \mathbb{R}^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the same-class samples; $\alpha_i$ is the coefficient vector of the different-class samples with respect to $x_i$; $\beta_i$ is the coefficient vector of the same-class samples with respect to $x_i$; and $k$ is the number of neighbors, set a priori.
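The margin difference for one sample can be illustrated as below. This is a sketch under stated assumptions: the absolute value is taken element-wise so that the result is an I-dimensional vector, the neighbor matrices multiply the coefficient vectors as a linear combination of neighbor samples, and the coefficient vectors are simply given (the source obtains them from a separate minimization whose formula is not reproduced here). All names are hypothetical.

```python
def combine(neighbors, coeffs):
    """Linear combination sum_j coeffs[j] * neighbors[j], i.e. H @ alpha
    when the neighbor samples are the columns of H."""
    dim = len(neighbors[0])
    return [sum(c * nb[d] for c, nb in zip(coeffs, neighbors)) for d in range(dim)]


def margin_vector(x, miss_neighbors, alpha, hit_neighbors, beta):
    """Element-wise |x - H_NM alpha| - |x - H_NH beta| (assumed reading)."""
    miss_point = combine(miss_neighbors, alpha)
    hit_point = combine(hit_neighbors, beta)
    return [abs(a - m) - abs(a - h) for a, m, h in zip(x, miss_point, hit_point)]


x = [0.2, 0.8]
miss_neighbors = [[1.0, 0.0], [0.8, 0.2]]   # different-class neighbors
hit_neighbors = [[0.0, 1.0], [0.2, 0.6]]    # same-class neighbors
alpha = [0.5, 0.5]                          # assumed coefficients
beta = [0.5, 0.5]                           # assumed coefficients

z = margin_vector(x, miss_neighbors, alpha, hit_neighbors, beta)
print(z)
```

An attribute with a large positive entry separates the sample from the other class better than it separates it from its own class, which is why such attributes should receive larger weights.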
The method of this embodiment is based on the margin-maximization principle. It builds a logistic regression objective function representing the expected margin difference over all training samples from the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples, together with the attribute weight vector, and uses the attribute weight vector as a norm constraint term; in this embodiment, specifically an L1-norm constraint term. The minimization model is established on this basis and solved for the attribute weight vector. The weight vector obtained in this way is sparser.
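Since the model's formula is not reproduced in this text, the following is only one plausible instance of an L1-constrained logistic objective on margin vectors, not the patent's exact model. It assumes the loss sum_i log(1 + exp(-w·z_i)) + lam·||w||_1 with non-negative weights, solved by proximal gradient descent; `lam`, `step`, `iters`, the non-negativity constraint, and the function name are all assumptions.

```python
import math


def l1_logistic_weights(z_vectors, lam=0.1, step=0.1, iters=500):
    """Proximal-gradient sketch for
        min_w  sum_i log(1 + exp(-w . z_i)) + lam * ||w||_1,  w >= 0.
    Assumed objective; the source formula is not available here.
    """
    dim = len(z_vectors[0])
    w = [1.0 / dim] * dim                      # uniform start, as in step S110
    for _ in range(iters):
        grad = [0.0] * dim
        for z in z_vectors:
            score = sum(wj * zj for wj, zj in zip(w, z))
            # d/dw log(1 + exp(-score)) = -z / (1 + exp(score))
            coeff = -1.0 / (1.0 + math.exp(score))
            for j in range(dim):
                grad[j] += coeff * z[j]
        # Gradient step, then soft-threshold by step*lam and clip at zero.
        w = [max(0.0, wj - step * gj - step * lam) for wj, gj in zip(w, grad)]
    return w


# Toy margin vectors: attribute 0 is consistently discriminative.
z_vectors = [[0.6, 0.0, -0.1], [0.5, 0.1, 0.0], [0.7, -0.1, 0.1]]
w = l1_logistic_weights(z_vectors)
print(w)
```

In this toy run the L1 term drives the two uninformative weights to exactly zero while the informative one grows, which illustrates the sparsity the paragraph above refers to.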
In this embodiment, $\alpha_i$ and $\beta_i$ can be obtained by solving the following minimization models, respectively:
where $w^t$ denotes the attribute weight vector at iteration $t$.
S112: If $\|w^{t+1} - w^t\| \le \theta$, where $\theta$ is the allowable error, output the weight vector and set $w = w^{t+1}$; otherwise set $t = t + 1$ and return to step S111.
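The control flow of steps S110 to S112 can be sketched as a generic loop. The sketch assumes a Euclidean norm in the stopping test (the source writes the norm without a subscript) and takes the per-iteration update as a caller-supplied function; `iterate_until_stable` and the toy update are hypothetical. The defaults mirror the values used in the worked example later in the text (T = 38, θ = 0.01).

```python
import math


def iterate_until_stable(update, w0, theta=0.01, max_iters=38):
    """Repeat w <- update(w) until ||w_next - w|| <= theta or max_iters is hit.

    Returns the final weight vector and the number of updates performed.
    Assumption: Euclidean norm for the stopping test.
    """
    w = list(w0)
    for t in range(max_iters):
        w_next = update(w)
        if math.dist(w_next, w) <= theta:
            return w_next, t + 1
        w = w_next
    return w, max_iters


# Stand-in update that moves halfway toward a fixed target each time; in the
# method itself this would be one solve of the minimization model (step S111).
target = [1.0, 0.0]


def toy_update(w):
    return [(wj + tj) / 2 for wj, tj in zip(w, target)]


w_final, n_updates = iterate_until_stable(toy_update, [0.5, 0.5])
print(w_final, n_updates)
```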
S12: Rank the attribute features by the values of the resulting weight vector, and perform classification training on the training gene data with the ranked features to obtain the optimal feature set used as the basis for classification, so as to classify gene data to be classified.
In this step, the attribute features are ranked by the values of the weight vector obtained from the iterative computation in the previous step.
In a specific implementation, ten-fold cross-validation can be performed on the training gene data according to the ranked attribute features, and the attribute subset that gives the best classification performance is selected as the optimal feature set. A KNN classifier can be used for the classification training on the training samples.
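One way to realize step S12 is sketched below: rank attributes by weight, then evaluate growing prefixes of the ranking with cross-validated KNN and keep the best prefix. The prefix search, the index-modulo fold assignment without shuffling, k = 1 in KNN, and all function names are assumptions; the source only states that ten-fold cross-validation with a KNN classifier selects the best attribute subset.

```python
import math


def knn_predict(train_x, train_y, x, k=1):
    """Majority vote among the k nearest training samples (Euclidean)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)


def cv_accuracy(samples, labels, features, folds=10, k=1):
    """Cross-validated KNN accuracy using only the given attribute indices.

    Assumption: sample i is assigned to fold i % folds.
    """
    projected = [[s[f] for f in features] for s in samples]
    correct = 0
    for fold in range(folds):
        test_idx = [i for i in range(len(samples)) if i % folds == fold]
        train_idx = [i for i in range(len(samples)) if i % folds != fold]
        tx = [projected[i] for i in train_idx]
        ty = [labels[i] for i in train_idx]
        correct += sum(
            knn_predict(tx, ty, projected[i], k) == labels[i] for i in test_idx
        )
    return correct / len(samples)


def select_optimal_features(samples, labels, weights, folds=10, k=1):
    """Try the top-1, top-2, ... ranked attributes; keep the best prefix."""
    ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    best_m, best_acc = 1, -1.0
    for m in range(1, len(ranked) + 1):
        acc = cv_accuracy(samples, labels, ranked[:m], folds, k)
        if acc > best_acc:          # strict: ties keep the smaller subset
            best_m, best_acc = m, acc
    return ranked[:best_m], best_acc


# Toy data: attribute 0 separates the classes, attribute 1 is noise.
samples = [[0.10 + 0.01 * i, (i * 7 % 10) / 10] for i in range(10)]
samples += [[0.90 - 0.01 * i, (i * 3 % 10) / 10] for i in range(10)]
labels = [1] * 10 + [2] * 10
weights = [0.8, 0.05]               # assumed output of the weighting step

selected, accuracy = select_optimal_features(samples, labels, weights)
print(selected, accuracy)
```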
The above describes how the gene classification method provided by the invention trains a model from the gene data training samples. The following describes how the input gene data to be classified is classified.
In the gene classification method of this embodiment, after the optimal feature set is obtained, the method further includes step S13: performing feature selection on the training gene data according to the optimal feature set to obtain a new training sample set.
In this step, the obtained optimal feature set is used to perform feature selection on the gene data in the training sample set, yielding a new training sample set.
Referring to Fig. 2, classifying a gene data sample to be classified specifically includes the following process:
S20: Standardize the gene data to be classified.
The input gene data to be classified is denoted $x$, $x \in \mathbb{R}^I$.
In a specific implementation, the standardization of the gene data to be classified may be min-max standardization or standard-deviation standardization, but is not limited to these; other standardization methods may also be used and likewise fall within the scope of protection of the invention.
In one specific implementation of this embodiment, min-max standardization is applied to the input gene data to be classified, with the following transfer function:
$$x'_{j} = \frac{x_{j} - \min_{1 \le i \le N} x_{ij}}{\max_{1 \le i \le N} x_{ij} - \min_{1 \le i \le N} x_{ij}}$$
where $x_j$ is the value of attribute $j$ of the gene data to be classified, $\max_{1 \le i \le N} x_{ij}$ is the maximum of attribute $j$ over all training samples, and $\min_{1 \le i \le N} x_{ij}$ is the minimum of attribute $j$ over all training samples.
S21: Perform feature selection on the standardized gene data to be classified according to the optimal feature set.
According to the obtained optimal feature set, feature selection is performed on the gene data $x$ to be classified, giving the gene data $x'$.
S22: For the gene data to be classified after feature selection, find its nearest-neighbor sample in the new training sample set, and predict the class of the gene data to be classified from the class of that nearest-neighbor sample.
For the gene data $x'$ after feature selection, its nearest neighbor is found in the new training sample set, and the class of the gene data to be classified is then predicted from the class of the nearest-neighbor sample obtained.
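Steps S20 to S22 can be sketched end to end as follows. The sketch assumes min-max scaling with the training minima and maxima, Euclidean distance for the nearest-neighbor search, and a training set that has already been scaled and reduced to the selected attributes; the function name `classify` and the toy numbers are hypothetical.

```python
import math


def classify(sample, train_x, train_y, mins, maxs, features):
    """Standardize, select features, then return the label of the nearest
    training sample (1-NN)."""
    # S20: min-max scaling with training statistics (constant attributes -> 0.0).
    scaled = [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, lo, hi in zip(sample, mins, maxs)
    ]
    # S21: keep only the attributes in the optimal feature set.
    reduced = [scaled[f] for f in features]
    # S22: nearest neighbor in the reduced training set.
    nearest = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], reduced))
    return train_y[nearest]


mins, maxs = [0.0, 0.0, 0.0], [10.0, 10.0, 10.0]   # assumed training statistics
features = [0, 2]                                  # assumed optimal feature set
train_x = [[0.1, 0.2], [0.9, 0.8]]                 # already scaled and reduced
train_y = [1, 2]

predicted = classify([8.0, 1.0, 9.0], train_x, train_y, mins, maxs, features)
print(predicted)
```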
The gene classification method of the invention is described in detail below with a specific example.
The purpose of this example is to distinguish two different types of leukemia, namely acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The data set provided is divided into two subsets: 38 training samples (27 ALL, 11 AML), used to select genes and adjust the classifier weights; and 34 test samples (20 ALL, 14 AML), used to evaluate the performance of the results. Each sample has 7129 features, corresponding to normalized gene expression values extracted from microarray images. ALL is treated as class 1 and AML as class 2. The implementation process is as follows:
The model training process includes the following steps:
(a) The input training sample set of gene expression data is denoted $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^I$ and $y_i \in \{1, 2, \ldots, C\}$ is the label of sample $x_i$, indicating its class; $N$ is the total number of training samples, and $I$ is the dimension of the gene data, meaning that the gene data contains $I$ attributes. Each class represents one disease; this example includes the two classes ALL and AML, with $N = 38$ and $I = 7129$.
(b) Apply min-max standardization to the gene data in the training sample set, with the following transfer function:
$$x'_{ij} = \frac{x_{ij} - \min_{1 \le i \le N} x_{ij}}{\max_{1 \le i \le N} x_{ij} - \min_{1 \le i \le N} x_{ij}}$$
where $x_{ij}$ is the value of attribute $j$ of the $i$-th sample, $\max_{1 \le i \le N} x_{ij}$ is the maximum of attribute $j$ over all training samples, and $\min_{1 \le i \le N} x_{ij}$ is the minimum of attribute $j$ over all training samples.
(c) Compute the attribute weight vector.
(1) At $t = 0$, initialize $w^t = [w_1, w_2, \ldots, w_I]^T = [1/I, 1/I, \ldots, 1/I]^T$, where $w_j$ is the weight of attribute $j$. Set the number of iterations to $T = 38$ and the allowable error to $\theta = 0.01$.
(2) Solve the established minimization model, which is expressed as follows:
where $w^{t+1}$ denotes the attribute weight vector at iteration $t+1$; $z_i^{t+1} = |x_i - H_i^{NM}\alpha_i| - |x_i - H_i^{NH}\beta_i|$; $H_i^{NM} \in \mathbb{R}^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the different-class samples; $H_i^{NH} \in \mathbb{R}^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the same-class samples; $\alpha_i$ is the coefficient vector of the different-class samples with respect to $x_i$; $\beta_i$ is the coefficient vector of the same-class samples with respect to $x_i$; and $k$ is the number of neighbors, set a priori.
In this example, $k$ can be chosen from the set $\{2, 4, \ldots, 10\}$ by leave-one-out cross-validation.
(3) If $\|w^{t+1} - w^t\| \le \theta$, output the weight vector and set $w = w^{t+1}$; otherwise set $t = t + 1$ and return to step (2).
(d) Obtain the optimal feature set. Rank the attribute features by the values of the weight vector computed in the steps above, perform ten-fold cross-validation on the training sample set with a KNN classifier according to the ranked features, and select the attribute subset that gives the best classification performance as the optimal feature set F.
(e) Perform feature selection on the training sample set according to the optimal feature set F, obtaining the training sample set after feature selection.
The evaluation process includes the following steps:
(a) Input the gene data to be classified, $x \in \mathbb{R}^{7129}$.
(b) Apply min-max standardization to the gene data to be classified, with the following transfer function:
$$x'_{j} = \frac{x_{j} - \min_{1 \le i \le N} x_{ij}}{\max_{1 \le i \le N} x_{ij} - \min_{1 \le i \le N} x_{ij}}$$
(c) Perform feature selection on the gene data $x$ to be classified according to the optimal feature set F; the resulting gene data is denoted $x'$.
(d) For $x'$, find its nearest-neighbor sample in the new training sample set, and predict the class of the gene data $x$ to be classified from the class of that nearest-neighbor sample.
The method was used to classify the 34 test samples of dimension 7129. Table 1 compares its results with those of the traditional Relief algorithm and the LH-Relief algorithm on the same data set.
Table 1
Method                 Recognition rate (%)   Precision (%)   Recall (%)   F-measure (%)
This method            99.13                  99.14           98.75        98.97
Relief algorithm       74.35                  72.19           71.88        71.73
LH-Relief algorithm    76.09                  75.53           72.04        72.69
As the results above show, compared with the conventional methods, this method achieves a higher recognition rate and higher classification accuracy on gene data.
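For reference, the four kinds of measures reported in Table 1 can be computed from predicted and true labels as in the sketch below. It uses the standard binary definitions with one class treated as positive; how the source averaged its figures (over classes or over repeated runs) is not stated, and the labels here are made-up toy values, not the patent's data.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f_measure


# Toy labels (1 = ALL, 2 = AML); illustrative only.
y_true = [1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 1, 1, 2, 2, 2, 2, 1]
metrics = binary_metrics(y_true, y_pred, positive=1)
print(metrics)
```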
Correspondingly, referring to Fig. 3, an embodiment of the invention also provides a gene classification system, including:
a processing submodule 30, configured to standardize the input gene data used as training samples, where the gene data comprises several attributes;
a computation submodule 31, configured to build a logistic regression objective function that represents the expected margin difference over all training samples, using the difference between each training sample's margin to its different-class neighbor samples and its margin to its same-class neighbor samples together with the attribute weight vector, to form a minimization model by adding a norm constraint term on the attribute weight vector, and to compute the attribute weight vector iteratively;
a training submodule 32, configured to rank the attribute features by the values of the resulting weight vector and to perform classification training on the training gene data with the ranked features to obtain the optimal feature set used as the basis for classification, so as to classify gene data to be classified.
As can be seen from the above, the gene classification system provided by this embodiment is based on the margin-maximization principle. Because the attribute weight vector is solved from a logistic regression objective on the expected margin difference with a norm constraint term on the weight vector, the resulting weight vector is sparser, so gene classification can achieve higher accuracy.
In a specific implementation, the way each module of the gene classification system of this embodiment processes the gene data follows the detailed description in the above embodiments of the gene classification method and is not repeated here.
Correspondingly, an embodiment of the invention also provides a gene classification device, including:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the gene classification method described above when executing the computer program.
The gene classification device provided by this embodiment is based on the margin-maximization principle. Because the attribute weight vector is solved from a logistic regression objective on the expected margin difference with a norm constraint term on the weight vector, the resulting weight vector is sparser, so gene classification can achieve higher accuracy.
Correspondingly, an embodiment of the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the gene classification method described above.
When the computer program stored on the computer-readable storage medium provided by this embodiment is executed by a processor, it solves the attribute weight vector, based on the margin-maximization principle, from a logistic regression objective on the expected margin difference with a norm constraint term on the weight vector. The resulting weight vector is sparser, so gene classification can achieve higher accuracy.
The gene classification method and related devices provided by the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention; the description of the above embodiments is only intended to help in understanding the method of the invention and its core idea. It should be noted that a person of ordinary skill in the art can make several improvements and modifications to the invention without departing from its principle, and these improvements and modifications also fall within the scope of protection of the claims of the invention.

Claims (10)

1. A gene classification method, characterized by comprising:
standardizing the input gene data serving as training samples, the gene data comprising a plurality of attributes;
establishing, from the difference between the distance from each training sample to its different-class neighbor samples and the distance from that training sample to its same-class neighbor samples, together with the weight vector over the attributes, a logistic regression objective function expressing the expected margin difference of all training samples; establishing a minimization model with the attribute weight vector as a norm constraint term; and computing the attribute weight vector by iterative calculation;
ranking the attribute features according to the values of the obtained weight vector, and performing classification training on the gene data serving as training samples according to the ranked attribute features to obtain an optimal feature set serving as the basis for classification, so as to classify gene data to be classified.
2. The gene classification method according to claim 1, characterized in that standardizing the input gene data serving as training samples comprises: applying min-max (deviation) normalization or standard-deviation normalization to the input gene data serving as training samples.
3. The gene classification method according to claim 1, characterized in that min-max (deviation) normalization is applied to the input gene data serving as training samples, with the following transformation function:
$$x'_{ij} = \frac{x_{ij} - \min_{1 \le i \le N} x_{ij}}{\max_{1 \le i \le N} x_{ij} - \min_{1 \le i \le N} x_{ij}}$$
where the input training sample set is denoted $\{(x_i, y_i)\}_{i=1}^{N}$, $x_i \in \mathbb{R}^{I}$, and $y_i \in \{1, 2, \ldots, C\}$ is the label of sample $x_i$, indicating its class; $N$ is the total number of training samples and $I$ is the dimension of the gene data; $x_{ij}$ is the value of attribute $j$ of the $i$-th sample; $\max_{1 \le i \le N} x_{ij}$ is the maximum of attribute $j$ over all training samples; and $\min_{1 \le i \le N} x_{ij}$ is the minimum of attribute $j$ over all training samples.
4. The gene classification method according to claim 1, characterized in that the established minimization model is expressed as follows:
[formula not reproduced]
where $w^{t+1}$ is the attribute weight vector at the $(t+1)$-th iteration; $z_i^{t+1} = |x_i - H_i^{NM}\alpha_i| - |x_i - H_i^{NH}\beta_i|$; $H_i^{NM} \in \mathbb{R}^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the different-class samples; $H_i^{NH} \in \mathbb{R}^{I \times k}$ is the matrix of neighbor samples of $x_i$ among the same-class samples; $\alpha_i$ is the coefficient vector of the different-class samples with respect to $x_i$; $\beta_i$ is the coefficient vector of the same-class samples with respect to $x_i$; $k$ is the preset number of neighbors; and $T$ is the preset number of iterations. The input training sample set is denoted $\{(x_i, y_i)\}_{i=1}^{N}$, $x_i \in \mathbb{R}^{I}$, and $y_i \in \{1, 2, \ldots, C\}$ is the label of sample $x_i$, indicating its class; $N$ is the total number of training samples and $I$ is the dimension of the gene data.
5. The gene classification method according to claim 4, characterized in that $\alpha_i$ and $\beta_i$ are obtained by solving the following minimization models, respectively:
[formula not reproduced]
where $w^{t}$ is the attribute weight vector at the $t$-th iteration.
6. The gene classification method according to claim 1, characterized in that performing classification training on the gene data serving as training samples according to the ranked attribute features to obtain the optimal feature set serving as the basis for classification comprises:
performing ten-fold cross-validation on the gene data serving as training samples according to the ranked attribute features, and selecting the attribute subset that gives the best classification performance to form the optimal feature set.
7. The gene classification method according to claim 1, characterized by further comprising: performing feature selection on the gene data serving as training samples according to the optimal feature set, to obtain a new training sample set;
wherein classifying the gene data to be classified comprises:
standardizing the gene data to be classified;
performing feature selection on the standardized gene data to be classified according to the optimal feature set;
for the gene data to be classified after feature selection, finding its nearest-neighbor sample in the new training sample set, and predicting the class of the gene data to be classified from the class of that nearest-neighbor sample.
8. A gene classification system, characterized by comprising:
a processing submodule, configured to standardize the input gene data serving as training samples, the gene data comprising a plurality of attributes;
a computation submodule, configured to establish, from the difference between the distance from each training sample to its different-class neighbor samples and the distance from that training sample to its same-class neighbor samples, together with the weight vector over the attributes, a logistic regression objective function expressing the expected margin difference of all training samples, to establish a minimization model with the attribute weight vector as a norm constraint term, and to compute the attribute weight vector by iterative calculation;
a training submodule, configured to rank the attribute features according to the values of the obtained weight vector, and to perform classification training on the gene data serving as training samples according to the ranked attribute features to obtain an optimal feature set serving as the basis for classification, so as to classify gene data to be classified.
9. A gene classification device, characterized by comprising:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the gene classification method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the gene classification method according to any one of claims 1 to 7.
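The pre-processing, ranking, subset selection, and prediction steps recited in the claims above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the toy data, the example weights, the evaluation of nested prefixes of the ranking, the interleaved fold assignment, and the reuse of training minima and maxima for new samples are all hypothetical choices.

```python
def fit_min_max(train):
    """Per-attribute minimum and maximum over the training samples."""
    cols = list(zip(*train))
    return [min(c) for c in cols], [max(c) for c in cols]

def min_max_scale(x, mins, maxs):
    """Min-max normalization; constant attributes map to 0.0 (an assumption)."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(x, mins, maxs)]

def rank_features(weights):
    """Attribute indices sorted by descending weight."""
    return sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)

def select(x, feature_ids):
    """Keep only the chosen attributes of one sample."""
    return [x[j] for j in feature_ids]

def predict_1nn(x, train_x, train_y):
    """Label of the nearest training sample (squared Euclidean distance)."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, train_x[i])))
    return train_y[best]

def cv_accuracy(xs, ys, feature_ids, folds=10):
    """k-fold cross-validated 1-NN accuracy restricted to feature_ids."""
    n = len(xs)
    folds = min(folds, n)  # fall back to leave-one-out on tiny data sets
    correct = 0
    for f in range(folds):
        train_idx = [i for i in range(n) if i % folds != f]
        tx = [select(xs[i], feature_ids) for i in train_idx]
        ty = [ys[i] for i in train_idx]
        for i in range(f, n, folds):  # held-out samples of fold f
            correct += predict_1nn(select(xs[i], feature_ids), tx, ty) == ys[i]
    return correct / n

def best_feature_prefix(xs, ys, ranked, folds=10):
    """Shortest prefix of the ranking with the highest CV accuracy."""
    scores = [cv_accuracy(xs, ys, ranked[:m], folds)
              for m in range(1, len(ranked) + 1)]
    return ranked[:scores.index(max(scores)) + 1]

# Hypothetical toy data: 4 samples, 3 attributes, 2 classes.
train = [[1.0, 50.0, 3.0], [2.0, 10.0, 3.5], [9.0, 40.0, 3.2], [10.0, 20.0, 3.1]]
labels = [1, 1, 2, 2]
weights = [0.9, 0.0, 0.1]  # stand-in for a learned attribute weight vector

mins, maxs = fit_min_max(train)
scaled = [min_max_scale(x, mins, maxs) for x in train]
ranked = rank_features(weights)
best = best_feature_prefix(scaled, labels, ranked)
reduced = [select(x, best) for x in scaled]
query = select(min_max_scale([8.0, 45.0, 3.0], mins, maxs), best)
pred = predict_1nn(query, reduced, labels)
```

Scaling the query with the training minima and maxima keeps new samples on the same scale as the reduced training set, which the nearest-neighbor comparison relies on.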

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810522807.XA CN108763873A (en) 2018-05-28 2018-05-28 A kind of gene sorting method and relevant device

Publications (1)

Publication Number Publication Date
CN108763873A true CN108763873A (en) 2018-11-06

Family

ID=64002900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810522807.XA Pending CN108763873A (en) 2018-05-28 2018-05-28 A kind of gene sorting method and relevant device

Country Status (1)

Country Link
CN (1) CN108763873A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670552A (en) * 2018-12-24 2019-04-23 Suzhou University Image classification method, apparatus, device, and readable storage medium
CN113971604A (en) * 2020-07-22 2022-01-25 China Mobile (Suzhou) Software Technology Co., Ltd. Data processing method, device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009067655A2 (en) * 2007-11-21 2009-05-28 University Of Florida Research Foundation, Inc. Methods of feature selection through local learning; breast and prostate cancer prognostic markers
CN101923604A (en) * 2010-07-23 2010-12-22 Fujian Normal University Classification method for weighted KNN oncogene expression profiles based on neighborhood rough set
CN104598774A (en) * 2015-02-04 2015-05-06 Henan Normal University Feature gene selection method based on logistic and relevant information entropy
CN105938523A (en) * 2016-03-31 2016-09-14 Shaanxi Normal University Feature selection method and application based on feature identification degree and independence
CN106991296A (en) * 2017-04-01 2017-07-28 Dalian University of Technology Ensemble classification method based on randomized greedy feature selection
CN107193993A (en) * 2017-06-06 2017-09-22 Suzhou University Medical data classification method and device based on local-learning feature weight selection
CN107563435A (en) * 2017-08-30 2018-01-09 Harbin Institute of Technology Shenzhen Graduate School High-dimensional imbalanced data classification method based on SVM

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Hongmin Cai et al.: "Feature weight estimation for gene selection: a local hyperlinear learning approach", BMC Bioinformatics *
Yijun Sun et al.: "A Local-Learning-Based Feature Selection for High-Dimensional Data Analysis", IEEE Transactions on Pattern Analysis and Machine Intelligence *
Pan Wei et al.: "Research on a feature selection method based on margin loss and L1-norm regularization", Intelligent Computer and Applications *

Similar Documents

Publication Publication Date Title
Brancati et al. A deep learning approach for breast invasive ductal carcinoma detection and lymphoma multi-classification in histological images
Derrac et al. Fuzzy nearest neighbor algorithms: Taxonomy, experimental analysis and prospects
US7088854B2 (en) Method and apparatus for generating special-purpose image analysis algorithms
CN114730463A (en) Multi-instance learner for tissue image classification
Argyriou et al. An algorithm for transfer learning in a heterogeneous environment
US20180165413A1 (en) Gene expression data classification method and classification system
Javed et al. Multiplex cellular communities in multi-gigapixel colorectal cancer histology images for tissue phenotyping
CN106991430A (en) A kind of cluster number based on point of proximity method automatically determines Spectral Clustering
CN108629373A (en) A kind of image classification method, system, equipment and computer readable storage medium
CN112800927B (en) Butterfly image fine-granularity identification method based on AM-Softmax loss
Intrator Making a low-dimensional representation suitable for diverse tasks
Qin et al. Spot detection and image segmentation in DNA microarray data
Kumar et al. An amalgam method efficient for finding of cancer gene using CSC from micro array data
CN108763873A (en) A kind of gene sorting method and relevant device
Menaka et al. Chromenet: A CNN architecture with comparison of optimizers for classification of human chromosome images
Yang et al. High throughput analysis of breast cancer specimens on the grid
Vengatesan et al. The performance analysis of microarray data using occurrence clustering
Krishnapuram et al. Joint classifier and feature optimization for cancer diagnosis using gene expression data
Arora Classification of human metaspread images using convolutional neural networks
Weber et al. Perron cluster analysis and its connection to graph partitioning for noisy data
Rathore et al. CBISC: a novel approach for colon biopsy image segmentation and classification
Yao et al. Augdmc: Data augmentation guided deep multiple clustering
CN116612307A (en) Solanaceae disease grade identification method based on transfer learning
CN113177602B (en) Image classification method, device, electronic equipment and storage medium
CN113838519B (en) Gene selection method and system based on adaptive gene interaction regularization elastic network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181106
