AU2018100796A4 - A genetic feature identifying system and a search method for identifying features of genetic information
Abstract
A genetic feature identifying system and a search method for identifying features of genetic information. The method comprises the step of processing the genetic information using a combined global search process and local search process, wherein the global search process includes a population-based optimization process arranged to determine a global optima of a search population associated with the genetic information, and wherein the local search process includes a machine-learning optimization process arranged to further optimize the search population so as to determine at least one signature feature associated with the genetic information.
Description
TECHNICAL FIELD
The present invention relates to a genetic feature identifying system and a search method for identifying features of genetic information, and particularly, although not exclusively, to a feature selection process combined with a machine-learning optimization process using a memetic framework.
BACKGROUND
Genetic information processing may be useful for identifying or predicting diseases of patients using statistical analysis. In general, the genetic information may be embedded with key features representing or associating with an existence or a probability of the existence of certain diseases or health-related issues.
Explosive growth of data urgently requires the development of new technologies and automation tools that can intelligently help translate large amounts of data into useful information and knowledge. Often, not all of the features in the data are essential. Therefore, it is necessary to select only a small portion of the relevant features from the original large data set for further processing or analysis.
SUMMARY OF THE INVENTION
In accordance with a first aspect of the present invention, there is provided a search method for identifying features of genetic information, comprising the step of processing the genetic information using a combined global search process and local search process, wherein the global search process includes a population-based optimization process arranged to determine a global optima of a search population associated with the genetic information, and wherein the local search process includes a machine-learning optimization process arranged to further optimize the search population so as to determine at least one signature feature associated with the genetic information.
In an embodiment of the first aspect, the global search process includes a biological evolution process involving at least one genetic operator applied to the search population.
In an embodiment of the first aspect, the global search process includes an optimization of a plurality of control parameters for a regularization process.
In an embodiment of the first aspect, the regularization process includes a hybrid L1/2 + L2 regularization (HLR) process.
In an embodiment of the first aspect, the method further comprises the step of encoding and representing the genetic information in an intron part and an exon part.
In an embodiment of the first aspect, the intron part is associated with penalized control parameters for the
regularization process and the exon part is associated with coefficients used in the machine-learning optimization process.
In an embodiment of the first aspect, the global search process includes a wrapper feature selection process arranged to induce the search population and to perform a heuristic searching process to globally optimize the plurality of control parameters for the regularization process.
In an embodiment of the first aspect, the local search process includes an embedded feature selection process arranged to optimize the search population by selecting the signature feature and to construct a learning model for the machine-learning optimization process.
In an embodiment of the first aspect, the learning model is constructed based on an efficient gradient regularization process.
In an embodiment of the first aspect, the combined global search process and local search process is based on a memetic framework arranged to facilitate an integration of the determination of the signature features and the machine-learning optimization process.
In accordance with a second aspect of the present invention, there is provided a genetic feature identifying system for identifying features of genetic information, comprising a global search module and a local search module arranged to process the genetic information using a
global search process and a local search process respectively, wherein the global search process includes a population-based optimization process arranged to determine a global optima of a search population associated with the genetic information, and wherein the local search process includes a machine-learning optimization process arranged to further optimize the search population so as to determine at least one signature feature associated with the genetic information.
In an embodiment of the second aspect, the global search process includes a biological evolution process involving at least one genetic operator applied to the search population.
In an embodiment of the second aspect, the global search module is arranged to optimize a plurality of control parameters for a regularization process.
In an embodiment of the second aspect, the regularization process includes a hybrid L1/2 + L2 regularization (HLR) process.
In an embodiment of the second aspect, the genetic feature identifying system further comprises a genetic information encoder arranged to encode and represent the genetic information in an intron part and an exon part.
In an embodiment of the second aspect, the intron part is associated with penalized control parameters for the regularization process and the exon part is associated with coefficients used in the machine-learning optimization process.
In an embodiment of the second aspect, the global search module is arranged to perform a wrapper feature selection process so as to induce the search population and to perform a heuristic searching process to globally optimize the plurality of control parameters for the regularization process.
In an embodiment of the second aspect, the local search module is arranged to perform an embedded feature selection process so as to optimize the search population by selecting the signature feature and to construct a learning model for the machine-learning optimization process.
In an embodiment of the second aspect, the local search module is further arranged to construct the learning model based on an efficient gradient regularization process.
In an embodiment of the second aspect, a combination of the global search module and the local search module is arranged to facilitate an integration of the determination of the signature features and the machine-learning optimization process based on a memetic framework.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings in which:
Figure 1 is a schematic diagram of a computing server for operation as a genetic feature identifying system in accordance with one embodiment of the present invention;
Figure 2 is a schematic diagram of an embodiment of the genetic feature identifying system in accordance with one embodiment of the present invention; and
Figure 3 is an illustration of a procedure of a crossover operation.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
The inventors have, through their own research, trials and experiments, devised that various methods may be used for feature selection, such as filter, wrapper, embedded and combined methods. For filter methods, different feature selection measures may be applied to rank individual features, such as but not limited to information theory; consistency measures; dependency (or correlation) measures; distance measures; rough set theory and fuzzy set theory.
The filter methods examine each feature independently, and ignore the individual performance of the feature in relation to the group of which it is a part, despite the fact that features in a group may have a combined effect in a machine learning task.
For wrapper methods, machine learning algorithms may be used to evaluate the performance of selected feature subsets; these may include support vector machines (SVMs); K-nearest neighbor (KNN); artificial neural networks (ANNs); decision tree (DT); Naive Bayes (NB); multiple linear regression for classification; extreme learning machines (ELMs); and linear discriminant analysis (LDA). The results of the wrapper methods may be superior to those of the filter methods, but the computational cost of the wrapper methods may be higher.
The embedded methods integrate feature selection and learning procedure into a single process. For example, regularization methods may be an embedded technique which performs both learning model construction and automatic feature selection simultaneously.
Preferably, regularization methods for feature selection may be applied to high-dimensional feature selection problems such as gene expression microarray data; Lasso (L1), smoothly clipped absolute deviation (SCAD), minimax concave penalty (MCP), and L1/2 regularization may also be used.
In gene expression analysis, if genes share the same biological pathway, they may be highly correlated and grouped. Therefore, some methods may be applied to deal with issues of high relevance and grouping features, such as group Lasso, Elastic net, SCAD-L2, and hybrid L1/2 + L2 regularization (HLR).
Given that each feature evaluation measure has its own advantages and disadvantages, preferably, the combined method means that the evaluation procedure includes different types of feature selection measures such as filter and wrapper.
Without wishing to be bound by theory, Evolutionary Computations (EC) methods may be combined with feature selection methods because of their global optimization capabilities. Based on the relevant evaluation criteria, the EC process of feature selection may also be divided into four categories, similar to the categorization mentioned above.
For example, these methods may include genetic algorithm (GA), genetic programming (GP), particle swarm optimization (PSO), ant colony optimization (ACO), differential evolution (DE), evolutionary strategy (ES), estimated distribution algorithm (EDA) and memetic algorithm (MA).
In combined methods, memetic-based feature selection methods may combine wrapper and filter methods, and provide an opportunity for population-based optimization with local search. For example, GAs may be applied for wrapper feature selection and Markov blanket approach may be used as a local search for filter feature selection. However, such two-stage approaches may have the potential limitation that filter evaluation measures may eliminate potentially useful features regardless of their performance in the wrapper approaches. In addition, the wrapper approaches may involve a large number of assessments, and each assessment may take a considerable amount of time, especially when the numbers of features and instances are large. The second limitation of the combined feature selection methods is that they are primarily concerned with the relatively small numbers of features and instances.
Feature interaction (or grouping effect) presents another difficulty in feature selection. On one hand, a feature, which is weakly relevant to the target, could end up significantly improving the accuracy of the learning model when used together with some complementary features. On the other hand, an individually relevant feature can become redundant when used together with other features. Feature interaction occurs frequently in many areas. The third limitation of some combined feature selection methods is that filter measures, which evaluate features individually, do not work well, and a subset of relevant or grouping features needs to be evaluated as a whole.
With reference to Figure 1, an embodiment of the present invention is illustrated. This embodiment is arranged to provide a genetic feature identifying system for identifying features of genetic information, and the system comprises a global search module and a local search module arranged to process the genetic information using a global search process and a local search process respectively, wherein the global search process includes a population-based optimization process arranged to determine a global optima of a search population associated with the genetic information, and wherein the local search process includes a machine-learning optimization process arranged to further optimize the search population so as to determine at least one signature feature associated with the genetic information.
Preferably, the genetic feature identifying system may be used to solve the abovementioned limitations of the combined feature selection approaches, and may be used to perform a combined wrapper-embedded feature selection
approach (WEFSA). For example, the system may use a memetic framework to combine genetic algorithm (global search) and embedded regularization approaches (local search), so as to determine the signature features associated with the genetic information, such as DNA or genes being analysed. Therefore, the combined global search process and local search process is based on a memetic framework arranged to facilitate an integration of the determination of the signature features and the machine-learning optimization process.
In this embodiment, the global search module and the local search module are implemented by or for operation on a computer having an appropriate user interface. The computer may be implemented by any computing architecture, including stand-alone PC, client/server architecture, dumb terminal/mainframe architecture, or any other appropriate architecture. The computing device is appropriately programmed to implement the invention.
Referring to Figure 1, there is shown a schematic diagram of a computer or a computing server 100 which in this embodiment comprises a server 100 arranged to operate, at least in part if not entirely, the genetic feature identifying system in accordance with one embodiment of the invention. The server 100 comprises suitable components necessary to receive, store and execute appropriate computer instructions. The components may include a processing unit 102, read-only memory (ROM)
104, random access memory (RAM) 106, and input/output devices such as disk drives 108, input devices 110 such as an Ethernet port, a USB port, etc., a display 112 such as a liquid crystal display, a light emitting display or any other suitable display, and communications links 114. The server 100 includes instructions that may be included in ROM 104, RAM 106 or disk drives 108 and may be executed by the processing unit 102. There may be provided a plurality of communication links 114 which may variously connect to one or more computing devices such as a server, personal computers, terminals, or wireless or handheld computing devices. At least one of the plurality of communications links may be connected to an external computing network through a telephone line or other type of communications link.
The server may include storage devices such as a disk drive 108 which may encompass solid state drives, hard disk drives, optical drives or magnetic tape drives. The server 100 may use a single disk drive or multiple disk drives. The server 100 may also have a suitable operating system 116 which resides on the disk drive or in the ROM of the server 100.
The system has a database 120 residing on a disk or other storage device which is arranged to store at least one record 122. The database 120 is in communication with the server 100 with an interface, which is implemented by computer software residing on the server 100. Alternatively, the database 120 may also be implemented as a stand-alone database system in communication with the server 100 via an external computing network, or other types of communication links.
With reference to Figure 2, there is shown an embodiment of the genetic feature identifying system 200. In this embodiment, the server 100 is used as part of a genetic feature identifying system 200 as a server or a processor arranged to analyze the genetic information such as DNA or genes of a biological specimen or sample. In one example, the system 200 may select the relevant features of bio-marker genes, which may be useful for predicting the patients' class or particular diseases.
Preferably, the global search process includes an optimization of a plurality of control parameters for a regularization process. Regularization methods may be considered an important embedded technique that performs both model learning and automatic feature selection simultaneously, focusing on high-dimensional feature selection problems such as relevant gene selection in microarray data. Example regularization methods include Lasso, SCAD, MCP and L1/2. Since Lasso is a convex penalty function, the gradient-based coordinate descent algorithm is suitable and may be used for the global optimization of Lasso.
In response to the problem of highly correlated and grouped features, Elastic net, SCAD-L2, and hybrid L1/2 + L2 regularization may be used, and a more complex harmonic regularization approach (CHR) may be used for uncertain probability distributions of data. Alternatively, a self-paced curriculum learning (SPLC) regularization approach may be used, which significantly improves the learning efficiency when the number of instances is large. Regularization approaches are one-stage feature evaluation measures, which are suitable for complex feature selection problems with high interaction and large scales of features and instances.
In regularization methods, the control parameter balancing the loss function and the penalty function may be important for their performance in feature selection. The feasible value of the control parameter is generally tuned by the grid search method with a k-fold cross-validation approach. Preferably, some efficient regularization methods may include non-convex and multimodal penalty functions. These regularization methods may search across multiple parameters, which are suitable to be optimized by EC approaches; for example, GA can deal with both unimodal and multimodal search spaces well, and the population-based search can find the global optima of these control parameters efficiently.
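For context, a minimal sketch of the conventional grid-search tuning baseline follows; the function names, the parameter grids, and the fit/score callables are illustrative assumptions, not part of the claimed method:

```python
import numpy as np

def grid_search_cv(X, y, fit, score, lam_grid, alpha_grid, k=10, seed=0):
    """Tune (lambda, alpha) by grid search with k-fold cross-validation.
    fit(X, y, lam, alpha) -> model and score(model, X, y) -> float are
    placeholder callables for the learning and evaluation routines."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    best, best_score = None, -np.inf
    for lam in lam_grid:
        for alpha in alpha_grid:
            fold_scores = []
            for i in range(k):
                test_idx = folds[i]
                train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
                model = fit(X[train_idx], y[train_idx], lam, alpha)
                fold_scores.append(score(model, X[test_idx], y[test_idx]))
            mean_score = float(np.mean(fold_scores))
            if mean_score > best_score:
                best, best_score = (lam, alpha), mean_score
    return best  # the (lambda, alpha) pair with the best cross-validated score
```

A GA replaces these exhaustive grids with a population that samples and recombines promising (λ, α) pairs, which is what the global search module does in the embodiments described below.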
In accordance with an embodiment of the present invention, there is provided a feature selection method performed by a global search module 202 and a local search module 204 arranged to process the input genetic information 208 using a global search process and a local search process respectively, using a combined wrapper-embedded feature selection approach (WEFSA) which may improve learning performance and accelerate the search to identify the relevant feature subsets. Preferably, the genetic feature identifying system 200 may fine-tune the population of GA solutions by selecting the signature feature 210, and construct the learning model based on an efficient gradient regularization process.
Preferably, the global search module 202 is arranged to perform a wrapper feature selection process so as to induce the search population and to perform a heuristic searching process to globally optimize the plurality of control parameters for the regularization process.
The wrapper methods induce the population of GA solutions, using heuristic search strategies to globally optimize the control parameters for the non-convex regularization. Based on a memetic framework, the method in accordance with the present invention integrates feature selection and learning model construction into a single process under the global optimization of the non-convex regularization.
Any efficient regularization approach can serve as the embedded method in WEFSA, and preferably, the regularization process includes a hybrid L1/2 + L2 regularization (HLR) process arranged to optimize a plurality of control parameters for the regularization process.
As discussed above, regularization may be considered an important embedded feature selection approach in some example embodiments. Suppose X denotes the n × p data matrix whose rows are X_i = (x_{i1}, x_{i2}, ..., x_{ip}), 1 ≤ i ≤ n, and y = (y_1, ..., y_n)^T denotes the corresponding dependent variable.

For any control parameter λ (λ > 0), the common form of regularization is:

L(λ, β) = argmin { R(β) + λP(β) }    (1)

where β = (β_1, ..., β_p) are the estimated coefficients, R(β) is a loss function and P(β) represents the regularization term. The most commonly used regularization method is the least absolute shrinkage and selection operator (Lasso, also the L1 penalty), i.e., P(β) = Σ_{j=1}^{p} |β_j|. It performs continuous shrinkage and gene selection at the same time.
Some other L1-norm type regularization methods may also be used. For example, the smoothly clipped absolute deviation (SCAD) penalty is symmetric, nonconvex, and can produce sparse solutions at the origin in the parameter space. The adaptive Lasso penalizes the different coefficients with dynamic weights in the L1 penalty. The minimax concave penalty (MCP) provides the convexity of the penalized loss in sparse regions to the greatest extent, given certain thresholds for feature selection and unbiasedness. However, for large-scale feature selection problems, such as genomic data analysis, the results of the L1 type regularization may not be sparse enough for real application.
Genetic information such as gene microarray or RNA-seq data sets may have many thousands of genes, and it may be desirable to select fewer but informative genes.
Although the L0 regularization, where P(β) = Σ_{j=1}^{p} |β_j|^0 counts the nonzero coefficients, yields the sparsest solution theoretically, it has to solve an NP-hard combinatorial optimization problem. In order to obtain a more concise solution and improve the predictive accuracy of the machine learning model, the inventors studied the Lp-norm (0 < p < 1), especially p = 1/10, 1/2, 2/3, or 9/10.
Preferably, the L1/2 regularization can be taken as a representative of the Lp (0 < p < 1) penalties, and its analytically expressive thresholding representation has been analyzed. Based on this thresholding representation, solving the L1/2 regularization may be much easier than solving the L0 regularization. Moreover, the L1/2 penalty is unbiased and has oracle properties. These advantages make the L1/2 penalty an effective tool for high-dimensional feature selection problems.
However, like most regularization methods, the L1/2 penalty ignores the correlation between features, and therefore cannot analyze data with dependent structures. If there is a set of features of which the correlations are relatively high, the L1/2 method tends to select only one feature to represent the corresponding group. In order to solve the problem of highly relevant features, the Elastic net penalty may be applied, which is a linear combination of the L1 and L2 (the ridge technique) penalties. Such a method emphasizes the grouping effect, where strongly correlated features tend to enter or leave the learning model together.
Alternatively, Elastic SCAD (SCAD-L2), a combination of SCAD and L2 penalties for feature interaction may be used.
The hybrid L1/2+L2 regularization (HLR) approach may fit the logistic regression models for gene selection, where the regularization is a linear combination of the L1/2 and L2 penalties. For any fixed control parameters λ1, λ2 (λ1, λ2 > 0), hybrid L1/2 + L2 regularization (HLR) is defined as follows:

L(λ1, λ2, β) = argmin { R(β) + λ1 |β|_{1/2} + λ2 |β|_2 }    (2)

where |β|_{1/2} = Σ_{j=1}^{p} |β_j|^{1/2} and |β|_2 = Σ_{j=1}^{p} β_j².

The HLR estimator β̂ is the minimizer of Eq. (3):

β̂ = argmin { L(λ, α, β) }    (3)

where λ = λ1 + λ2 and α = λ1 / (λ1 + λ2).
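For illustration, a minimal sketch of the HLR penalty of Eqs. (2)-(3) in Python follows; the function name and the use of NumPy are assumptions for illustration, not part of the claimed method:

```python
import numpy as np

def hlr_penalty(beta, lam, alpha):
    """Hybrid L1/2 + L2 penalty in the (lambda, alpha) form of Eq. (3):
    lam * (alpha * sum_j |beta_j|^(1/2) + (1 - alpha) * sum_j beta_j^2)."""
    l_half = np.sum(np.abs(beta) ** 0.5)  # L1/2 term, promotes sparsity
    l2 = np.sum(beta ** 2)                # L2 (ridge) term, induces grouping
    return lam * (alpha * l_half + (1.0 - alpha) * l2)
```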
A strictly convex penalty function may provide a sufficient condition for the grouping effect of features, and the L2 penalty ensures strict convexity. Therefore, the L2 penalty induces the grouping effect simultaneously in the HLR approach. Experimental results on artificial and real gene expression data obtained in some examples demonstrated the effectiveness of the HLR method.
However, some efficient regularization methods are nonconvex and need to be tuned across multiple penalized parameters, which are generally adjusted by the grid search method with a k-fold cross-validation approach. It is believed that the population-based search in EC is an efficient approach to globally optimize these penalized parameters.
Preferably, the global search process includes a biological evolution process involving at least one genetic operator applied to the search population. In Evolutionary Computations (EC), an initial population of candidate solutions is randomly generated in the search space and iteratively updated by artificial crossover, mutation and selection operators. After several generations, the population can gradually develop high quality solutions to the optimization problems.
In some preferable embodiments, local search (LS) technologies may be combined into the random search process of EC to improve the optimization efficiency. These hybrid algorithms may be referred to as memetic algorithms (MA). MAs for feature selection, which combine wrapper and filter feature evaluation measures, provide an opportunity for population-based optimization with local search.
In one example embodiment, the filter feature ranking method may be used in MA to balance the local and global searches for the purpose of improving the optimization quality and efficiency. Then, the Markov blanket approach may be integrated into MA to simultaneously identify all or part of the relevant features.
Alternatively, in another two-stage feature selection process, a Relief-F algorithm may be used to rank individual features, and then the top-ranked features are used as input to the memetic wrapper feature selection process.
Heuristic mixtures may be introduced which combine the filter ranking scores to guide the search processes of GA and PSO for wrapper feature selection. Moreover, MAs for feature selection have already been used to solve some real application problems, such as optimal controller design, and motif-finding in DNA, microRNA and protein sequences.
As is shown above, in memetic-based feature selection approaches, the EC stage is included for wrapper feature selection, and the filter-based LS algorithm may help to reach a local optimal solution. Some wrapper+filter two-stage memetic approaches do not guarantee that the selected features in the filter stage are also optimal candidates for the EC stage, since the evaluation criteria of each stage may be totally different. In some examples, the filter stage in MA may eliminate potentially useful features with no regard to their performance in the wrapper process.
As some example combined feature selection methods may have the limitations of inconsistency in feature evaluation measures, feature interactions, and large scales of features and instances, it may be more preferable that a combined wrapper-embedded feature selection approach (WEFSA) using a memetic framework with GA and hybrid L1/2 + L2 regularization (HLR) is provided.
Preferably, the genetic feature identifying system further comprises a genetic information encoder 206 arranged to encode and represent the genetic information in an intron part and an exon part. For example, the genetic information may be a chromosome representation including intron (the penalized control parameters) and exon (the coefficients of the features in the learning model) for memetic optimization procedure.
In the first step of WEFSA, the GA population is randomly initialized with each chromosome encoded by intron and exon parts. Subsequently, as the intron part is associated with the penalized control parameters for the regularization process, the hybrid L1/2 + L2 regularization approach (local search) is performed on the exon parts under the fixed intron parts, to reach a local optimal solution or to improve the fitness of individuals in the search population.
The global search process includes a biological evolution process involving at least one genetic operator applied to the search population. Genetic operators such as crossovers and mutations are performed on the intron parts of the chromosomes, and the selection operator generates the next population. This process repeats until the stopping conditions are satisfied.
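This overall loop may be sketched at a high level as follows; this is an illustrative Python outline, and all helper names (make_chromosome, crossover_and_mutate, roulette_select, hlr_coordinate_descent, fitness) are assumptions that are sketched in the passages below:

```python
def wefsa(X, y, pop_size, generations, rng):
    """A high-level sketch of the WEFSA memetic loop described above."""
    population = [make_chromosome(X.shape[1], rng) for _ in range(pop_size)]
    local_search(population, X, y)          # HLR fits each exon (local search)
    for _ in range(generations):
        offspring = crossover_and_mutate(population, rng)  # GA acts on introns
        local_search(offspring, X, y)       # offspring exons re-fitted by HLR
        population = roulette_select(population + offspring, X, y, pop_size, rng)
    return max(population, key=lambda c: fitness(c, X, y))

def local_search(chromosomes, X, y):
    """Embedded stage: run HLR coordinate descent under each fixed intron."""
    for chrom in chromosomes:
        lam, alpha = chrom["intron"]
        chrom["exon"] = hlr_coordinate_descent(X, y, lam, alpha)
```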
In an example embodiment of the wrapper-embedded feature selection process, a representation for the two penalized control parameters (λ, α) and the coefficients (β1, β2, ..., βp) of the candidate feature subset can be encoded by the genetic information encoder 206 as a chromosome: intron + exon = (λ, α, β1, β2, ..., βp). The length of the chromosome is denoted as p+2, where p is the total number of features. The chromosome is a real-valued string and its intron part is globally optimized by GA operators.
Although the search space of the intron part is non-convex and multimodal, GA has the global optimal ability because the dimension of the intron is quite low. On the other hand, the exon part may be optimized by the regularization approach for learning model construction and feature selection synchronously. In the exon part, a nonzero value of βj implies that the corresponding feature has been selected. In contrast, the candidate feature has been rejected if its corresponding coefficient βj is equal to zero. The maximum allowable number of nonzero βj in the exon of each chromosome is denoted as T. When prior knowledge about the optimal number of features is available, T may be limited to no more than the predefined value; otherwise T is equal to p.
The objective function may be defined by:
Fitness(chromosome) = Accuracy of the classification model with (λ, α, β1, β2, ..., βp)    (4)

where nonzero βj denotes the corresponding selected feature subset encoded in the exon part of the chromosome. The objective function evaluates the significance of the given feature subset. In this embodiment, the fitness of the objective function is specified as the classification accuracy of the logistic regularization model with the chromosome (λ, α, β1, ..., βp), using the hybrid L1/2 + L2 penalties method. Note that when two chromosomes are found to have similar fitness, i.e., the difference between their fitness is less than a small value ε (e.g. ε = 10⁻³ in some examples), then the one with a smaller number of selected features is given higher chances of surviving to the next generation.
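A minimal sketch of this chromosome encoding and fitness evaluation follows; the dictionary layout and sampling ranges are illustrative assumptions:

```python
import numpy as np

def make_chromosome(p, rng):
    """Intron: the two penalized control parameters (lambda, alpha);
    exon: the p feature coefficients, filled in by the local search."""
    lam = rng.uniform(0.01, 1.0)   # lambda > 0; the range is an assumption
    alpha = rng.uniform(0.0, 1.0)  # 0 <= alpha <= 1
    return {"intron": (lam, alpha), "exon": np.zeros(p)}

def fitness(chromosome, X, y):
    """Eq. (4): classification accuracy of the logistic model under the
    chromosome's coefficients; nonzero exon entries are the selected features."""
    prob = 1.0 / (1.0 + np.exp(-(X @ chromosome["exon"])))
    return float(np.mean((prob >= 0.5).astype(int) == y))
```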
The hybrid L1/2+L2 penalties method with the coordinate descent algorithm may be applied as the meme, or local search approach. The coordinate descent algorithm may be used as it is an efficient method for solving regularization problems, because its computational time increases linearly with the dimension of the feature selection problem. Preferably, WEFSA is capable of constructing the learning model and selecting the relevant features.
The hybrid L1/2+L2 regularization (HLR) in the logistic model may be formed as:

β̂ = argmin { R(β) + λP(β) }    (5)

where λ = λ1 + λ2, and R(β) is the loss function in logistic regression:

R(β) = (1/n) Σ_{i=1}^{n} { −y_i (X_i'β) + log(1 + exp(X_i'β)) }    (6)

Here, y_i ∈ {0, 1}, i = 1, ..., n, denotes the decision vector of binary values in the logistic model. P(β) is the HLR penalty function, defined as:

P(β) = α Σ_{j=1}^{p} |β_j|^{1/2} + (1 − α) Σ_{j=1}^{p} β_j²    (7)

where α = λ1 / (λ1 + λ2), and 0 ≤ α ≤ 1.
By using the approach of the original coordinate-wise update:

β_j ← Half(ω_j, λα) / (1 + λ(1 − α))    (8)

where 1 ≤ j ≤ p and

ω_j = Σ_{i=1}^{n} x_ij (y_i − ỹ_i^(j))    (9)

and the partial residual for fitting β_j is defined as:

ỹ_i^(j) = Σ_{k≠j} x_ik β_k    (10)

Additionally, Half(·) is the L1/2 thresholding operator; its coordinate-wise update form for the HLR approach is:

Half(ω, λ) = (2/3) ω (1 + cos(2(π − φ_λ(ω))/3))  if |ω| > (54^{1/3}/4) λ^{2/3}, and 0 otherwise    (11)

where φ_λ(ω) = arccos((λ/8)(|ω|/3)^{−3/2}) and π = 3.14159....
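A sketch of the Half(·) operator of Eq. (11) in Python is given below; the threshold constant follows the standard L1/2 half-thresholding formula from the literature and is an assumption here:

```python
import numpy as np

def half_threshold(omega, lam):
    """L1/2 half-thresholding operator Half(omega, lam) of Eq. (11)."""
    if abs(omega) <= (54 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0):
        return 0.0  # small coordinates are shrunk exactly to zero
    phi = np.arccos((lam / 8.0) * (abs(omega) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * omega * (1.0 + np.cos(2.0 * (np.pi - phi) / 3.0))
```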
Therefore, Eq. (5) can be linearized by a one-term Taylor series expansion:

L(λ, β) ≈ argmin { (1/n) Σ_{i=1}^{n} W_i (Z_i − X_i'β)² + λP(β) }    (12)

where Z_i is the estimated response and W_i is the weight for Z_i, which can be defined as follows:

Z_i = X_i'β + (y_i − f(X_i'β)) / (f(X_i'β)(1 − f(X_i'β)))    (13)

W_i = f(X_i'β)(1 − f(X_i'β))    (14)

where f(X_i'β) is the evaluated value under the current parameters:

f(X_i'β) = exp(X_i'β) / (1 + exp(X_i'β))    (15)

Thus, Eqs. (13) and (14) may be further redefined for fitting the current β_j as:

ỹ_i^(j) = Σ_{k≠j} x_ik β_k    (16)

ω_j = Σ_{i=1}^{n} W_i x_ij (Z_i − ỹ_i^(j))    (17)
The procedure of the coordinate descent algorithm for the HLR penalized logistic model is described as follows.
Algorithm 1 The coordinate descent algorithm for the HLR penalized logistic model
1: Initialize all β_j(m) ← 0 (j = 1, 2, ..., p), set m ← 0; λ and α are set by the GA;
2: if β(m) does not converge then
3: repeat
4: Calculate Z(m) and W(m) and approximate the loss function Eq. (12) based on the current β(m);
5: for j = 1 to p do
6: Compute ỹ_i^(j)(m) ← Σ_{k≠j} x_ik β_k(m) and ω_j(m) ← Σ_{i=1}^{n} W_i(m) x_ij (Z_i(m) − ỹ_i^(j)(m));
7: Update β_j(m) ← Half(ω_j(m), λα) / (1 + λ(1 − α));
8: end for
9: m ← m + 1, β(m + 1) ← β(m);
10: until there are no more features to be removed;
11: end if
12: return the optimal feature subset.
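A runnable sketch of Algorithm 1 in Python follows, reusing the half_threshold helper sketched above; it assumes standardized features, averages ω_j over n (the scaling convention is an assumption), and is illustrative rather than a definitive implementation:

```python
import numpy as np

def hlr_coordinate_descent(X, y, lam, alpha, max_iter=100, tol=1e-4):
    """Coordinate descent for the HLR penalized logistic model (Algorithm 1).
    lam and alpha would be supplied by the GA (the intron of a chromosome)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        eta = X @ beta                     # current linear predictor X_i' beta
        prob = 1.0 / (1.0 + np.exp(-eta))  # f(X_i' beta), Eq. (15)
        W = np.maximum(prob * (1.0 - prob), 1e-5)  # weights, Eq. (14)
        Z = eta + (y - prob) / W           # working response, Eq. (13)
        beta_old = beta.copy()
        for j in range(p):
            y_partial = eta - X[:, j] * beta[j]                  # Eq. (16)
            omega_j = np.sum(W * X[:, j] * (Z - y_partial)) / n  # Eq. (17), averaged
            beta[j] = half_threshold(omega_j, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            eta = y_partial + X[:, j] * beta[j]                  # refresh predictor
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta  # nonzero entries mark the selected features
```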
In the evolution process of WEFSA, standard GA operators such as fitness proportionate selection, one-point crossover and uniform mutation operators can be applied. Moreover, if prior knowledge on the optimal number of features is available, the number of nonzero βj in each exon part of the chromosome may be constrained to a maximum of T in the evolution process.
In an example evolution process such as crossover, two parents (pa, ma) may be randomly selected from the current population for later breeding. Then, the crossover operation may be used to produce offspring that inherit characteristics from both parents. A single crossover point on the intron of both the pa and ma chromosomes is generated between the penalized control parameters λ and α, then these two penalized control parameters on both sides of that point are swapped in the introns of the parents' chromosomes to create the intron parts of the offspring chromosomes c1 and c2. The exon parts β of these two offspring chromosomes are evaluated by the local optimization strategies.
For the mutation process, the mutation operator allows diversity of populations and larger exploration of the search space. During this stage, one of the penalized control parameters λ, α is randomly chosen, with a mutation probability pm (e.g. pm = 0.1), to mutate a selected chromosome. The fitness and β of the new chromosome generated by the mutation operation are also evaluated by the local optimization strategies.
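These crossover and mutation operations may be sketched as follows; the Python chromosome representation and the parameter ranges are illustrative assumptions:

```python
import copy

def crossover_and_mutate(population, rng, pc=0.85, pm=0.1):
    """One-point crossover on the intron (the point falls between lambda and
    alpha, so the parents swap alpha values), plus uniform mutation of one
    randomly chosen control parameter."""
    offspring = []
    for _ in range(len(population) // 2):
        pa, ma = rng.choice(len(population), size=2, replace=False)
        c1 = copy.deepcopy(population[pa])
        c2 = copy.deepcopy(population[ma])
        if rng.random() < pc:
            (l1, a1), (l2, a2) = c1["intron"], c2["intron"]
            c1["intron"], c2["intron"] = (l1, a2), (l2, a1)  # swap alpha
        for child in (c1, c2):
            if rng.random() < pm:
                lam, alpha = child["intron"]
                if rng.random() < 0.5:
                    lam = rng.uniform(0.01, 1.0)    # mutate lambda
                else:
                    alpha = rng.uniform(0.0, 1.0)   # mutate alpha
                child["intron"] = (lam, alpha)
            offspring.append(child)
    return offspring
```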
The roulette-wheel selection may be used to generate the next generation from the parent and offspring populations. The selection probability prob_c of the chromosome c is directly proportional to its fitness, i.e.,

prob_c = f(c) / ( Σ f(parent) + Σ f(offspring) )    (18)
At the genetic selection stage, the candidate chromosomes with higher accuracy will be less likely to be eliminated, while less fit chromosomes still have a chance of surviving to the next generation.
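A sketch of this roulette-wheel selection follows (illustrative; fitness is the accuracy-based objective of Eq. (4) sketched earlier):

```python
import numpy as np

def roulette_select(candidates, X, y, pop_size, rng):
    """Fitness-proportionate selection of Eq. (18): each chromosome's survival
    probability is its fitness divided by the total fitness of the parent and
    offspring populations."""
    scores = np.array([fitness(c, X, y) for c in candidates])
    probs = scores / scores.sum()
    chosen = rng.choice(len(candidates), size=pop_size, replace=True, p=probs)
    return [candidates[int(i)] for i in chosen]
```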
The inventors also evaluated the performance of the WEFSA approach in a simulation study of the system in
accordance with the embodiments of the present invention. In the experiments, six approaches were compared: GA, GP, MA, Elastic net, SCAD-L2, and the hybrid L1/2+L2 regularization (HLR) respectively. Data were simulated from a true model:

y = X'β + σε,  ε ~ N(0, 1)

where X ~ N(0, 1), ε is the independent random noise and σ is the control parameter for the noise.
Three scenarios are presented here. In every example, the dimension of features is 6000. The notation '/' represents the number of observations in the training and test sets respectively, e.g. 100/100. Here are the details of the three scenarios.
In Scenario 1, the dataset consists of 200/200 observations, the noise control parameter σ = 0.2 and

β = (1,−1,1,−1,...,1,−1, 0,...,0, 2,−2,2,−2,...,2,−2, 0,...,0, 2,2,...,2, 0,...,0)

where the successive blocks contain 100, 1900, 100, 1900, 100 and 1900 entries respectively. A grouped feature situation was simulated:

x_j = ρ × x_1 + (1 − ρ) × x_j, j = 2, 3, ..., 100;
x_j = ρ × x_2001 + (1 − ρ) × x_j, j = 2002, 2003, ..., 2100;
x_j = ρ × x_4001 + (1 − ρ) × x_j, j = 4002, 4003, ..., 4100.
where ρ is the correlation coefficient of the grouped variables. In this example, there are three groups of correlated features. An ideal sparse regression method would select only the 300 true features and set the coefficients of the 5700 irrelevant features to zero.
Scenario 2 is defined similarly to Scenario 1, except that there are other independent factors which also contribute to the decision variable y:

β = (1,−1,...,1,−1, 1.5,−2,1.7,3,−1, ..., 0,...,0, 2,−2,...,2,−2, 1.5,−2,1.7,3,−1, ..., 0,...,0, 2,2,...,2, 1.5,−2,1.7,3,−1, ..., 0,...,0)

where each of the three segments contains 100 grouped entries, the pattern (1.5,−2,1.7,3,−1) repeated 20 times (5×20 entries), and 1800 zeros. In this example, there are three groups of correlated features (similar to Scenario 1) and 300 single independent features. An example sparse regression method would select the 600 true features and set the coefficients of the 5400 irrelevant features to zero.
In Scenario 3, the true features were increased to 1000 of the total features, σ = 0.1, and the dataset consists of 500/100 observations, with

β = (1,−1,1,−1,...,1,−1, 1.5,−2,1.7,3,−1, ..., 0,...,0, 2,−2,2,−2,...,2,−2, 1.5,−2,1.7,3,−1, ..., 0,...,0, 2,2,...,2, 1.5,−2,1.7,3,−1, ..., 1,1,...,1, 0,...,0)

where the first two segments each contain 100 grouped entries, 5×20 patterned entries and 1800 zeros, and the last segment contains 100 grouped entries, 5×20 patterned entries, 400 ones and 1400 zeros. The grouped features are simulated as:

x_j = ρ × x_1 + (1 − ρ) × x_j, j = 2, 3, ..., 100;
x_j = ρ × x_2001 + (1 − ρ) × x_j, j = 2002, 2003, ..., 2100;
x_j = ρ × x_4001 + (1 − ρ) × x_j, j = 4002, 4003, ..., 4100;
x_j = 0.1 × x_4201 + 0.9 × x_j, j = 4202, 4203, ..., 4600.
In this example, there are three groups of correlated features (similar to Scenario 1), 400 correlated features (the correlation parameter is 0.1) and 300 independent features. An example sparse regression method would select only the 1000 true features and set the coefficients of the 5000 irrelevant features to zero.
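The grouped-feature designs above may be reproduced with a short sketch such as the following; the function name and the 0-based anchor indices are assumptions:

```python
import numpy as np

def simulate_grouped_features(n, p=6000, rho=0.4, seed=0):
    """Simulate the grouped-feature situation of Scenarios 1-3: three anchor
    features (0-based indices 0, 2000, 4000), each followed by 99 features
    mixed with the anchor through the correlation coefficient rho."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    for anchor in (0, 2000, 4000):
        for j in range(anchor + 1, anchor + 100):
            X[:, j] = rho * X[:, anchor] + (1.0 - rho) * X[:, j]
    return X
```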
In one example, the correlation coefficient ρ of the features may be set to 0.1, 0.4 and 0.7 respectively. The learning model in GA, MA, Elastic net, SCAD-L2, HLR and WEFSA is the logistic classification approach. In GP, the multi-tree classifier is used. For each iteration of GA and MA, the number of selected features based on the filter of information gain is set to 2000. The configuration parameters used by the EC algorithms in these seven approaches are listed in Table I.
TABLE I
Parameters set for the EC algorithms in the seven approaches
Parameter | Value
Population size (P) | 200
Crossover probability (pc) | 0.85
Mutation probability (pm) | 0.1
Stopping criterion (G) | 2000
In the regularization process of these seven approaches, the control parameters of the Elastic net, SCAD-L2, and HLR approaches are tuned by the 10-fold cross-validation (CV) approach in the training set. Note that the Elastic net and HLR methods are tuned by the 10-CV approach on two-dimensional parameter surfaces, while SCAD-L2 is tuned by the 10-CV approach on three-dimensional parameter surfaces. Then, different classifiers are built by these seven feature selection approaches. Finally, the obtained classifiers are applied to the test set for classification and prediction.
The simulations may be repeated 100 times for each method, computing the mean classification accuracy on the test sets. To evaluate the quality of the selected features for these approaches, the sensitivity and specificity of the feature selection performance are defined as follows:

TruePositive (TP) := ‖β̂ .∗ β‖₀
TrueNegative (TN) := ‖¬β̂ .∗ ¬β‖₀
FalsePositive (FP) := ‖β̂ .∗ ¬β‖₀
FalseNegative (FN) := ‖¬β̂ .∗ β‖₀

Sensitivity := TP / (TP + FN),  Specificity := TN / (TN + FP)

where .∗ is the element-wise product, ‖·‖₀ calculates the number of non-zero elements in a vector, and ¬β̂ and ¬β are the logical-not operators on the estimated coefficient vector β̂ and the true simulated β.
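A sketch of these feature selection quality measures in Python (illustrative names; nonzero coefficients are treated as selected features):

```python
import numpy as np

def selection_quality(beta_hat, beta_true):
    """Sensitivity and specificity of feature selection as defined above."""
    sel = beta_hat != 0    # features selected by the method
    true = beta_true != 0  # truly relevant features in the simulation
    tp = np.sum(sel & true)    # relevant features that were selected
    tn = np.sum(~sel & ~true)  # irrelevant features that were rejected
    fp = np.sum(sel & ~true)   # irrelevant features wrongly selected
    fn = np.sum(~sel & true)   # relevant features that were missed
    return tp / (tp + fn), tn / (tn + fp)
```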
Table II shows the feature selection and classification performances of different methods in the different parameter settings with Scenarios 1-3, in which
the results with the best performances are denoted in bold text.
TABLE II
Results of the simulation
ρ | Method | Sensitivity (Scenario 1 / 2 / 3) | Specificity (Scenario 1 / 2 / 3) | Accuracy (Scenario 1 / 2 / 3)
0.1 | GA | 0.902 / 0.584 / 0.527 | 0.991 / 0.956 / 0.862 | 94.32% / 80.48% / 77.54%
0.1 | GP | 0.915 / 0.748 / 0.726 | 0.997 / 0.987 / 0.907 | 95.50% / 88.47% / 79.66%
0.1 | MA | 0.908 / 0.679 / 0.652 | 0.993 / 0.971 / 0.893 | 94.73% / 82.91% / 80.65%
0.1 | Elastic net | 0.910 / 0.726 / 0.724 | 0.994 / 0.975 / 0.904 | 94.53% / 83.03% / 79.78%
0.1 | SCAD-L2 | 0.916 / 0.795 / 0.758 | 0.997 / 0.982 / 0.912 | 94.49% / 82.12% / 80.41%
0.1 | HLR | 0.919 / 0.863 / 0.791 | 0.998 / 0.987 / 0.918 | 95.81% / 90.15% / 85.76%
0.1 | WEFSA | 0.935 / 0.906 / 0.823 | 0.998 / 0.989 / 0.926 | 97.08% / 91.23% / 87.81%
0.4 | GA | 0.724 / 0.531 / 0.457 | 0.985 / 0.923 / 0.813 | 89.71% / 76.49% / 70.37%
0.4 | GP | 0.798 / 0.712 / 0.674 | 0.992 / 0.957 / 0.866 | 93.64% / 82.87% / 77.82%
0.4 | MA | 0.741 / 0.635 / 0.572 | 0.987 / 0.929 / 0.848 | 89.84% / 80.04% / 75.63%
0.4 | Elastic net | 0.805 / 0.712 / 0.623 | 0.991 / 0.940 / 0.863 | 92.06% / 82.19% / 75.19%
0.4 | SCAD-L2 | 0.837 / 0.741 / 0.698 | 0.992 / 0.949 / 0.894 | 92.44% / 82.84% / 76.51%
0.4 | HLR | 0.862 / 0.820 / 0.725 | 0.994 / 0.960 / 0.903 | 93.89% / 83.45% / 79.22%
0.4 | WEFSA | 0.904 / 0.852 / 0.782 | 0.995 / 0.972 / 0.912 | 95.31% / 85.79% / 80.06%
0.7 | GA | 0.563 / 0.467 / 0.417 | 0.961 / 0.891 / 0.775 | 75.08% / 69.04% / 62.65%
0.7 | GP | 0.620 / 0.665 / 0.633 | 0.984 / 0.928 / 0.832 | 90.15% / 73.94% / 70.24%
0.7 | MA | 0.596 / 0.579 / 0.536 | 0.971 / 0.897 / 0.794 | 89.66% / 70.96% / 66.30%
0.7 | Elastic net | 0.675 / 0.637 / 0.561 | 0.977 / 0.905 / 0.816 | 88.17% / 71.85% / 65.26%
0.7 | SCAD-L2 | 0.691 / 0.694 / 0.583 | 0.986 / 0.929 / 0.822 | 89.79% / 74.27% / 68.83%
0.7 | HLR | 0.763 / 0.729 / 0.671 | 0.988 / 0.937 / 0.837 | 90.04% / 77.18% / 73.75%
0.7 | WEFSA | 0.820 / 0.754 / 0.724 | 0.991 / 0.943 / 0.851 | 92.34% / 80.65% / 76.94%
It is found that with the decrease of the correlation coefficient ρ, the models' performances improve. In Table II, the WEFSA approach always selects the most correct relevant features in the different data environments of Scenarios 1-3. The highest sensitivities and specificities of feature selection obtained by WEFSA mean that WEFSA selects the most relevant features and deletes the most irrelevant features respectively. Thus, the classification accuracy obtained by the WEFSA approach also outperforms the other EC and regularization methods.
In another experiment performed by the inventors, to further evaluate the effectiveness of the WEFSA method, five example gene expression microarray datasets were used, including AML, DLBCL, Prostate, Lymphoma and Lung cancer. The AML dataset has 116 patients and contains 6283 genes. The DLBCL dataset contains about 240 samples' information; each sample includes the expression data of 8810 genes. The Prostate dataset contains the expression profiles of 12,600 genes for 50 normal tissues and 52 prostate tumour tissues. The Lymphoma dataset contains 77 microarray gene expression profiles of the 2 most prevalent adult lymphoid malignancies: 58 samples of diffuse large B-cell lymphoma (DLBCL) and 19 follicular lymphomas (FL). The original data contains 7,129 gene expression values. The Lung cancer dataset contains 164 samples with 87 lung adenocarcinomas and 77 adjacent normal tissues, with 22,401 microarray gene expression profiles. A brief summary of these datasets is provided in Table III below.
TABLE III
The detailed information of five real gene expression datasets used in the experiments
Dataset | No. samples | No. genes | Classes |
AML | 116 | 6283 | High risk / Low risk |
DLBCL | 240 | 7399 | High risk / Low risk |
Lymphoma | 77 | 7129 | DLBCL / FL |
Prostate | 102 | 12600 | Normal / Tumor |
Lung cancer | 164 | 22401 | Normal / Tumor |
In order to accurately assess the performance of the seven feature selection approaches, each real dataset is randomly divided into two sets: two thirds of the samples form the training set used for model estimation, and the remaining one third are used to test the estimation performance. For the regularization approaches, the penalty parameters are tuned by 10-fold cross-validation. For each real dataset, the procedure is repeated 100 times for each method.
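As a concrete reading of this protocol, the sketch below assumes a generic expression matrix X and label vector y are already loaded; an elastic-net logistic model stands in for the compared regularization approaches, and its penalty parameter is tuned by 10-fold cross-validation on the training two thirds.

```python
# Minimal sketch of the split / tune / test protocol described above
# (X and y are assumed to be a loaded gene expression dataset).
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

def evaluate_once(X, y, seed):
    # Two thirds for model estimation, one third held out for testing.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1 / 3, stratify=y, random_state=seed)
    grid = GridSearchCV(
        LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, max_iter=5000),
        param_grid={"C": np.logspace(-2, 2, 9)},   # penalty tuned on training data
        cv=10)                                     # 10-fold cross-validation
    grid.fit(X_tr, y_tr)
    return grid.score(X_tr, y_tr), grid.score(X_te, y_te)

# Repeat the random split 100 times and average, as in the experiments:
# accs = np.mean([evaluate_once(X, y, s) for s in range(100)], axis=0)
```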
TABLE IV
Results of empirical datasets
Dataset | Methods | Training accuracy | Test accuracy | No. selected genes |
AML | GA | 95.93% | 91.87% | 32 |
AML | GP | 97.02% | 92.82% | 21 |
AML | MA | 96.35% | 91.13% | 25 |
AML | Elastic net | 96.67% | 92.04% | 28 |
AML | SCAD-L2 | 96.62% | 92.94% | 23 |
AML | HLR | 97.46% | 93.78% | 22 |
AML | WEFSA | 97.84% | 94.32% | 19 |
DLBCL | GA | 91.97% | 88.58% | 24 |
DLBCL | GP | 95.34% | 91.22% | 14 |
DLBCL | MA | 93.40% | 90.34% | 18 |
DLBCL | Elastic net | 94.62% | 92.54% | 21 |
DLBCL | SCAD-L2 | 95.39% | 92.03% | 16 |
DLBCL | HLR | 97.21% | 93.15% | 17 |
DLBCL | WEFSA | 97.28% | 93.73% | 13 |
Lymphoma | GA | 95.41% | 93.37% | 63 |
Lymphoma | GP | 96.14% | 92.64% | 38 |
Lymphoma | MA | 96.08% | 92.17% | 54 |
Lymphoma | Elastic net | 95.93% | 91.65% | 41 |
Lymphoma | SCAD-L2 | 96.42% | 93.26% | 28 |
Lymphoma | HLR | | | 29 |
Lymphoma | WEFSA | 98.51% | 94.03% | 27 |
Prostate | GA | 95.79% | 90.34% | 42 |
Prostate | GP | 97.26% | 93.81% | 27 |
Prostate | MA | 95.07% | 90.83% | 34 |
Prostate | Elastic net | 96.52% | 92.51% | 31 |
Prostate | SCAD-L2 | 95.82% | 92.89% | 26 |
Prostate | HLR | 97.15% | 92.63% | 23 |
Prostate | WEFSA | 98.32% | 94.17% | 22 |
Lung cancer | GA | 97.14% | 92.11% | 51 |
Lung cancer | GP | 98.25% | 91.59% | 45 |
Lung cancer | MA | 97.42% | 90.85% | 49 |
Lung cancer | Elastic net | 96.94% | 92.27% | 34 |
Lung cancer | SCAD-L2 | 97.63% | 92.48% | 36 |
Lung cancer | HLR | 98.16% | 93.61% | 34 |
Lung cancer | WEFSA | 98.83% | | 33 |
Table IV reports the averaged training accuracies (10-fold cross-validation) and test accuracies obtained by the different feature selection approaches and regularization models on the five datasets; the results with the best performance are denoted in bold.
The performance of the WEFSA approach is clearly better than that of the other six approaches. The relevant gene selection performances of the different approaches on the five real datasets are also shown in Table IV: the number of genes selected by the WEFSA model is the smallest among the seven feature selection approaches. Among the regularization approaches with a grouping effect, such as Elastic net, SCAD-L2 and HLR, HLR performs better than Elastic net and SCAD-L2 in gene selection.
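The grouping effect referred to here can be illustrated directly: with near-duplicate predictors, a pure L1 penalty tends to keep one member of a correlated group and drop the rest, while the added L2 term spreads similar coefficients across the whole group. The following is a minimal sketch with synthetic data; the penalty values and group sizes are illustrative assumptions, not values from the experiments.

```python
# Minimal sketch of the grouping effect: three near-duplicate "genes"
# carry the same signal; compare L1-only with L1+L2 coefficients.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(1)
n = 100
z = rng.normal(size=n)
X = np.column_stack([z + 0.05 * rng.normal(size=n) for _ in range(3)]
                    + [rng.normal(size=n) for _ in range(7)])  # 3 correlated + 7 noise
y = z + 0.1 * rng.normal(size=n)

print(Lasso(alpha=0.1).fit(X, y).coef_[:3])                     # tends to concentrate
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_[:3])  # tends to share weight
```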
Likewise, among the EC approaches, such as GA, GP and MA, GP performs better than GA and MA in both gene selection and classification. Comparing the performances of the seven feature selection algorithms, Table IV shows that the WEFSA approach has the best performance in both gene selection and predictive classification.
For biological analysis of the results, the 10 top-ranked genes selected by the different methods in the AML dataset are shown in Table V below.
TABLE V
The 10 top genes in the AML dataset
Rank | GA | GP | MA | Elastic net | SCAD-L2 | HLR | WEFSA |
1 | VEGF | SNRPN | MEIS1 | GSTM1 | DNMT1 | MLH1 | FLT3 |
2 | PRDX2 | FHIT | ALOX12 | SFRP2 | CDH13 | CCDC69 | CDKN2B |
3 | TP73 | PNLIP | CDKN2A | GRAF | SFRP5 | JUNB | INK4B |
4 | SFRP5 | INK4B | CDH13 | GSTM1 | ABCA8 | GLIPR1 | GSTM1 |
5 | FHIT | A4GALT | SNRPN | PTPN6 | RARA | PTPN6 | SFRP1 |
6 | GLIPR1 | SFRP1 | GSTM1 | FHIT | GSTM1 | RUNX3 | NPM1 |
7 | GRK5 | PRDX2 | GRAF | JUNB | WNT5A | GSTM1 | SFRP2 |
8 | TNXB | SLN | SFRP5 | SFRP5 | PNLIP | MEIS1 | SFRP5 |
9 | GRAF | PTPN6 | PDLIM2 | ALOX12 | DAPK1 | ALOX12 | CEBPA |
10 | PTRF | DAPK1 | PTRF | CDKN2A | HOXA9 | SFRP5 | GLIPR1 |
Compared with the other feature selection methods, the WEFSA approach selects some unique genes, such as SFRP1 and SFRP2, which are members of the Sfrp family, a class of signal transduction proteins. The Sfrp family proteins play a key role in transmitting TGF-beta signals from the cell-surface receptor to the cell nucleus, and their mutation or deletion has been shown to lead to pancreatic cancer. It is believed that the Sfrp family may be strongly associated with AML.
Among the other genes selected by the WEFSA approach, the gene FLT3 can stimulate the motility of AML cells, and the expression of FLT3 has been found to be up-regulated in several different kinds of AML. The protein encoded by the gene NPM1 is reported to be very similar to a tumour suppressor of Drosophila, and NPM1 is highly relevant to AML.
Moreover, some relevant genes selected by the other regularization models, namely the Elastic net, SCAD-L2 and HLR approaches, are also found by WEFSA, for example SFRP5 and GSTM1. These genes are significantly associated with AML, as has been discussed in the literature.
Advantageously, the WEFSA approach can not only find the relevant genes that are selected by other feature selection methods, but also find some unique genes that are not selected by the other models yet are significantly associated with the disease. Hence, the WEFSA approach may identify the relevant genes accurately and efficiently.
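For illustration only, a memetic wrapper-embedded loop in the spirit of this approach can be sketched as follows; a simple genetic algorithm over a single log-penalty parameter and an L1-penalized logistic learner are simplifying assumptions, and the actual intron/exon encoding, genetic operators and HLR penalty of the WEFSA method are not reproduced here.

```python
# Illustrative memetic loop: a global evolutionary search over a penalty
# parameter wraps a local, embedded sparse learner (assumptions noted above).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fitness(X, y, log_c):
    model = LogisticRegression(penalty="l1", solver="liblinear", C=10.0 ** log_c)
    return cross_val_score(model, X, y, cv=5).mean()   # wrapper evaluation

def memetic_search(X, y, pop_size=10, generations=20, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-2, 2, size=pop_size)            # candidate log10(C) values
    for _ in range(generations):
        scores = np.array([fitness(X, y, c) for c in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]            # selection
        children = parents + rng.normal(0.0, 0.2, size=parents.size)  # mutation
        pop = np.concatenate([parents, children])
    best = max(pop, key=lambda c: fitness(X, y, c))
    final = LogisticRegression(penalty="l1", solver="liblinear",
                               C=10.0 ** best).fit(X, y)  # embedded local step
    return np.flatnonzero(final.coef_.ravel())            # signature features
```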
It will also be appreciated that where the methods and systems of the present invention are either wholly or partly implemented by computing systems, any appropriate computing system architecture may be utilised. This includes standalone computers, network computers and dedicated hardware devices. Where the terms computing system and computing device are used, these terms are intended to cover any appropriate arrangement of computer hardware capable of implementing the function described.
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
Any reference to prior art contained herein is not to be taken as an admission that the information is common general knowledge, unless otherwise indicated.
Claims (15)
- CLAIMS:
- 1. A search method for identifying features of genetic information, comprising the step of processing the genetic information using a combined global search process and local search process, wherein the global search process includes a population-based optimization process arranged to determine a global optimum of a search population associated with the genetic information, and wherein the local search process includes a machine-learning optimization process arranged to further optimize the search population so as to determine at least one signature feature associated with the genetic information.
- 2. The search method in accordance with Claim 1, wherein the global search process includes a biological evolution process involving at least one genetic operator applied to the search population.
- 3. The search method in accordance with Claim 1, wherein the global search process includes an optimization of a plurality of control parameters for a regularization process.
- 4. The search method in accordance with Claim 3, wherein the regularization process includes a hybrid L1/2 + L2 regularization (HLR) process.
- 5. The search method in accordance with Claim 4, further comprising the step of encoding and representing the genetic information in an intron part and an exon part.
- 6. The search method in accordance with Claim 5, wherein the intron part is associated with penalized control parameters for the regularization process and the exon part is associated with coefficients used in the machine-learning optimization process.
- 7. The search method in accordance with Claim 3, wherein the global search process includes a wrapper feature selection process arranged to induce the search population and to perform a heuristic searching process to globally optimize the plurality of control parameters for the regularization process.
- 8. The search method in accordance with Claim 1, wherein the local search process includes an embedded feature selection process arranged to optimize the search population by selecting the signature feature and to construct a learning model for the machine-learning optimization process.
- 9. The search method in accordance with Claim 8, wherein the learning model is constructed based on an efficient gradient regularization process.
- 10. The search method in accordance with Claim 1, wherein the combined global search process and local search process is based on a memetic framework arranged to facilitate an integration of the determination of the signature features and the machine-learning optimization process.
- 11. A genetic feature identifying system for identifying features of genetic information, comprising a global search module and a local search module arranged to process the genetic information using a global search process and a local search process respectively, wherein the global search process includes a population-based optimization process arranged to determine a global optimum of a search population associated with the genetic information, and wherein the local search process includes a machine-learning optimization process arranged to further optimize the search population so as to determine at least one signature feature associated with the genetic information.
- 12. The genetic feature identifying system in accordance with Claim 11, wherein the global search process includes a biological evolution process involving at least one genetic operator applied to the search population.
- 13. The genetic feature identifying system in accordance with Claim 11, wherein the global search module is arranged to optimize a plurality of control parameters for a regularization process.
- 14. The genetic feature identifying system in accordance with Claim 13, wherein the regularization process includes a hybrid L1/2 + L2 regularization (HLR) process.
- 15. The genetic feature identifying system in accordance with Claim 14, further comprising a genetic information encoder arranged to encode and represent the genetic information in an intron part and an exon part.
- 16. The genetic feature identifying system in accordance with Claim 15, wherein the intron part is associated with penalized control parameters for the regularization process and the exon part is associated with coefficients used in the machine-learning optimization process.
- 17. The genetic feature identifying system in accordance with Claim 13, wherein the global search module is arranged to perform a wrapper feature selection process so as to induce the search population and to perform a heuristic searching process to globally optimize the plurality of control parameters for the regularization process.
- 18. The genetic feature identifying system in accordance with Claim 11, wherein the local search module is arranged to perform an embedded feature selection process so as to optimize the search population by selecting the signature feature and to construct a learning model for the machine-learning optimization process.
- 19. The genetic feature identifying system in accordance with Claim 18, wherein the local search module is further arranged to construct the learning model based on an efficient gradient regularization process.
- 20. The genetic feature identifying system in accordance with Claim 11, wherein a combination of the global search module and the local search module is arranged to facilitate an integration of the determination of the signature features and the machine-learning optimization process based on a memetic framework.
FIG. 3
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2018100796A AU2018100796A4 (en) | 2018-06-14 | 2018-06-14 | A genetic feature identifying system and a search method for identifying features of genetic information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2018100796A AU2018100796A4 (en) | 2018-06-14 | 2018-06-14 | A genetic feature identifying system and a search method for identifying features of genetic information |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2018100796A4 true AU2018100796A4 (en) | 2018-07-19 |
Family
ID=62845391
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2018100796A Ceased AU2018100796A4 (en) | 2018-06-14 | 2018-06-14 | A genetic feature identifying system and a search method for identifying features of genetic information |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2018100796A4 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020112478A1 (en) * | 2018-11-29 | 2020-06-04 | Somalogic, Inc. | Methods for determining disease risk combining downsampling of class-imbalanced sets with survival analysis |
CN113271849A (en) * | 2018-11-29 | 2021-08-17 | 私募蛋白质体公司 | Disease risk determination method combining category imbalance set down-sampling and survival analysis |
CN110909158A (en) * | 2019-07-05 | 2020-03-24 | 重庆信科设计有限公司 | Text classification method based on improved firefly algorithm and K nearest neighbor |
CN110909158B (en) * | 2019-07-05 | 2022-10-18 | 重庆信科设计有限公司 | Text classification method based on improved firefly algorithm and K nearest neighbor |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Camacho et al. | Next-generation machine learning for biological networks | |
Xu et al. | Inference of genetic regulatory networks with recurrent neural network models using particle swarm optimization | |
Nyathi et al. | Comparison of a genetic algorithm to grammatical evolution for automated design of genetic programming classification algorithms | |
Lægreid et al. | Predicting gene ontology biological process from temporal gene expression patterns | |
Meyer et al. | Advances in systems biology modeling: 10 years of crowdsourcing DREAM challenges | |
Lai et al. | Artificial intelligence and machine learning in bioinformatics | |
Manning et al. | Biologically inspired intelligent decision making: a commentary on the use of artificial neural networks in bioinformatics | |
US20170193157A1 (en) | Testing of Medicinal Drugs and Drug Combinations | |
WO2016118513A1 (en) | Method and system for analyzing biological networks | |
Fonseca et al. | Phylogeographic model selection using convolutional neural networks | |
Roohani et al. | GEARS: Predicting transcriptional outcomes of novel multi-gene perturbations | |
AU2018100796A4 (en) | A genetic feature identifying system and a search method for identifying features of genetic information | |
Böck et al. | Hub-centered gene network reconstruction using automatic relevance determination | |
Elzeki et al. | A new hybrid genetic and information gain algorithm for imputing missing values in cancer genes datasets | |
US11527308B2 (en) | Enhanced optimization with composite objectives and novelty-diversity selection | |
Al‐Anni et al. | Prediction of NSCLC recurrence from microarray data with GEP | |
Vimaladevi et al. | A microarray gene expression data classification using hybrid back propagation neural network | |
Bacardit et al. | Hard data analytics problems make for better data analysis algorithms: bioinformatics as an example | |
Deshpande et al. | Efficient strategies for screening large-scale genetic interaction networks | |
Garro et al. | Designing artificial neural networks using differential evolution for classifying DNA microarrays | |
Mohammadi et al. | Multi-resolution single-cell state characterization via joint archetypal/network analysis | |
Husseini et al. | Type2 soft biclustering framework for Alzheimer microarray | |
Shafiekhani et al. | Extended robust Boolean network of budding yeast cell cycle | |
González-Álvarez et al. | Convergence analysis of some multiobjective evolutionary algorithms when discovering motifs | |
González-Alvarez et al. | Multiobjective optimization algorithms for motif discovery in DNA sequences |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |