AU2018100796A4 - A genetic feature identifying system and a search method for identifying features of genetic information - Google Patents



Publication number
AU2018100796A4
Authority
AU
Australia
Prior art keywords
search
genetic
regularization
feature
accordance
Prior art date
Legal status
Ceased
Application number
AU2018100796A
Inventor
Yong Liang
Xiao Ying LIU
Sai Wang
Current Assignee
Macau University of Science and Technology
Original Assignee
Macau University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Macau University of Science and Technology filed Critical Macau University of Science and Technology
Priority to AU2018100796A
Application granted
Publication of AU2018100796A4


Abstract

A genetic feature identifying system and a search method for identifying features of genetic information. The method comprises the step of processing the genetic information using a combined global search process and local search process, wherein the global search process includes a population-based optimization process arranged to determine a global optimum of a search population associated with the genetic information, and wherein the local search process includes a machine-learning optimization process arranged to further optimize the search population so as to determine at least one signature feature associated with the genetic information.

Description

TECHNICAL FIELD
The present invention relates to a genetic feature identifying system and a search method for identifying features of genetic information, and particularly, although not exclusively, to a feature selection process combined with a machine-learning optimization process using a memetic framework.
BACKGROUND
Genetic information processing may be useful for identifying or predicting diseases of patients using statistical analysis. In general, the genetic information may be embedded with key features representing or associating with an existence or a probability of the existence of certain diseases or health-related issues.
Explosive growth of data urgently requires the development of new technologies and automation tools that can intelligently help translate large amounts of data into useful information and knowledge. Often, not all of the features in the data are essential. Therefore it is necessary to select only a small portion of the relevant features from the original large data set for further processing or analysis.
SUMMARY OF THE INVENTION
2018100796 14 Jun 2018
In accordance with a first aspect of the present invention, there is provided a search method for identifying features of genetic information, comprising the step of processing the genetic information using a combined global search process and local search process, wherein the global search process includes a population-based optimization process arranged to determine a global optimum of a search population associated with the genetic information, and wherein the local search process includes a machine-learning optimization process arranged to further optimize the search population so as to determine at least one signature feature associated with the genetic information.
In an embodiment of the first aspect, the global search process includes a biological evolution process involving at least one genetic operator applied to the search population.
In an embodiment of the first aspect, the global search process includes an optimization of a plurality of control parameters for a regularization process.
In an embodiment of the first aspect, the regularization process includes a hybrid L1/2 + L2 regularization (HLR) process.
In an embodiment of the first aspect, the method further comprises the step of encoding and representing the genetic information in an intron part and an exon part.
In an embodiment of the first aspect, the intron part is associated with penalized control parameters for the
regularization process and the exon part is associated with coefficients used in the machine-learning optimization process.
In an embodiment of the first aspect, the global search process includes a wrapper feature selection process arranged to induce the search population and to perform a heuristic searching process to globally optimize the plurality of control parameters for the regularization process.
In an embodiment of the first aspect, the local search process includes an embedded feature selection process arranged to optimize the search population by selecting the signature feature and to construct a learning model for the machine-learning optimization process.
In an embodiment of the first aspect, the learning model is constructed based on an efficient gradient regularization process.
In an embodiment of the first aspect, the combined global search process and local search process is based on a memetic framework arranged to facilitate an integration of the determination of the signature features and the machine-learning optimization process.
In accordance with a second aspect of the present invention, there is provided a genetic feature identifying system for identifying features of genetic information, comprising a global search module and a local search module arranged to process the genetic information using a
global search process and a local search process respectively, wherein the global search process includes a population-based optimization process arranged to determine a global optimum of a search population associated with the genetic information, and wherein the local search process includes a machine-learning optimization process arranged to further optimize the search population so as to determine at least one signature feature associated with the genetic information.
In an embodiment of the second aspect, the global search process includes a biological evolution process involving at least one genetic operator applied to the search population.
In an embodiment of the second aspect, the global search module is arranged to optimize a plurality of control parameters for a regularization process.
In an embodiment of the second aspect, the regularization process includes a hybrid L1/2 + L2 regularization (HLR) process.
In an embodiment of the second aspect, the genetic feature identifying system further comprises a genetic information encoder arranged to encode and represent the genetic information in an intron part and an exon part.
In an embodiment of the second aspect, the intron part is associated with penalized control parameters for the regularization process and the exon part is associated with coefficients used in the machine-learning optimization process.
In an embodiment of the second aspect, the global search module is arranged to perform a wrapper feature selection process so as to induce the search population and to perform a heuristic searching process to globally optimize the plurality of control parameters for the regularization process.
In an embodiment of the second aspect, the local search module is arranged to perform an embedded feature selection process so as to optimize the search population by selecting the signature feature and to construct a learning model for the machine-learning optimization process.
In an embodiment of the second aspect, the local search module is further arranged to construct the learning model based on an efficient gradient regularization process.
In an embodiment of the second aspect, a combination of the global search module and the local search module is arranged to facilitate an integration of the determination of the signature features and the machine-learning optimization process based on a memetic framework.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings in which:
Figure 1 is a schematic diagram of a computing server for operation as a genetic feature identifying system in accordance with one embodiment of the present invention;
Figure 2 is a schematic diagram of an embodiment of the genetic feature identifying system in accordance with one embodiment of the present invention; and
Figure 3 is an illustration of a procedure of crossover operation.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
The inventors have, through their own research, trials and experiments, devised that various methods may be used for feature selection, such as filter, wrapper, embedded and combined methods. For filter methods, different feature selection measures may be applied to rank individual features, such as but not limited to information theory; consistency measures; dependency (or correlation) measures; distance measures; rough set theory and fuzzy set theory.
The filter methods examine each feature independently, and ignore the individual performance of the feature in relation to the group, of which it is a part, despite the fact that features in a group may have a combined effect in a machine learning task.
For wrapper methods, machine learning algorithms may be used to evaluate the performance of selected feature subsets; these may include support vector machines (SVMs); K-nearest neighbor (KNN); artificial neural networks (ANNs); decision tree (DT); Naive Bayes (NB); multiple linear regression for classification; extreme learning machines (ELMs); and linear discriminant analysis (LDA). The results of the wrapper methods may be superior to those of the filter methods, but the computational cost of the wrapper methods may be higher.
The embedded methods integrate feature selection and learning procedure into a single process. For example, regularization methods may be an embedded technique which performs both learning model construction and automatic feature selection simultaneously.
Preferably, regularization methods for feature selection may be applied to high dimensional feature selection problems such as gene expression microarray data; Lasso (L1), smoothly clipped absolute deviation (SCAD), minimax concave penalty (MCP), and L1/2 regularization may also be used.
In gene expression analysis, if genes share the same biological pathway, they may be highly correlated and grouped. Therefore, some methods may be applied to deal with issues of high relevance and grouping features, such as group Lasso, Elastic net, SCAD-L2, and hybrid L1/2 + L2 regularization (HLR).
Given that each feature evaluation measure has its own advantages and disadvantages, preferably, the combined method means that the evaluation procedure includes different types of feature selection measures such as filter and wrapper.
Without wishing to be bound by theory, Evolutionary Computations (EC) methods may be combined with feature selection methods because of their global optimization capabilities. Based on the relevant evaluation criteria, the EC process of feature selection may also be divided into four categories, similar to the categorization mentioned above.
For example, these methods may include genetic algorithm (GA), genetic programming (GP), particle swarm optimization (PSO), ant colony optimization (ACO), differential evolution (DE), evolutionary strategy (ES), estimated distribution algorithm (EDA) and memetic algorithm (MA).
In combined methods, memetic-based feature selection methods may combine wrapper and filter methods, and provide an opportunity for population-based optimization with local search. For example, GAs may be applied for wrapper feature selection and Markov blanket approach may be used as a local search for filter feature selection. However, such two-stage approaches may have the potential limitation that filter evaluation measures may eliminate potentially useful features regardless of their performance in the wrapper approaches. In addition, the wrapper approaches may involve a large number of assessments, and each assessment may take a considerable amount of time, especially when the numbers of features and instances are large. The second limitation of the combined feature selection methods is that they are primarily concerned with the relatively small numbers of features and instances.
Feature interaction (or grouping effect) presents another difficulty in feature selection. On one hand, a feature which is weakly relevant to the target could end up significantly improving the accuracy of the learning model when used together with some complementary features. On the other hand, an individually relevant feature can become redundant when used together with other features. Feature interaction occurs frequently in many areas. The third limitation of some combined feature selection methods is that filter measures, which evaluate features individually, do not work well, and a subset of relevant or grouping features needs to be evaluated as a whole.
With reference to Figure 1, an embodiment of the present invention is illustrated. This embodiment is arranged to provide a genetic feature identifying system for identifying features of genetic information, and the system comprises a global search module and a local search module arranged to process the genetic information using a global search process and a local search process respectively, wherein the global search process includes a population-based optimization process arranged to determine a global optimum of a search population associated with the genetic information, and wherein the local search process includes a machine-learning optimization process arranged to further optimize the search population so as to determine at least one signature feature associated with the genetic information.
Preferably, the genetic feature identifying system may be used to solve the abovementioned limitations of the combined feature selection approaches, and may be used to perform a combined wrapper-embedded feature selection
approach (WEFSA). For example, the system may use a memetic framework to combine genetic algorithm (global search) and embedded regularization approaches (local search), so as to determine the signature features associated with the genetic information, such as DNA or genes being analysed. Therefore, the combined global search process and local search process is based on a memetic framework arranged to facilitate an integration of the determination of the signature features and the machine-learning optimization process.
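The combined wrapper-embedded idea can be sketched as a generic memetic loop: a GA-style global search whose individuals are refined by a local search step in each generation. The sketch below is illustrative only; the function names (`fitness`, `local_refine`) and the selection, crossover and mutation choices are assumptions, not the patent's exact specification.

```python
import random

def memetic_search(fitness, local_refine, init_population, generations=50):
    """Memetic loop sketch: population-based global search with a local
    search ("meme") applied to each new individual."""
    population = [local_refine(ind) for ind in init_population]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: len(scored) // 2]          # selection (top half kept)
        children = []
        while len(children) < len(population) - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))         # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:                 # occasional mutation
                i = random.randrange(len(child))
                child = child[:i] + [child[i] + random.gauss(0, 0.1)] + child[i + 1:]
            children.append(local_refine(child))      # meme: local refinement
        population = parents + children
    return max(population, key=fitness)
```

Because the top half of each generation is carried over unchanged, the best fitness in the population never decreases across generations.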
In this embodiment, the global search module and the local search module are implemented by or for operation on a computer having an appropriate user interface. The computer may be implemented by any computing architecture, including stand-alone PC, client/server architecture, dumb terminal/mainframe architecture, or any other appropriate architecture. The computing device is appropriately programmed to implement the invention.
Referring to Figure 1, there is a shown a schematic diagram of a computer or a computing server 100 which in this embodiment comprises a server 100 arranged to operate, at least in part if not entirely, the genetic feature identifying system in accordance with one embodiment of the invention. The server 100 comprises suitable components necessary to receive, store and execute appropriate computer instructions. The components may include a processing unit 102, read-only memory (ROM)
104, random access memory (RAM) 106, and input/output devices such as disk drives 108, input devices 110 such as an Ethernet port, a USB port, etc. Display 112 such as a liquid crystal display, a light emitting display or any
other suitable display, and communications links 114. The server 100 includes instructions that may be included in ROM 104, RAM 106 or disk drives 108 and may be executed by the processing unit 102. There may be provided a plurality of communication links 114 which may variously connect to one or more computing devices such as a server, personal computers, terminals, wireless or handheld computing devices. At least one of the communications links may be connected to an external computing network through a telephone line or other type of communications link.
The server may include storage devices such as a disk drive 108 which may encompass solid state drives, hard disk drives, optical drives or magnetic tape drives. The server 100 may use a single disk drive or multiple disk drives. The server 100 may also have a suitable operating system 116 which resides on the disk drive or in the ROM of the server 100.
The system has a database 120 residing on a disk or other storage device which is arranged to store at least one record 122. The database 120 is in communication with the server 100 with an interface, which is implemented by computer software residing on the server 100. Alternatively, the database 120 may also be implemented as a stand-alone database system in communication with the server 100 via an external computing network, or other types of communication links.
With reference to Figure 2, there is shown an embodiment of the genetic feature identifying system 200. In this embodiment, the server 100 is used as part of a
genetic feature identifying system 200 as a server or a processor arranged to analyze the genetic information such as DNA or genes of a biological specimen or sample. In one example, the system 200 may select the relevant features of biomarker genes, which may be useful for predicting the patients' class or particular diseases.
Preferably, the global search process includes an optimization of a plurality of control parameters for a regularization process. Regularization methods may be considered as an important embedded technique and perform both model learning and automatic feature selection simultaneously, focusing on high dimensional feature selection problems, such as relevant gene selection in microarray data. Example regularization methods include Lasso, SCAD, MCP and L1/2. Since Lasso is a convex penalty function, the gradient-based coordinate descent algorithm is suitable and may be used for the global optimization of Lasso.
In response to the problem of highly correlated and grouped features, for example, Elastic net, SCAD-L2, and hybrid L1/2 + L2 regularization, a more complex harmonic regularization approach (CHR) may be used for uncertain probability distributions of data. Alternatively, a self-paced curriculum learning (SPLC) regularization approach may be used, which significantly improves the learning efficiency when the number of instances is large. Regularization approaches are one-stage feature evaluation measures, which are suitable for complex feature selection problems with high interaction and large scales of features and instances.
In regularization methods, the control parameter between loss function and penalty function may be important for their performance in feature selection. The feasible value of the control parameter is generally tuned by the grid search method with a k-fold cross-validation approach. Preferably, some efficient regularization methods may include non-convex and multimodal penalty functions. These regularization methods may search across multiple parameters, which are suitable to be optimized by
EC approaches. For example, GA can deal with both unimodal and multimodal search spaces well, and the population-based search can find the global optima of these control parameters efficiently.
In accordance with an embodiment of the present invention, there is provided a feature selection method performed by a global search module 202 and a local search module 204 arranged to process the input genetic information 208 using a global search process and a local search process respectively, using a combined wrapper-embedded feature selection approach (WEFSA) which may improve learning performance and accelerate the search to identify the relevant feature subsets. Preferably, the genetic feature identifying system 200 may fine-tune the population of GA solutions by selecting the signature feature 210, and construct the learning model based on an efficient gradient regularization process.
Preferably, the global search module 202 is arranged to perform a wrapper feature selection process so as to induce the search population and to perform a heuristic searching process to globally optimize the plurality of control parameters for the regularization process.
The wrapper methods induce the population of GA solutions, using heuristic search strategies to globally optimize the control parameters for the non-convex regularization. Based on a memetic framework, the method in accordance with the present invention integrates feature selection and learning model construction into a single process under the global optimization of the non-convex regularization.
Any efficient regularization approach can serve as the embedded method in WEFSA, and preferably, the regularization process includes a hybrid L1/2 + L2 regularization (HLR) process arranged to optimize a plurality of control parameters for a regularization process.
As discussed above, regularization may be considered an important embedded feature selection approach in some example embodiments. Suppose X denotes the n × p data matrix whose rows are x_i = (x_i1, x_i2, ..., x_ip), 1 ≤ i ≤ n, and y denotes the corresponding dependent variable (y1, ..., yn)^T.

For any control parameter λ (λ > 0), the common form of regularization is:

L(λ, β) = argmin { R(β) + λP(β) }    (1)

where β = (β1, ..., βp) are the estimated coefficients, R(β) is a loss function and P represents the regularization term. The most commonly used regularization method is the least absolute shrinkage and selection operator (Lasso, also the L1 penalty), i.e., P(β) = Σ_{j=1}^{p} |βj|. It performs continuous shrinkage and gene selection at the same time.
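To make the general form of Eq. (1) with the Lasso penalty concrete, the sketch below evaluates the penalized objective for a linear model. The squared-error loss and its 1/(2n) scaling are illustrative assumptions here; the document itself later works with a logistic loss.

```python
def lasso_objective(X, y, beta, lam):
    """R(beta) + lam * P(beta) with an assumed squared-error loss and the
    L1 (Lasso) penalty P(beta) = sum_j |beta_j| -- a minimal sketch."""
    n = len(y)
    residuals = [y[i] - sum(X[i][j] * beta[j] for j in range(len(beta)))
                 for i in range(n)]
    loss = sum(r * r for r in residuals) / (2 * n)   # assumed 1/(2n) scaling
    penalty = sum(abs(b) for b in beta)
    return loss + lam * penalty
```

For example, with X the 2×2 identity, y = (1, 2) and β = (1, 2), the loss term vanishes and the objective equals λ times |1| + |2|.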
Some other L1-norm type regularization methods may also be used. For example, the smoothly clipped absolute deviation (SCAD) penalty, which is symmetric, nonconvex, and can produce sparse solutions at the origin in the parameter space. The adaptive Lasso penalizes the different coefficients with dynamic weights in the L1 penalty. The minimax concave penalty (MCP) provides the convexity of the penalized loss in sparse regions to the greatest extent, given certain thresholds for feature selection and unbiasedness. However, for large-scale feature selection problems, such as genomic data analysis, the results of the L1-type regularization may not be sparse enough for real applications.
Genetic information such as gene microarray or RNA-seq data sets may have many thousands of genes, and it may be desirable to select fewer but informative genes.
Although the L0 regularization, where P(β) = Σ_{j=1}^{p} |βj|^0 counts the nonzero coefficients, yields the sparsest solution theoretically, it has to solve an NP-hard combinatorial optimization problem. In order to obtain a more concise solution and improve the predictive accuracy of the machine learning model, the inventors studied the Lp-norm (0 < p < 1), especially p = 1/10, 1/2, 2/3, or 9/10.
Preferably, an L1/2 regularization can be taken as a representative of the Lp (0 < p < 1) penalties, and its analytically expressive thresholding representation has been analyzed. Based on this thresholding representation, solving the L1/2 regularization may be much easier than solving the L0 regularization. Moreover, the L1/2 penalty is unbiased and has oracle properties. These advantages make the L1/2 penalty an effective tool for high dimensional feature selection problems.
However, like most regularization methods, the L1/2 penalty ignores the correlation between features, and therefore cannot analyze data with dependent structures. If there is a set of features whose correlations are relatively high, the L1/2 method tends to select only one feature to represent the corresponding group. In order to solve the problem of highly relevant features, the Elastic net penalty may be applied, which is a linear combination of the L1 and L2 (the ridge technique) penalties. Such a method emphasizes the grouping effect, where strongly correlated features tend to enter or leave the learning model together.
Alternatively, Elastic SCAD (SCAD-L2), a combination of SCAD and L2 penalties for feature interaction may be used.
The hybrid L1/2 + L2 regularization (HLR) approach may fit the logistic regression models for gene selection, where the regularization is a linear combination of the L1/2 and L2 penalties. For any fixed control parameters λ1, λ2 (λ1, λ2 > 0), hybrid L1/2 + L2 regularization (HLR) is defined as follows:

L(λ1, λ2, β) = argmin { R(β) + λ1|β|_1/2 + λ2|β|^2 }    (2)

where |β|_1/2 = Σ_{j=1}^{p} |βj|^{1/2} and |β|^2 = Σ_{j=1}^{p} βj^2. The HLR estimator β̂ is the minimizer of Eq. (3):

β̂ = argmin { L(λ, α, β) }    (3)

where λ = λ1 + λ2 and α = λ1 / (λ1 + λ2).
A strictly convex penalty function may provide a sufficient condition for the grouping effect of features and the L2 penalty ensures a strict convexity. Therefore, the L2 penalty induces the grouping effect simultaneously in the HLR approach. Experimental results on artificial and real gene expression data obtained in some examples demonstrated that the HLR method may be used.
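The HLR penalty, as the stated linear combination of L1/2 and L2 terms, can be sketched directly; the function and variable names below are illustrative assumptions.

```python
def hlr_penalty(beta, lam1, lam2):
    """Hybrid L1/2 + L2 penalty: lam1 * sum_j |beta_j|^(1/2) +
    lam2 * sum_j beta_j^2, following the linear combination described
    for HLR (a minimal sketch)."""
    l_half = sum(abs(b) ** 0.5 for b in beta)   # L1/2 term, drives sparsity
    l2 = sum(b * b for b in beta)               # L2 term, grouping effect
    return lam1 * l_half + lam2 * l2
```

For β = (4, 0, 1), λ1 = 1 and λ2 = 0.5, the L1/2 term is 2 + 0 + 1 = 3 and the L2 term is 17, giving 3 + 8.5 = 11.5.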
However, some efficient regularization methods are nonconvex and need to tune across multiple penalized parameters, which are generally adjusted by the grid search method with a k-fold cross-validation approach. It is believed that the population-based search in EC is an efficient approach to globally optimize these penalized parameters.
Preferably, the global search process includes a biological evolution process involving at least one genetic operator applied to the search population. In Evolutionary Computations (EC), an initial population of candidate solutions is randomly generated in the search space and iteratively updated by artificial crossover, mutation and selection operators. After several generations, the population can gradually develop high quality solutions to the optimization problems.
In some preferred embodiments, local search (LS) technologies may be combined into the random search process of EC to improve the optimization efficiency. These hybrid algorithms may be referred to as memetic
algorithms (MA). MAs for feature selection, which combine wrapper and filter feature evaluation measures, provide an opportunity for population-based optimization with local search.
In one example embodiment, the filter feature ranking method may be used in MA to balance the local and global searches for the purpose of improving the optimization quality and efficiency. Then, the Markov blanket approach may be integrated into MA to simultaneously identify all or part of the relevant features.
Alternatively, in another two-stage feature selection process, a Relief-F algorithm may be used to rank individual features, and the top-ranked features may then be used as input to the memetic wrapper feature selection process.
Heuristic mixtures may be introduced which combine the filter ranking scores to guide the search processes of GA and PSO for wrapper feature selection. Moreover, MAs for feature selection have already been used to solve some real application problems, such as optimal controller design and motif-finding in DNA, microRNA and protein sequences.
As is shown above, in memetic-based feature selection approaches, the EC stage is included for wrapper feature selection, and the filter-based LS algorithm may help to reach a local optimal solution. Some wrapper+filter two-stage memetic approaches do not guarantee that the selected features in the filter stage are also optimal candidates for the EC stage, since the evaluation criteria
of each stage may be totally different. In some examples, the filter stage in MA may eliminate potentially useful features with no regard to their performance in the wrapper process.
Since some example combined feature selection methods may have limitations of inconsistency in feature evaluation measures, feature interactions, and large scales of features and instances, it may be more preferable that a combined wrapper-embedded feature selection approach (WEFSA) using a memetic framework with GA and hybrid L1/2 + L2 regularization (HLR) is provided.
Preferably, the genetic feature identifying system further comprises a genetic information encoder 206 arranged to encode and represent the genetic information in an intron part and an exon part. For example, the genetic information may be a chromosome representation including intron (the penalized control parameters) and exon (the coefficients of the features in the learning model) for memetic optimization procedure.
In the first step of WEFSA, the GA population is randomly initialized with each chromosome encoded by intron and exon parts. Subsequently, as the intron part is associated with penalized control parameters for the regularization process, the hybrid L1/2 + L2 regularization approach (local search) is performed on the exon parts under the fixed intron parts, to reach a local optimal solution or to improve the fitness of individuals in the search population.
The global search process includes a biological evolution process involving at least one genetic operator applied to the search population. Genetic operators such as crossovers and mutations are performed on the intron parts of the chromosomes, and the selection operator generates the next population. This process repeats itself till the stopping conditions are satisfied.
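Since the genetic operators act only on the intron parts, the crossover step can be sketched as exchanging the intron genes (the penalized control parameters) between two parents while keeping each parent's exon part for the local search to refine. This is one hedged reading of the procedure; the exact operator of Figure 3 may differ.

```python
def intron_crossover(parent_a, parent_b, n_intron=2):
    """Swap the intron parts (the first n_intron genes, i.e. the penalized
    control parameters) between two chromosomes; the exon parts (feature
    coefficients) are left untouched. A sketch, not the patent's exact
    operator."""
    child_a = parent_b[:n_intron] + parent_a[n_intron:]
    child_b = parent_a[:n_intron] + parent_b[n_intron:]
    return child_a, child_b
```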
In an example embodiment of the wrapper-embedded feature selection process, a representation for the two penalized control parameters λ1, λ2 and the coefficients (β1, β2, ..., βp) of the candidate feature subset can be encoded by the genetic information encoder 206 as a chromosome: intron + exon = (λ1, λ2, β1, β2, ..., βp). The length of the chromosome is denoted as p + 2, where p is the total number of features. The chromosome is a real value string and its intron part is globally optimized by GA operators.
Although the search space of the intron part is nonconvex and multimodal, GA has the global optimization ability because the dimension of the intron is quite low. On the other hand, the exon part may be optimized by the regularization approach for learning model construction and feature selection synchronously. In the exon part, a nonzero value of βi implies that the corresponding feature has been selected. In contrast, the candidate feature has been rejected if its corresponding coefficient βi is equal to zero. The maximum allowable number of nonzero βi in the exon of each chromosome is denoted as T. When prior knowledge about the optimal number of features is
available, T may be limited to no more than the predefined value; otherwise T is equal to p.
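As a sketch, the intron + exon encoding might be initialized as below. The value ranges for λ and α are illustrative assumptions, not values taken from the embodiment.

```python
import numpy as np

def init_chromosome(p, rng, T=None):
    """Encode a chromosome of length p + 2: intron (lambda, alpha) + exon (beta)."""
    lam = rng.uniform(0.01, 1.0)      # penalized control parameter lambda (intron)
    alpha = rng.uniform(0.0, 1.0)     # penalized control parameter alpha (intron)
    T = p if T is None else T         # cap on nonzero coefficients, if known a priori
    beta = np.zeros(p)                # exon: one real-valued coefficient per feature
    chosen = rng.choice(p, size=T, replace=False)
    beta[chosen] = rng.standard_normal(T)
    return np.concatenate(([lam, alpha], beta))
```

A nonzero entry in the exon marks the corresponding feature as selected; `T` caps the number of nonzeros when prior knowledge is available.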
The objective function may be defined by:
Fitness(chromosome) = Accuracy of the classification model with (λ, α, β₁, β₂, ..., β_p)    (4)

where a nonzero βᵢ denotes that the corresponding feature belongs to the selected subset encoded in the exon part of the chromosome. The objective function evaluates the significance of the given feature subset. In this embodiment, the fitness of the objective function is specified as the classification accuracy of the logistic regularization model with the chromosome (λ, α, β₁, ..., β_p), using the hybrid L1/2 + L2 penalties method. Note that when two chromosomes are found to have similar fitness, i.e., the difference between their fitness is less than a small threshold ε, the one with the smaller number of selected features is given a higher chance of surviving to the next generation.
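This survival rule can be sketched as a comparison helper; the tolerance value below is an assumed placeholder, since the exact ε is not legible in the source.

```python
def better(c1, c2, eps=1e-3):
    """Pick the fitter chromosome; on a near-tie, prefer fewer selected features."""
    # c1, c2 are (fitness, n_selected_features) pairs; eps is an assumed tolerance
    if abs(c1[0] - c2[0]) < eps:             # fitness difference below tolerance
        return c1 if c1[1] <= c2[1] else c2  # parsimony tie-break
    return c1 if c1[0] > c2[0] else c2
```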
The hybrid L1/2+L2 penalties method with the coordinate descent algorithm may be applied as the meme, i.e. the local search approach. The coordinate descent algorithm may be used because it is an efficient method for solving regularization problems: its computational time increases linearly with the dimension of the feature selection problem. Preferably, WEFSA is capable of constructing the learning model and selecting the relevant features simultaneously.
The hybrid L1/2+L2 regularization (HLR) in the logistic model may be formulated as:

β̂ = arg min_β [R(β) + λP(β)]    (5)

where λ = λ₁ + λ₂ and R(β) is the loss function in logistic regression:

R(β) = (1/n) Σᵢ₌₁ⁿ {−yᵢ(Xᵢ′β) + log(1 + exp(Xᵢ′β))}    (6)

Here, yᵢ ∈ {0, 1}, i = 1, ..., n, denotes the decision vector of binary values in the logistic model. P(β) is the HLR penalty function, defined as:

P(β) = α Σⱼ₌₁ᵖ |βⱼ|^(1/2) + (1 − α) Σⱼ₌₁ᵖ |βⱼ|²    (7)

where α = λ₁/(λ₁ + λ₂) and 0 ≤ α ≤ 1.
By using the approach of the original coordinate-wise update, each coefficient is updated as:

β̃ⱼ = Half(ωⱼ, λα) / (1 + λ(1 − α))    (8)

where 1 ≤ j ≤ p and

ωⱼ = Σᵢ₌₁ⁿ xᵢⱼ(yᵢ − ỹᵢ⁽ʲ⁾)    (9)

and the partial residual ỹᵢ⁽ʲ⁾ for fitting β̃ⱼ is defined as:

ỹᵢ⁽ʲ⁾ = Σₖ≠ⱼ xᵢₖ β̃ₖ    (10)

Additionally, Half(·) is the L1/2 thresholding operator in the coordinate-wise update form for the HLR approach:

Half(ωⱼ, λ) = (2/3) ωⱼ (1 + cos((2/3)(π − φλ(ωⱼ))))  if |ωⱼ| > (³√54/4) λ^(2/3), and 0 otherwise    (11)

where φλ(ω) = arccos((λ/8)(|ω|/3)^(−3/2)).
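A direct transcription of the thresholding operator might look as follows. This is a sketch following the standard L1/2 half-thresholding form from the regularization literature; the threshold constant ³√54/4 is taken from that literature, not from the (partly illegible) source.

```python
import math

def half_threshold(omega, lam):
    """L1/2 half-thresholding operator: shrink omega, or set it to zero."""
    threshold = (54 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    if abs(omega) <= threshold:
        return 0.0                     # coefficient dropped from the model
    phi = math.acos((lam / 8.0) * (abs(omega) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * omega * (1.0 + math.cos((2.0 / 3.0) * (math.pi - phi)))
```

For large |ω| the operator behaves almost like the identity, while values below the threshold are set exactly to zero, which is what produces sparse feature selection.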
Therefore, Eq. (5) can be linearized by a one-term Taylor series expansion:

arg min_β [Σᵢ₌₁ⁿ wᵢ(Zᵢ − Xᵢ′β)² + λP(β)]    (12)

where Zᵢ is the estimated response and wᵢ is the weight for Zᵢ, which can be defined as follows:

Zᵢ = Xᵢ′β̃ + (yᵢ − f(Xᵢ′β̃)) / (f(Xᵢ′β̃)(1 − f(Xᵢ′β̃)))    (13)

wᵢ = f(Xᵢ′β̃)(1 − f(Xᵢ′β̃))    (14)

where f(Xᵢ′β̃) is the evaluated value under the current parameters:

f(Xᵢ′β̃) = exp(Xᵢ′β̃) / (1 + exp(Xᵢ′β̃))    (15)

Thus, the quantities in Eqs. (9) and (10) may be further redefined for fitting the current β̃ⱼ as:

Z̃ᵢ⁽ʲ⁾ = Σₖ≠ⱼ xᵢₖ β̃ₖ    (16)

ωⱼ = Σᵢ₌₁ⁿ wᵢ xᵢⱼ(Zᵢ − Z̃ᵢ⁽ʲ⁾)    (17)
The procedure of the coordinate descent algorithm for the HLR penalized logistic model is described as follows.
Algorithm 1: The coordinate descent algorithm for the HLR penalized logistic model
1: Initialize all βⱼ(m) ← 0 (j = 1, 2, ..., p), set m ← 0; λ and α are set by the GA;
2: if β(m) does not converge then
3:   repeat
4:     Calculate Z(m) and W(m) and approximate the loss function Eq. (12) based on the current β(m);
5:     for j = 1 to p do
6:       Compute Z̃⁽ʲ⁾(m) and ωⱼ(m) ← Σᵢ wᵢ(m) xᵢⱼ(Zᵢ(m) − Z̃ᵢ⁽ʲ⁾(m));
7:       Update βⱼ(m) by Eq. (8);
8:     end for
9:     m ← m + 1, β(m + 1) ← β(m);
10:    until there are no more features to be removed;
11: end if
12: return the optimal feature subset.
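Algorithm 1 might be sketched as below. This is a simplified illustration, not the embodiment's exact implementation: a fixed number of outer iterations replaces the convergence test, the IRLS weights are floored for numerical stability, and the exact normalization of the thresholding step relative to Eq. (8) is an assumption.

```python
import numpy as np

def half_threshold(omega, lam):
    """L1/2 half-thresholding operator, Eq. (11)."""
    if abs(omega) <= (54 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0):
        return 0.0
    phi = np.arccos((lam / 8.0) * (abs(omega) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * omega * (1.0 + np.cos((2.0 / 3.0) * (np.pi - phi)))

def hlr_logistic_cd(X, y, lam, alpha, n_outer=10):
    """Coordinate descent for the HLR penalized logistic model (sketch)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_outer):                        # IRLS outer loop, Eqs. (12)-(15)
        eta = np.clip(X @ beta, -30, 30)
        f = 1.0 / (1.0 + np.exp(-eta))              # Eq. (15)
        w = np.maximum(f * (1.0 - f), 1e-5)         # Eq. (14), floored for stability
        z = eta + (y - f) / w                       # Eq. (13)
        for j in range(p):                          # coordinate-wise inner loop
            partial = X @ beta - X[:, j] * beta[j]  # Eq. (16): fit without feature j
            omega = np.sum(w * X[:, j] * (z - partial))          # Eq. (17)
            denom = np.sum(w * X[:, j] ** 2) + lam * (1.0 - alpha)
            beta[j] = half_threshold(omega, lam * alpha) / denom
    return beta
```

On a toy problem where the label depends on a single feature, the returned coefficient vector concentrates on that feature and classifies the training data well.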
In the evolution process of WEFSA, standard GA operators such as fitness-proportionate selection, one-point crossover and uniform mutation operators can be applied. Moreover, if prior knowledge on the optimal number of features is available, the number of nonzero βᵢ in each exon part of the chromosome may be constrained to a maximum of T in the evolution process.
In an example evolution process such as crossover, two parents (pa, ma) may be randomly selected from the current population for later breeding. Then, the operation of
crossover may be used to produce offspring that inherit characteristics from both parents. A single crossover point on the intron of both the pa and ma chromosomes is generated between the penalized control parameters λ and α; the two penalized control parameters on either side of that point are then swapped in the introns of the parents' chromosomes to create the intron parts of the offspring chromosomes c1 and c2. The exon β of these two offspring chromosomes is evaluated by the local optimization strategies.
For the mutation process, the mutation operator allows diversity of populations and larger exploration of the search space. During this stage, one of the penalized control parameters λ, α is randomly chosen, with a mutation probability pm (e.g. pm = 0.1), to mutate a selected chromosome. The fitness and β of the new chromosome generated by the mutation operation are also evaluated by the local optimization strategies.
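The crossover and mutation of the two-parameter intron can be sketched as follows. The redraw ranges used for mutation are illustrative assumptions.

```python
import random

def crossover_intron(pa, ma):
    """One-point crossover between lambda and alpha: swap the parameters."""
    (lam_p, alpha_p), (lam_m, alpha_m) = pa, ma
    c1 = (lam_p, alpha_m)   # each offspring inherits one parameter from each parent
    c2 = (lam_m, alpha_p)
    return c1, c2

def mutate_intron(intron, rng, pm=0.1):
    """With probability pm, randomly redraw one of the two control parameters."""
    lam, alpha = intron
    if rng.random() < pm:
        if rng.random() < 0.5:
            lam = rng.uniform(0.01, 1.0)    # assumed range for lambda
        else:
            alpha = rng.uniform(0.0, 1.0)
    return (lam, alpha)
```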
The roulette-wheel selection may be used to generate the next generation from the parent and offspring populations. The selection probability prob_c of a chromosome c is directly proportional to its fitness, i.e.,

prob_c = f(c) / (Σ f(parent) + Σ f(offspring))    (18)

At the genetic selection stage, the candidate chromosomes with higher accuracy are less likely to be eliminated, while weaker chromosomes still retain a chance of being selected.
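Roulette-wheel selection in the sense of Eq. (18) can be sketched as:

```python
import random

def roulette_select(population, fitnesses, k, rng):
    """Draw k survivors with probability proportional to fitness, per Eq. (18)."""
    total = sum(fitnesses)
    weights = [f / total for f in fitnesses]   # prob_c = f(c) / total fitness
    return rng.choices(population, weights=weights, k=k)
```

Fitter chromosomes dominate the draws, but lower-fitness chromosomes are not eliminated outright.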
The inventors also evaluated the performance of the WEFSA approach in a simulation study of the system in
accordance with the embodiments of the present invention. In the experiments, six approaches were compared against WEFSA: GA, GP, MA, Elastic net, SCAD-L2, and the hybrid L1/2+L2 regularization (HLR). Data from a true model:
f(X) = X′β + σε,  ε ∼ N(0, 1)

where X ∼ N(0, 1), ε is the independent random noise, and σ is the control parameter for the noise, was simulated.
Three scenarios are presented here. In every example, the dimension of features is 6000. The notation '/' represents the number of observations in the training and test sets respectively, e.g. 100/100. Here are the details of the three scenarios.
In Scenario 1, the dataset consists of 200/200 observations, the noise control parameter σ = 0.2, and

β = (1, −1, ..., 1, −1, 0, ..., 0, 2, −2, ..., 2, −2, 0, ..., 0, 2, 2, ..., 2, 0, ..., 0)

with consecutive block sizes 100, 1900, 100, 1900, 100, 1900. A grouped feature situation was simulated:

xⱼ = ρ × x₁ + (1 − ρ) × xⱼ,  j = 2, 3, ..., 100;
xⱼ = ρ × x₂₀₀₁ + (1 − ρ) × xⱼ,  j = 2002, 2003, ..., 2100;
xⱼ = ρ × x₄₀₀₁ + (1 − ρ) × xⱼ,  j = 4002, 4003, ..., 4100.
where ρ is the correlation coefficient of the grouped variables. In this example, there are three groups of correlated features. An ideal sparse regression method would select only the 300 true features and set the coefficients of the 5700 irrelevant features to zero.
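The grouped-feature design of Scenario 1 can be sketched as follows; note that 0-based column indices replace the 1-based indices of the description.

```python
import numpy as np

def simulate_grouped(n=200, p=6000, rho=0.4, seed=0):
    """Simulate three groups of features correlated with anchor features."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    # anchors x1, x2001, x4001 (1-based) -> columns 0, 2000, 4000 (0-based)
    for anchor, start, stop in [(0, 1, 100), (2000, 2001, 2100), (4000, 4001, 4100)]:
        X[:, start:stop] = rho * X[:, [anchor]] + (1 - rho) * X[:, start:stop]
    return X
```

Each grouped column is a mixture of its anchor and independent noise, so within-group correlation rises with ρ while columns outside the groups stay uncorrelated.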
Scenario 2 is defined similarly to Scenario 1, except that there are additional independent factors which also contribute to the decision variable y:

β = (1, −1, ..., 1, −1, 1.5, −2, 1.7, 3, −1, ..., 0, ..., 0, 2, −2, ..., 2, −2, 1.5, −2, 1.7, 3, −1, ..., 0, ..., 0, 2, 2, ..., 2, 1.5, −2, 1.7, 3, −1, ..., 0, ..., 0)

with consecutive block sizes 100, 5 × 20, 1800, 100, 5 × 20, 1800, 100, 5 × 20, 1800, where each 5 × 20 block repeats the five values (1.5, −2, 1.7, 3, −1) twenty times.
In this example, there are three groups of correlated features (similar to Scenario 1) and 300 single independent features. An ideal sparse regression method would select the 600 true features and set the coefficients of the 5400 irrelevant features to zero.
In Scenario 3, the number of true features was increased to 1000 of the total features, σ = 0.1, the dataset consists of 500/100 observations, and

β = (1, −1, ..., 1, −1, 1.5, −2, 1.7, 3, −1, ..., 0, ..., 0, 2, −2, ..., 2, −2, 1.5, −2, 1.7, 3, −1, ..., 0, ..., 0, 2, 2, ..., 2, 1.5, −2, 1.7, 3, −1, 1, 1, ..., 1, 0, ..., 0)

with consecutive block sizes 100, 5 × 20, 1800, 100, 5 × 20, 1800, 100, 5 × 20, 400, 1400, and:

xⱼ = ρ × x₁ + (1 − ρ) × xⱼ,  j = 2, 3, ..., 100;
xⱼ = ρ × x₂₀₀₁ + (1 − ρ) × xⱼ,  j = 2002, 2003, ..., 2100;
xⱼ = ρ × x₄₀₀₁ + (1 − ρ) × xⱼ,  j = 4002, 4003, ..., 4100;
xⱼ = 0.1 × x₄₂₀₁ + 0.9 × xⱼ,  j = 4202, 4203, ..., 4600.
In this example, there are three groups of correlated features (similar to Scenario 1), 400 correlated features (with correlation parameter 0.1) and 300 independent features. An ideal sparse regression method would select only the 1000 true features and set the coefficients of the 5000 irrelevant features to zero.
In one example, the correlation coefficient ρ of the features may be set to 0.1, 0.4 and 0.7 respectively. The learning model in GA, MA, Elastic net, SCAD-L2, HLR and WEFSA is the logistic classification approach. In GP, the multitree classifier is used. For each iteration of GA and MA, the number of selected features based on the information-gain filter is set to 2000. The configuration parameters used by the EC algorithms in these seven approaches are listed in Table I.
TABLE I
Parameters set for the EC algorithms in the seven approaches
Parameter                     Value
Population size (P)           200
Crossover probability (pc)    0.85
Mutation probability (pm)     0.1
Stopping criterion (G)        2000
In the regularization process of these seven approaches, the control parameters of the Elastic net, SCAD-L2 and HLR approaches are tuned by the 10-fold cross-validation (CV) approach on the training set. Note that the Elastic net and HLR methods are tuned by the 10-CV approach on two-dimensional parameter surfaces, while SCAD-L2 is tuned by the 10-CV approach on three-dimensional parameter surfaces. Then, different classifiers are built by these seven feature selection approaches. Finally, the obtained classifiers are applied to the test set for classification and prediction.
The simulations may be repeated 100 times for each method, and the mean classification accuracy computed on the test sets. To evaluate the quality of the selected features for these approaches, the sensitivity and specificity of the feature selection performance are defined as follows:
TruePositive (TP) := ‖β̂ .∗ β‖₀,
TrueNegative (TN) := ‖¬β̂ .∗ ¬β‖₀,
FalsePositive (FP) := ‖β̂ .∗ ¬β‖₀,
FalseNegative (FN) := ‖¬β̂ .∗ β‖₀,

Sensitivity := TP / (TP + FN),  Specificity := TN / (TN + FP)

where .∗ is the element-wise product, ‖·‖₀ counts the number of non-zero elements in a vector, and ¬β̂ and ¬β are the logical-not operators on the estimated coefficient vector β̂ and the true simulated β.
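These definitions translate directly into code; a minimal sketch on coefficient support vectors:

```python
import numpy as np

def selection_metrics(beta_hat, beta_true):
    """Sensitivity and specificity of a selected feature subset."""
    sel = beta_hat != 0            # features picked by the model
    rel = beta_true != 0           # truly relevant features
    tp = np.sum(sel & rel)         # correctly selected
    tn = np.sum(~sel & ~rel)       # correctly excluded
    fp = np.sum(sel & ~rel)        # falsely selected
    fn = np.sum(~sel & rel)        # falsely excluded
    return tp / (tp + fn), tn / (tn + fp)
```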
Table II shows the feature selection and classification performances of different methods in the different parameter settings with Scenarios 1-3, in which
the results with the best performances are denoted in bold text.
TABLE II
Results of the simulation
(columns 1–3 denote Scenarios 1–3)

ρ     Methods        Sensitivity                Specificity                Accuracy
                     1      2      3            1      2      3            1        2        3
0.1   GA             0.902  0.584  0.527        0.991  0.956  0.862        94.32%   80.48%   77.54%
      GP             0.915  0.748  0.726        0.997  0.987  0.907        95.50%   88.47%   79.66%
      MA             0.908  0.679  0.652        0.993  0.971  0.893        94.73%   82.91%   80.65%
      Elastic net    0.910  0.726  0.724        0.994  0.975  0.904        94.53%   83.03%   79.78%
      SCAD-L2        0.916  0.795  0.758        0.997  0.982  0.912        94.49%   82.12%   80.41%
      HLR            0.919  0.863  0.791        0.998  0.987  0.918        95.81%   90.15%   85.76%
      WEFSA          0.935  0.906  0.823        0.998  0.989  0.926        97.08%   91.23%   87.81%
0.4   GA             0.724  0.531  0.457        0.985  0.923  0.813        89.71%   76.49%   70.37%
      GP             0.798  0.712  0.674        0.992  0.957  0.866        93.64%   82.87%   77.82%
      MA             0.741  0.635  0.572        0.987  0.929  0.848        89.84%   80.04%   75.63%
      Elastic net    0.805  0.712  0.623        0.991  0.940  0.863        92.06%   82.19%   75.19%
      SCAD-L2        0.837  0.741  0.698        0.992  0.949  0.894        92.44%   82.84%   76.51%
      HLR            0.862  0.820  0.725        0.994  0.960  0.903        93.89%   83.45%   79.22%
      WEFSA          0.904  0.852  0.782        0.995  0.972  0.912        95.31%   85.79%   80.06%
0.7   GA             0.563  0.467  0.417        0.961  0.891  0.775        75.08%   69.04%   62.65%
      GP             0.620  0.665  0.633        0.984  0.928  0.832        90.15%   73.94%   70.24%
      MA             0.596  0.579  0.536        0.971  0.897  0.794        89.66%   70.96%   66.30%
      Elastic net    0.675  0.637  0.561        0.977  0.905  0.816        88.17%   71.85%   65.26%
      SCAD-L2        0.691  0.694  0.583        0.986  0.929  0.822        89.79%   74.27%   68.83%
      HLR            0.763  0.729  0.671        0.988  0.937  0.837        90.04%   77.18%   73.75%
      WEFSA          0.820  0.754  0.724        0.991  0.943  0.851        92.34%   80.65%   76.94%
It is found that as the correlation coefficient ρ decreases, the models' performances improve. In Table II, the WEFSA approach always selects the most correct relevant features in the different data environments of Scenarios 1–3. The highest sensitivities and specificities of feature selection obtained by WEFSA mean that WEFSA selects the most relevant features and deletes the most irrelevant features respectively. Thus, the classification accuracy obtained by the WEFSA approach also outperforms the other EC and regularization methods.
In another experiment performed by the inventors, to further evaluate the effectiveness of the WEFSA method, five example gene expression microarray datasets were used, including AML, DLBCL, Prostate, Lymphoma and Lung cancer. The AML dataset has 116 patient samples, each containing 6283 genes. The DLBCL dataset contains about 240 samples, each including the expression data of 8810 genes. The Prostate dataset contains the expression
profiles of 12,600 genes for 50 normal tissues and 52 prostate tumour tissues. The Lymphoma dataset contains 77 microarray gene expression profiles of the two most prevalent adult lymphoid malignancies: 58 samples of diffuse large B-cell lymphoma (DLBCL) and 19 follicular lymphomas (FL). The original data contain 7,129 gene expression values. The Lung cancer dataset contains 164 samples, with 87 lung adenocarcinomas and 77 adjacent normal tissues, and 22,401 microarray gene expression profiles. A brief summary of these datasets is provided in Table III below.
TABLE III
The detailed information of the five real gene expression datasets used in the experiments

Dataset       No. samples   No. genes   Classes
AML           116           6283        High risk / Low risk
DLBCL         240           8810        High risk / Low risk
Lymphoma      77            7129        DLBCL / FL
Prostate      102           12600       Normal / Tumor
Lung cancer   164           22401       Normal / Tumor
In order to accurately assess the performance of the seven different feature selection approaches, the real datasets are randomly divided into two sets: two thirds of the samples are put in the training set used for the model estimation, and the remaining one third of the data are used to test the estimation performance. For the regularization approaches, the penalized parameters are tuned by 10-fold cross-validation. For each real dataset, the procedures using the different methods are repeated over 100 times.
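The repeated two-thirds/one-third split can be sketched as:

```python
import numpy as np

def random_split(n, rng):
    """Randomly assign two thirds of the samples to training, the rest to test."""
    idx = rng.permutation(n)
    cut = (2 * n) // 3
    return idx[:cut], idx[cut:]   # disjoint train / test index sets
```

Calling this with a fresh permutation on each of the 100 repetitions gives the randomized evaluation protocol described above.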
TABLE IV
Results of empirical datasets

Dataset       Methods       Training accuracy   Test accuracy   No. selected genes
AML           GA            95.93%              91.87%          32
              GP            97.02%              92.82%          21
              MA            96.35%              91.13%          25
              Elastic net   96.67%              92.04%          28
              SCAD-L2       96.62%              92.94%          23
              HLR           97.46%              93.78%          22
              WEFSA         97.84%              94.32%          19
DLBCL         GA            91.97%              88.58%          24
              GP            95.34%              91.22%          14
              MA            93.40%              90.34%          18
              Elastic net   94.62%              92.54%          21
              SCAD-L2       95.39%              92.03%          16
              HLR           97.21%              93.15%          17
              WEFSA         97.28%              93.73%          13
Lymphoma      GA            95.41%              93.37%          63
              GP            96.14%              92.64%          38
              MA            96.08%              92.17%          54
              Elastic net   95.93%              91.65%          41
              SCAD-L2       96.42%              93.26%          28
              HLR           —                   —               29
              WEFSA         98.51%              94.03%          27
Prostate      GA            95.79%              90.34%          42
              GP            97.26%              93.81%          27
              MA            95.07%              90.83%          34
              Elastic net   96.52%              92.51%          31
              SCAD-L2       95.82%              92.89%          26
              HLR           97.15%              92.63%          23
              WEFSA         98.32%              94.17%          22
Lung cancer   GA            97.14%              92.11%          51
              GP            98.25%              91.59%          45
              MA            97.42%              90.85%          49
              Elastic net   96.94%              92.27%          34
              SCAD-L2       97.63%              92.48%          36
              HLR           98.16%              —               34
              WEFSA         98.83%              93.61%          33

(—: value not legible in the original.)
Table IV describes the averaged training accuracies (10-CV) and test accuracies obtained by the different feature selection approaches and regularization models in the five datasets, in which the results with the best performances are denoted in bold text.
It is obvious that the performance of the WEFSA approach is better than that of the other six approaches. The relevant gene selection performances of the different approaches in the five real datasets are also shown in Table IV. The number of genes selected by the WEFSA model is the smallest compared with the other six feature selection approaches. Among the regularization approaches with a grouping effect, such as Elastic net, SCAD-L2, and HLR, the performance of HLR is better than that of Elastic net and SCAD-L2 in gene selection.
By comparison, among the EC approaches, such as GA, GP and MA, the performance of GP is better than that of GA and MA in gene selection and classification. Comparing the performances of the seven feature selection algorithms, Table IV shows that the WEFSA approach has better performance in both gene selection and predictive classification.
For biological analysis of the results, the 10 top-ranked selected genes obtained by the different methods in the
AML dataset are shown in Table V below.
TABLE V
The 10 top genes in the AML dataset

Rank   GA       GP       MA       Elastic net   SCAD-L2   HLR      WEFSA
1      VEGF     SNRPN    MEIS1    GSTM1         DNMT1     MLH1     FLT3
2      PRDX2    FHIT     ALOX12   SFRP2         CDH13     CCDC69   CDKN2B
3      TP73     PNLIP    CDKN2A   GRAF          SFRP5     JUNB     INK4B
4      SFRP5    INK4B    CDH13    GSTM1         ABCA8     GLIPR1   GSTM1
5      FHIT     A4GALT   SNRPN    PTPN6         RARA      PTPN6    SFRP1
6      GLIPR1   SFRP1    GSTM1    FHIT          GSTM1     RUNX3    NPM1
7      GRK5     PRDX2    GRAF     JUNB          WNT5A     GSTM1    SFRP2
8      TNXB     SLN      SFRP5    SFRP5         PNLIP     MEIS1    SFRP5
9      GRAF     PTPN6    PDLIM2   ALOX12        DAPK1     ALOX12   CEBPA
10     PTRF     DAPK1    PTRF     CDKN2A        HOXA9     SFRP5    GLIPR1
Compared with the other feature selection methods, the WEFSA approach selects some unique genes, such as SFRP1 and SFRP2, which are members of the Sfrp family, a kind of signal transduction protein. The Sfrp family proteins play a key role in transmitting TGF-beta signals from the cell-surface receptor to the cell nucleus, and their mutation or deletion has been proved to lead to pancreatic cancer. It is believed that the Sfrp family may be strongly associated with AML diseases.
Among the other genes selected by the WEFSA approach, the gene FLT3 can stimulate the motility of AML diseases. The expression of FLT3 has been found to be up-regulated in several different kinds of AML diseases. The protein encoded by the gene NPM1 is said to be very similar to the tumour suppressor of drosophila, and is highly relevant to AML diseases.
Moreover, some relevant genes selected by the other regularization models using the Elastic net, SCAD-L2, and HLR approaches, for example SFRP5 and GSTM1, are also found by WEFSA. They are significantly associated with AML diseases, as has been discussed in the literature.
Advantageously, the WEFSA approach not only can find the relevant genes that are selected by other feature selection methods, but can also find some unique genes which are not selected by other models yet are significantly associated with diseases. Hence, the WEFSA approach may identify the relevant genes accurately and efficiently.
It will also be appreciated that where the methods and systems of the present invention are either wholly implemented by a computing system or partly implemented by computing systems, then any appropriate computing system architecture may be utilised. This will include standalone computers, network computers and dedicated hardware devices. Where the terms computing system and computing device are used, these terms are intended to cover any appropriate arrangement of computer hardware capable of implementing the function described.
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments
without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
Any reference to prior art contained herein is not to be taken as an admission that the information is common general knowledge, unless otherwise indicated.

Claims (15)

CLAIMS:

1. A search method for identifying features of genetic information, comprising the step of processing the genetic information using a combined global search process and local search process, wherein the global search process includes a population-based optimization process arranged to determine a global optimum of a search population associated with the genetic information, and wherein the local search process includes a machine-learning optimization process arranged to further optimize the search population so as to determine at least one signature feature associated with the genetic information.

2. The search method in accordance with Claim 1, wherein the global search process includes a biological evolution process involving at least one genetic operator applied to the search population.

3. The search method in accordance with Claim 1, wherein the global search process includes an optimization of a plurality of control parameters for a regularization process.

4. The search method in accordance with Claim 3, wherein the regularization process includes a hybrid L1/2 + L2 regularization (HLR) process.

5. The search method in accordance with Claim 4, further comprising the step of encoding and representing the genetic information in an intron part and an exon part.

6. The search method in accordance with Claim 5, wherein the intron part is associated with penalized control parameters for the regularization process and the exon part is associated with coefficients used in the machine-learning optimization process.

7. The search method in accordance with Claim 3, wherein the global search process includes a wrapper feature selection process arranged to induce the search population and to perform a heuristic searching process to globally optimize the plurality of control parameters for the regularization process.

8. The search method in accordance with Claim 1, wherein the local search process includes an embedded feature selection process arranged to optimize the search population by selecting the signature feature and to construct a learning model for the machine-learning optimization process.

9. The search method in accordance with Claim 8, wherein the learning model is constructed based on an efficient gradient regularization process.

10. The search method in accordance with Claim 1, wherein the combined global search process and local search process is based on a memetic framework arranged to facilitate an integration of the determination of the signature features and the machine-learning optimization process.

11. A genetic feature identifying system for identifying features of genetic information, comprising a global search module and a local search module arranged to process the genetic information using a global search process and a local search process respectively, wherein the global search process includes a population-based optimization process arranged to determine a global optimum of a search population associated with the genetic information, and wherein the local search process includes a machine-learning optimization process arranged to further optimize the search population so as to determine at least one signature feature associated with the genetic information.

12. The genetic feature identifying system in accordance with Claim 11, wherein the global search process includes a biological evolution process involving at least one genetic operator applied to the search population.

13. The genetic feature identifying system in accordance with Claim 11, wherein the global search module is arranged to optimize a plurality of control parameters for a regularization process.

14. The genetic feature identifying system in accordance with Claim 13, wherein the regularization process includes a hybrid L1/2 + L2 regularization (HLR) process.

15. The genetic feature identifying system in accordance with Claim 14, further comprising a genetic information encoder arranged to encode and represent the genetic information in an intron part and an exon part.

16. The genetic feature identifying system in accordance with Claim 15, wherein the intron part is associated with penalized control parameters for the regularization process and the exon part is associated with coefficients used in the machine-learning optimization process.

17. The genetic feature identifying system in accordance with Claim 13, wherein the global search module is arranged to perform a wrapper feature selection process so as to induce the search population and to perform a heuristic searching process to globally optimize the plurality of control parameters for the regularization process.

18. The genetic feature identifying system in accordance with Claim 11, wherein the local search module is arranged to perform an embedded feature selection process so as to optimize the search population by selecting the signature feature and to construct a learning model for the machine-learning optimization process.

19. The genetic feature identifying system in accordance with Claim 18, wherein the local search module is further arranged to construct the learning model based on an efficient gradient regularization process.

20. The genetic feature identifying system in accordance with Claim 11, wherein a combination of the global search module and the local search module is arranged to facilitate an integration of the determination of the signature features and the machine-learning optimization process based on a memetic framework.
    FIG. 3
AU2018100796A 2018-06-14 2018-06-14 A genetic feature identifying system and a search method for identifying features of genetic information Ceased AU2018100796A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2018100796A AU2018100796A4 (en) 2018-06-14 2018-06-14 A genetic feature identifying system and a search method for identifying features of genetic information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2018100796A AU2018100796A4 (en) 2018-06-14 2018-06-14 A genetic feature identifying system and a search method for identifying features of genetic information

Publications (1)

Publication Number Publication Date
AU2018100796A4 true AU2018100796A4 (en) 2018-07-19

Family

ID=62845391

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2018100796A Ceased AU2018100796A4 (en) 2018-06-14 2018-06-14 A genetic feature identifying system and a search method for identifying features of genetic information

Country Status (1)

Country Link
AU (1) AU2018100796A4 (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020112478A1 (en) * 2018-11-29 2020-06-04 Somalogic, Inc. Methods for determining disease risk combining downsampling of class-imbalanced sets with survival analysis
CN113271849A (en) * 2018-11-29 2021-08-17 私募蛋白质体公司 Disease risk determination method combining category imbalance set down-sampling and survival analysis
CN110909158A (en) * 2019-07-05 2020-03-24 重庆信科设计有限公司 Text classification method based on improved firefly algorithm and K nearest neighbor
CN110909158B (en) * 2019-07-05 2022-10-18 重庆信科设计有限公司 Text classification method based on improved firefly algorithm and K nearest neighbor

Similar Documents

Publication Publication Date Title
Camacho et al. Next-generation machine learning for biological networks
Xu et al. Inference of genetic regulatory networks with recurrent neural network models using particle swarm optimization
Nyathi et al. Comparison of a genetic algorithm to grammatical evolution for automated design of genetic programming classification algorithms
Lægreid et al. Predicting gene ontology biological process from temporal gene expression patterns
Meyer et al. Advances in systems biology modeling: 10 years of crowdsourcing DREAM challenges
Lai et al. Artificial intelligence and machine learning in bioinformatics
Manning et al. Biologically inspired intelligent decision making: a commentary on the use of artificial neural networks in bioinformatics
US20170193157A1 (en) Testing of Medicinal Drugs and Drug Combinations
WO2016118513A1 (en) Method and system for analyzing biological networks
Fonseca et al. Phylogeographic model selection using convolutional neural networks
Roohani et al. GEARS: Predicting transcriptional outcomes of novel multi-gene perturbations
AU2018100796A4 (en) A genetic feature identifying system and a search method for identifying features of genetic information
Böck et al. Hub-centered gene network reconstruction using automatic relevance determination
Elzeki et al. A new hybrid genetic and information gain algorithm for imputing missing values in cancer genes datasets
US11527308B2 (en) Enhanced optimization with composite objectives and novelty-diversity selection
Al‐Anni et al. Prediction of NSCLC recurrence from microarray data with GEP
Vimaladevi et al. A microarray gene expression data classification using hybrid back propagation neural network
Bacardit et al. Hard data analytics problems make for better data analysis algorithms: bioinformatics as an example
Deshpande et al. Efficient strategies for screening large-scale genetic interaction networks
Garro et al. Designing artificial neural networks using differential evolution for classifying DNA microarrays
Mohammadi et al. Multi-resolution single-cell state characterization via joint archetypal/network analysis
Husseini et al. Type2 soft biclustering framework for Alzheimer microarray
Shafiekhani et al. Extended robust Boolean network of budding yeast cell cycle
González-Álvarez et al. Convergence analysis of some multiobjective evolutionary algorithms when discovering motifs
González-Alvarez et al. Multiobjective optimization algorithms for motif discovery in DNA sequences

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry