WO2022182291A1 - Prediction function generator - Google Patents


Info

Publication number
WO2022182291A1
Authority
WO
WIPO (PCT)
Prior art keywords
alpha
trained
alphas
repetition
tasks
Prior art date
Application number
PCT/SG2022/050085
Other languages
French (fr)
Inventor
Can CUI
Wei Wang
Zhaojing LUO
Beng Chin Ooi
Original Assignee
National University Of Singapore
Application filed by National University Of Singapore filed Critical National University Of Singapore
Publication of WO2022182291A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/086 Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks

Definitions

  • the present invention relates, in general terms, to generating prediction functions.
  • the present invention may be applied in various fields, from viral spread prediction to stock market analysis, by generating prediction functions having weak correlation.

Background
  • Prediction functions are used as models to determine how particular tasks will change over time.
  • a "task” may be a virus and the prediction function may seek to predict the manner in which it will spread.
  • Another common form of prediction function is used to predict the price movements of various stocks listed on a stock exchange. These prediction functions are referred to as "alphas".
  • Alphas relating to stocks are stock prediction models that capture trading signals in a stock market. A set of good alphas with high returns are weakly correlated to diversify risk. Existing alphas can be categorized into two classes:
  • (1) Formulaic alphas: these alphas are simple algebraic expressions and thus can generalize well and be mined into a weakly-correlated set. Such alphas model scalar features; and (2) Machine learning alphas: these are data-driven, being trained from data. Such alphas model high dimensional features in vectors or matrices. However, machine learning alphas are too complex to be mined into a weakly-correlated set. Formulaic alphas are popular given the fine properties of their algebraic expressions: compactness, explicit interpretability, and good generalizability. Weak correlation is typically designed into the development of alphas, using domain knowledge.
  • ML alphas are machine learning models generating trading signals. They can model high dimensional features and learn from training data to boost the generalization performance. However, they are complex by design and thus not easily mined into a weakly-correlated set. Further, ML alphas are designed with strong structural assumptions. For example, a model with the injection of relational domain knowledge (e.g. two stocks are in the same sector) assumes the returns of related stocks change similarly - e.g. stocks of the same industry sector. However, this assumption is less likely to hold for volatile stocks.
  • the new class of alphas can model scalar, vector, and matrix features.
  • the alphas are discovered through a mining framework that searches from a large space of operators operating on scalar, vector, or matrix operands efficiently.
  • relational domain knowledge is injected during the search, and the search is accelerated by the early stopping of invalid, also referred to as redundant, (parts of) alphas.
  • "alpha" and "prediction function" are terms used interchangeably herein; a "parent alpha" is an alpha from which further (child) alphas are generated.
  • "temporal descriptor" and "stock price" will be used interchangeably except where context dictates an alternative meaning - e.g. where "temporal descriptor" refers to the size of a population infected with a virus.
  • a method for generating alphas comprising:
  • (d) performing step (c) for a predetermined number of repetitions, M, wherein: for each repetition except a first said repetition, the population is defined based on the trained alphas of a previous repetition; and for at least one repetition, the population is defined by replacing at least one trained alpha of a previous repetition with a mutated alpha generated from another said trained alpha of the previous repetition; and (e) selecting, after completion of an Mth said repetition, the alpha with the highest fitness score.
  • Replacing at least one trained alpha may comprise replacing the trained alpha that is oldest.
  • the method may further comprise defining the parent alpha.
  • the parent alpha may be one of a predefined-alpha, an expert-designed alpha, an empty alpha or random alpha.
  • Defining the parent alpha may comprise defining the parent alpha as a two-dimensional neural network.
  • Each alpha may comprise a sequence of operators that operate on operands, the method further comprising rejecting any said alpha, or part thereof, comprising a redundant operand.
  • the method may further involve identifying the redundant operand by checking if it is an output operand of a valid operation, a valid operation being an operation performed by a valid operator.
  • the method may further involve categorising an operator as redundant if one or both of an input operand and output operand of the operator is none of: an output of the alpha; an input of the alpha; and both an output of one operation and an input of another operation of respective operators of the alpha.
  • a prediction function generator comprising: memory; and at least one processor (processor(s)), wherein the memory stores instructions that, when executed by the processor(s), cause the system to:
  • (c) for each alpha in the population: train the alpha on the training data set to generate a trained alpha; and determine a fitness score for the trained alpha;
  • step (d) perform step (c) for a predetermined number of repetitions, M, wherein: for each repetition except a first said repetition, the population is defined based on the trained alphas of a previous repetition; and for at least one repetition, the population is defined by replacing at least one trained alpha of a previous repetition with a mutated alpha generated from another said trained alpha of the previous repetition; and
  • the processor(s) may replace the at least one trained alpha by replacing the trained alpha that is oldest.
  • the fitness score may be an information coefficient based on a Pearson Correlation between a stock price predicted by the alpha and a corresponding stock price from the training data set.
  • the instructions may also cause the processor(s) to define the parent alpha.
  • the parent alpha may therefore be one of a predefined-alpha, an expert-designed alpha, an empty alpha or random alpha.
  • the processor(s) may define the parent alpha by defining the parent alpha as a two-dimensional neural network. For each task for which the stock prices form a trend, the trend may be cyclical and the stock prices cover at least one complete cycle of the trend.
  • the training data set may comprise a plurality of tasks and each alpha may therefore comprise a sequence of operators.
  • the method may include, or the instructions may further result in the processor(s), modelling a dependency between two or more said tasks based on a statistical relationship between operands, for all of the two or more tasks, of an operator in the sequence.
  • the statistical relationship may be determined at least in part by: applying the sequence of operators to each of two or more tasks of the plurality of tasks; and while applying the sequence of operators, inputting into at least one operator in the sequence other than a first operator, operands from a previous operator in the sequence from each of the two or more tasks.
  • Each alpha may comprise a sequence of operators that operate on operands.
  • the instructions may thus further cause the processor(s) to reject any said alpha, or part thereof, comprising a redundant operand.
  • the instructions may further cause the processor(s) to identify the redundant operand by checking if it is an output operand of a valid operation, a valid operation being an operation performed by a valid operator.
  • the instructions may further cause the processor(s) to categorise an operator as redundant if one or both of an input operand and output operand of the operator is none of: an output of the alpha; an input of the alpha; and both an output of one operation and an input of another operation of respective operators of the alpha.
  • For the first repetition of step (c), the population is generated based on a parent prediction function. For every other repetition, the population is generated based on the output of the immediately previous repetition - i.e. the repetition that has just completed. Also, "performing step (c) for a predetermined number of repetitions" is intended to include the first performance of step (c) as the first repetition.
  • the prediction function generator receives a training data set, the data set comprising temporal data of descriptors of one or more tasks (e.g. stock, event or virus).
  • a descriptor is understood from the underlying task and data set. For example, where the task is a "stock" or index listed on a stock exchange, a descriptor may be the price of the stock on a particular day, the moving average of close prices of the stock over a particular period (e.g. 5, 10, 20 or 30 days) with the output being the predicted return (i.e. price increase in percentage points relative to the current price).
  • a descriptor may be the number of infections (detected or total) on a particular day, or the geographical radius of spread from the location of a patient zero.
  • the alpha that is dropped between repetitions of step (c) will be the oldest alpha. This reduces the likelihood that genes of one alpha will propagate through the entire population of alphas of a later generation.
  • Each alpha may comprise a sequence of operators that operate on operands, the method further comprising rejecting any said alpha, or part thereof, comprising an invalid, also referred to as redundant, operand.
  • the methods and systems described herein may generate alphas that are weakly-correlated and high dimensional.
  • the ML methods described herein, which may be based on AutoML, can mine (i.e. generate, including producing or discovering) weakly-correlated alphas.
  • the operators used on the operands can facilitate searching for new alphas while avoiding discovering complex machine learning alphas from scratch.
  • domain knowledge can be incorporated or injected without requiring strong structural assumptions in modelling an alpha.
  • alphas evolved using embodiments of the present method have several interesting strengths: (1) they can model scalar features like a formulaic alpha or high dimensional features; (2) they can update parameters in a training stage to improve inference ability; and (3) as a simple model, they show good generalization ability with high risk-adjusted return and can be mined into a set of uncorrelated, or weakly-correlated alphas.
  • embodiments of the present method employ a pruning technique to boost search efficiency by avoiding redundant calculations.
  • Figure 1 is a flow diagram of a method for generating prediction functions
  • Figure 2 shows a further flow chart embodying the method of Figure 1;
  • Figure 3 depicts evolution and evaluation processes in generating (including producing) or mining prediction functions
  • Figure 4 illustrates an operation in which outputs of a previous operation are passed between tasks - stocks in the example shown;
  • Figure 5 illustrates validity checks performed on alphas;
  • Figures 6a to 6e illustrate evolution trajectories for fittest alphas in each round of step (c) of Figure 1;
  • Figure 7 is an exemplary computer system for performing the method of Figure 1.
  • Described below is a form of evolutionary algorithm that expands the operator space from traditional evolutionary algorithms, to facilitate searching of a novel class of prediction functions (alphas).
  • the methods constrain the alpha search space by stopping redundant operations or alphas early rather than at the end of evolution.
  • the present methods simplify the difficult goal of discovering a complex ML alpha. This simplification significantly reduces the search time for good alphas. Effectively, the goal changes from discovering a ML alpha with fixed vector features to discovering flexible alpha types, such as a formula or the novel alpha.
  • Some embodiments leverage off an AutoML-Zero framework but, as a result of the change in goal, the learn function in the formulation of AutoML-Zero changes to a parameter-updating function.
  • An alpha evolved in accordance with present teachings can be a recursive model with optional scalar, vector, or matrix features and have a parameter-updating function. This is achieved by selecting a part of the input feature matrix as scalar and vector features during alpha searching.
  • the alphas discovered using methods disclosed herein can predominantly model scalar features, and are thus simple to mine into a weakly-correlated set, yet can be high-dimensional data-driven models utilising long-term features. Such a method is shown in Figure 1.
  • the method 100 for generating prediction functions (alphas) includes:
  • Step 102 generating a population of prediction functions (alphas) based on a parent prediction function (parent alpha);
  • Step 104 receiving a training data set
  • Step 106 training each alpha based on the training set
  • Step 108 determining or calculating a fitness score for each alpha
  • Step 110 defining a population for the next iteration of steps 106 and 108.
  • Step 112 selecting an alpha.
  • Step 104 involves receiving a training data set.
  • the training data set comprises temporal descriptors of one or more tasks.
  • a temporal descriptor will be evident from the nature of the task being analysed. For example, where each task is a specific type of virus, a temporal descriptor may be the geographic spread in distance of detected infections of the virus from a patient zero (i.e. the first person to have been detected with the virus) on a particular day, or the number of detected infections on a particular day. Where each task is a particular stock traded on a stock exchange, a temporal descriptor may be the close price of the stock on a particular day.
  • the present methods can also detect trends in the movement of a task.
  • the present methods may involve an alpha that takes the cyclic movement of the price of a stock into account.
  • the data in the training data set for a particular task (i.e. stock) may accordingly cover at least one complete cycle of any such cyclical trend.
  • Steps 106 and 108 are performed on each alpha in the population generated at step 102. Then, according to step 110, steps 106 and 108 are repeated for a predetermined number of repetitions, M. For each repetition of steps 106 and 108 a new population is defined. For the first repetition of steps 106 and 108 the population is generated from a parent alpha. In some cases, the parent alpha may be pre-loaded. In general, the parent alpha will need to be defined per step 114 - shown in broken lines as it may not occur in all embodiments. The parent alpha may be a predefined alpha. In other cases the parent alpha will be an expert-designed alpha, an empty alpha, a random alpha or a neural network - e.g. a two-dimensional or two-layer neural network.
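By way of non-limiting illustration, the repetition scheme just described may be sketched as a minimal Python loop. The helpers mutate, train and fitness are hypothetical placeholders for the mutation, training and scoring steps described herein, and tournament selection of the parent follows the selection scheme described elsewhere in this disclosure:

```python
import random

def evolve(parent_alpha, train_set, M, pop_size, tournament_size,
           mutate, train, fitness):
    """Aging evolutionary search: in each of M repetitions the oldest
    alpha is dropped and replaced by a mutant of a tournament winner."""
    # Step 102: generate the initial population by mutating the parent alpha
    population = [train(mutate(parent_alpha), train_set)
                  for _ in range(pop_size)]
    for _ in range(M):
        # Fittest alpha within a randomly selected fixed-size tournament
        winner = max(random.sample(population, tournament_size), key=fitness)
        child = train(mutate(winner), train_set)  # steps 106/108 for the child
        # Step 110: drop the oldest (leftmost) alpha, append the youngest
        population = population[1:] + [child]
    # Step 112: select the alpha with the highest fitness score
    return max(population, key=fitness)
```

Keeping the population ordered oldest-first makes the aging rule (drop the oldest) a simple list slice.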
  • the population is defined by the alphas trained by a previous repetition - generally the immediately preceding repetition.
  • the population is defined based on the trained alphas trained by the fourth repetition.
  • the population defined for at least one subsequent repetition will be based on the trained alphas of the immediately preceding repetition, but one of those trained alphas will be dropped and replaced with another, new alpha.
  • the alpha that is dropped will typically be the oldest alpha.
  • the new alpha will be generated by mutating one of the trained alphas from the immediately preceding repetition. While the alpha that is dropped may also serve as the basis for mutation to produce or generate the new alpha, the new alpha will typically be generated by mutating the alpha with the highest fitness score in the previous repetition.
  • the alpha with the highest fitness score is selected per step 112.
  • FIG. 2 shows an architecture overview 200 for implementing the method 100.
  • alphas are formulated as a sequence of operations, each of which comprises an operator (OP), input operand(s) and an output operand.
  • Each alpha consists of a setup function Setup() to initialise operands, a predict function Predict() to generate a prediction, and a parameter-updating function Update() to update parameters if any parameters are generated during the evolution.
  • Predict(): the new class of alphas improves on formulaic alphas by extracting features from vectors and matrices. Additionally, by including scalars in operations, the new alphas are less complex than ML alphas.
  • Update(): operands are defined that are updated in the training stage (i.e. features) and passed to the inference stage as parameters. Unlike intermediate calculation results that are only useful for a specific prediction, these operands, as features of long-term training data, improve the alpha's inference ability.
  • a maximum number of operations and/or operands can be specified, to limit complexity.
  • an initial alpha may be selected or be otherwise defined randomly or input or received in any other manner.
  • a well-designed alpha 202 e.g. expert designed
  • validity checks are performed 212 to retain valid alphas, or valid parts of alphas - step 118 of Figure 1.
  • the alphas are then hashed into a database 214 to enable comparison to future alphas to avoid repetition - the validity checking and de-duplication processes can dramatically reduce the search space, not by simplification but by removing alphas, or parts thereof, that provide no benefit over the alphas that are retained. Note: an alpha from which an invalid, or redundant, part has been removed can still validly be referred to as an alpha, rather than a "partial alpha" or similar.
  • mutated and trained alphas are evaluated and eliminated if they are correlated - 216.
  • a new alpha 218 can be discovered with weak-correlation when compared with other alphas, and high yield - in the context of a stock, the yield will be the change in price of the stock and in the context of a bacterium, the yield may be the number of new infections or the increase in geographical spread from patient zero.
  • the new alpha 218 is then stored in the database 204.
  • an alpha is a combination of mathematical expressions, computer source code, and configuration parameters that can be used, in combination with historical data, to make predictions about future movements of stocks. While the discussion will largely centre around the application of the method 100 to the development of alphas for stock price prediction, as mentioned herein the present methods may also be used for the prediction of movements in other tasks.
  • An alpha is a sequence of operations, each of which is formed by an operator (i.e. OP), input, and output operands.
  • the operator operates on the input operand to produce the output operand.
  • OP(s1, s2) may be "s1 + s2" for a summation operator.
  • the sequence is sequentially divided into three components as mentioned above: a setup function to initialize operands, a predict function to generate a prediction, and a parameter-updating function to update parameters if any parameters are formulated in an alpha during searching.
  • We aim to search for a best alpha a* from a search space A, which is the universe of: all possible sequences of operations that can be combined to form new alphas; a hyperparameter for constraining the sequences by controlling the number of operations in each component - e.g. the number of times a particular operator or type of operator can be used in an alpha; a set of available OPs for each component; and a set of available operands for each data type (scalar, vector, and matrix).
  • s, v and m denote a scalar, vector and matrix respectively.
  • Each operand is marked with a letter to represent the data type and a number starting from 0 and less than the maximum number of operands of the data type.
  • Special operands are set before the search: the input feature matrix m0, output label s0 and prediction s1. An operand can be overwritten. Thus, only the last prediction s1 in the predict function is the final prediction.
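By way of illustration, a predict sequence over these special operands may be interpreted as below. The tuple encoding of operations and the example operators are assumptions made for this sketch only, not the claimed operator set:

```python
def run_predict(operations, m0):
    """Execute a predict sequence of (op, input_names, output_name) tuples.

    Operands live in a dict keyed by name ('s*' scalar, 'v*' vector,
    'm*' matrix).  Because operands may be overwritten, the value of
    s1 after the last write is the final prediction.
    """
    mem = {"m0": m0}
    for op, inputs, output in operations:
        mem[output] = op(*(mem[name] for name in inputs))
    return mem["s1"]

# Hypothetical example: s2 = mean of the first row of m0;
# s1 = s2 * 2; then s1 is overwritten by s1 + 1 (the final prediction).
ops = [
    (lambda m: sum(m[0]) / len(m[0]), ("m0",), "s2"),
    (lambda s: s * 2.0,               ("s2",), "s1"),
    (lambda s: s + 1.0,               ("s1",), "s1"),  # overwrites s1
]
```

Only the last write to s1 reaches the caller, mirroring the overwriting behaviour described above.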
  • the search algorithm will select OPs and operands at the mutation step.
  • An alpha is evaluated over a set of tasks FK, where K is the number of tasks (stocks).
  • Each task is a regression task for a stock, mapping an input feature matrix X ∈ R^(f×w) to a scalar return label y, where f is the number of types of features and w is the time window in days considered as input.
  • the time window w may be a full cycle - e.g. three years for a task that has a three year cycle.
  • the pair of X and y defines a sample. All samples S are split into a training set S_tr, a validation set S_v, and a test set S_te.
  • the framework 200 is based on evolutionary search. It performs an iterative selection process for the best alpha under a time budget.
  • Figure 3 sets out the process 300.
  • the process is initialised by a starting parent alpha, Ao.
  • This starting parent alpha could be a predefined alpha such as a two-layer neural network (equivalent to the predict and parameter-updating functions in a searched alpha), an expert-designed alpha (equivalent to the predict function only), or simply an empty or random alpha - e.g. the alpha may be defined at step 114.
  • the search starts by generating a population P0 based on the starting parent alpha - step 102. In the present case, P0 is generated by mutating the parent alpha.
  • Two types of mutations are performed on the parent alpha to generate a child alpha: randomising operands or OP(s) in all operations; and inserting a random operation or removing an operation at a random location of the alpha.
  • an argument or OP(s) in all operations, or in a single operation in a component function, may be replaced; or a single operation in a component function may be added or dropped.
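A minimal sketch of the two mutation types, assuming operations are encoded as (OP name, input operands, output operand) tuples; random_operation and random_operand are hypothetical generators, not part of the disclosure:

```python
import random

def mutate(alpha, random_operation, random_operand):
    """Apply one of the two mutation types to a list of
    (op_name, input_names, output_name) operation tuples."""
    child = list(alpha)  # never mutate the parent in place
    if random.random() < 0.5 and child:
        # Type 1: randomise the operands in every operation
        child = [(op, tuple(random_operand() for _ in ins), out)
                 for op, ins, out in child]
    else:
        pos = random.randrange(len(child) + 1)
        if child and random.random() < 0.5:
            # Type 2a: remove the operation at a random location
            del child[pos % len(child)]
        else:
            # Type 2b: insert a random operation at a random location
            child.insert(pos, random_operation())
    return child
```

An empty parent alpha always grows by one inserted operation, since there is nothing to randomise or remove.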
  • Each alpha of P0 is trained and evaluated on the tasks FK.
  • the evaluation yields a fitness score as shown in the right side of Figure 3.
  • the fitness score may be calculated using any appropriate method.
  • the fitness score is an information coefficient (IC) based on a Pearson Correlation between a temporal descriptor predicted by the alpha and a corresponding temporal descriptor from the training data set.
  • the Information Coefficient (IC) used as the fitness score for alpha i is calculated according to:

IC_i = (1/N) Σ_t corr(ŷ_t, y_t)    (1)

where ŷ_t is the value vector of predictions of alpha i on date t, y_t is the corresponding value vector of labels, corr is the sample Pearson Correlation, and N is the number of samples in S_v.
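A sketch of this fitness score in Python, using a hand-rolled sample Pearson correlation so the example is self-contained; predictions[t] and labels[t] are assumed to be the cross-sectional value vectors (one entry per stock) on date t:

```python
import math

def pearson(x, y):
    """Sample Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def information_coefficient(predictions, labels):
    """IC = (1/N) * sum over the N dates of corr(prediction vector,
    label vector), per equation (1)."""
    N = len(predictions)
    return sum(pearson(p, y) for p, y in zip(predictions, labels)) / N
```

A perfectly ranked day contributes +1, a perfectly inverted day -1, so the IC of an uninformative alpha hovers around zero.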
  • the population is updated for the next iteration or repetition. This may simply involve selecting all of the trained alphas from the previous, completed iteration or repetition. However, presently, after each repetition or iteration of evolution an alpha is selected and used as a parent alpha to generate child alphas by mutations to form a new population, P1, for the next iteration of evolution, and so on.
  • the alpha may be selected on any basis - for example, the alpha with the highest fitness score may be selected. In other cases, the alpha with the highest fitness score in a randomly selected set of fixed size, called the tournament, is selected as a new parent alpha - e.g.
  • alpha A3 has the highest fitness score in a tournament of P0.
  • at least one (and generally only one - e.g. the oldest) alpha is dropped from the set of alphas trained by the previous iteration or repetition of evolution.
  • P1 is generated by adding A5, mutated from A3, and eliminating A1 from P0, as shown in Figure 3.
  • search steps 302 show alphas in population P0 arranged from oldest, A1, on the left to youngest, A4, on the right. This convention is maintained for further populations P1 to P100.
  • the alpha with the highest fitness score is mutated to generate a child 306, A5.
  • the new alpha A 5 is not considered for mutation. Instead, the alpha with highest fitness score, that survived from the previous transition from one repetition of steps 106 and 108 to the next, is used for mutation.
  • that alpha, A2, is also the oldest alpha. Therefore it is mutated to generate a new alpha, A6, but is then dropped from the next population P2.
  • the alphas are evaluated (304) to determine their fitness scores.
  • Step 118 involves search space optimisation. To do this, step 118 rejects redundant alphas, or parts thereof. It may do this between successive repetitions of steps 106, 108 and 110 and/or for the first performance of step 106.
  • the present tasks FK are interconnected. This may be due to each task representing a stock that is on the stock market, or in a particular industry, or bacteria from the same family - e.g. coronaviruses. Therefore, the present method 100 needs to take into account the interrelationship or interconnectedness between tasks in order to identify weakly-correlated alphas.
  • the training data set comprises a plurality of tasks 400 and each alpha comprises a sequence of operators for moving from one step 402 to the next step 404, to step 406 and so on.
  • a dependency between two or more of the tasks 400 is modelled based on a statistical relationship between operands of an operator of the alpha.
  • One of the mechanisms to model task inter-dependency will be referred to as edge operations (RelationOps).
  • Each RelationOp calculates statistics based on input scalar operands from current task and other related tasks.
  • There are multiple types of RelationOps, including: RankOp, which outputs the ranking of the input operand calculated on the current task among those calculated on all tasks in FK; RelationRankOp, which outputs the ranking of the input operand calculated on the current task among those calculated on tasks in the same sector; and RelationDemeanOp, which calculates the difference between the input operand calculated on the current task and the mean of those calculated on the same-sector tasks.
  • An example of RelationOp is demonstrated in Figure 4.
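The three RelationOps above may be sketched as follows, assuming each task's input operand has already been reduced to a scalar; the values (task id → scalar) and sector (task id → sector id) mappings are illustrative assumptions:

```python
def rank_op(task, values):
    """RankOp: rank of the current task's value among all tasks
    (0 = smallest)."""
    return sorted(values.values()).index(values[task])

def relation_rank_op(task, values, sector):
    """RelationRankOp: rank among tasks in the same sector as `task`."""
    peers = [v for t, v in values.items() if sector[t] == sector[task]]
    return sorted(peers).index(values[task])

def relation_demean_op(task, values, sector):
    """RelationDemeanOp: value minus the mean over same-sector tasks."""
    peers = [v for t, v in values.items() if sector[t] == sector[task]]
    return values[task] - sum(peers) / len(peers)
```

Each operator consumes scalars from the current task and its related tasks, matching the cross-task inputs illustrated in Figure 4.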
  • OPs extracting a scalar feature from X and OPs extracting a vector feature from X are defined as GetScalarOps and GetVectorOps, respectively. These are examples of ExtractionOps. ExtractionOps facilitate searching for the new class of alphas and avoid discovering a complex machine learning alpha from scratch. Specifically, in the evolutionary process, the fitness score of an alpha augmented with the extracted scalar inputs is usually high among a population, and thus this alpha is more likely to survive into the next population as a parent alpha. Once an ExtractionOp is selected in a mutation step, X serves as a pool for selecting a scalar or a vector; otherwise, X is a direct input into an operation for matrix calculation.
  • the actual input of an evolved alpha can be X (i.e. a matrix) or just a scalar, column, or row feature vector of X.
  • This flexibility of input forms is enabled by our designed OPs GetScalarOp and GetVectorOp, which further expand the search space of an alpha. Since a formulaic alpha is a special case in the formulation of alpha 202 and the search space is defined with the formulation, the expanded search space incorporates the search space of the search algorithm of formulaic alphas.
  • this process guides the evolution towards the new alpha instead of a machine learning alpha that does not allow scalar inputs.
  • the method 100 involves pruning redundant operations and alphas as well as fingerprinting without evaluation.
  • the fingerprint of an alpha is built by pruning redundant operations and alphas before evaluation and transforming the strings of the remaining operations of the alpha into numbers. If the fingerprint is not matched in a cache, the alpha is evaluated to get its fitness score and is then hashed into the cache; if it is matched, the cached fitness score is reused.
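A minimal sketch of the fingerprint cache, assuming (as hedged above) that evaluation is performed only when the fingerprint is not already cached; evaluate is a hypothetical placeholder for the expensive training-and-scoring step:

```python
import hashlib

def fingerprint(operations):
    """Hash the string form of an alpha's (pruned) operation sequence,
    encoded as (op_name, input_names, output_name) tuples."""
    text = ";".join(f"{op}({','.join(ins)})->{out}"
                    for op, ins, out in operations)
    return hashlib.sha256(text.encode()).hexdigest()

def fitness_with_cache(operations, evaluate, cache):
    """Evaluate an alpha only if its fingerprint has not been seen."""
    key = fingerprint(operations)
    if key not in cache:
        cache[key] = evaluate(operations)  # expensive training/evaluation
    return cache[key]
```

Functionally duplicate alphas thus share one fitness evaluation, which is the de-duplication step shown at 214.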
  • the method 100 then involves performing redundancy pruning.
  • Redundancy pruning removes duplicate alphas, invalid or redundant alphas or parts of alphas. In so doing, the alphas and parts thereof that are ultimately evaluated are only those that are valid - in effect, this generates a minimum, complete set of valid alphas, each of which is formed from only valid components.
  • a pruning operation is performed - this involves checking if particular operators/instructions in an alpha are redundant or if the alpha is redundant as a whole. If only part of the alpha is redundant, that part is deleted from the alpha. The reason is that a redundant part of an alpha may still be able to be calculated but will have no bearing on the prediction from the alpha. Similarly, a redundant alpha cannot provide a prediction. Therefore, training and predicting using all alphas and redundant parts will yield nothing more than training and predicting using only valid alphas containing only valid operators. Thus, the two are functionally equivalent.
  • alpha is represented as a graph, with operators as edges and operands as nodes.
  • the last s1 prediction operand in the predict function is used as the root node; this s1 can be assumed to be valid since the previous s1 operands are overwritten.
  • Starting from the root node (a valid operation with the last s1 as output operand), we iteratively find the validity of each of the input operands and its operation by checking if it is an output operand from a valid operation. This iteration ends and returns true if we find a leaf operand is an input m0.
  • the operation(s) with a redundant output operand are pruned.
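The backward traversal described above can be sketched as a reachability check over the operation list; the `(output, inputs)` representation and the operand names below are illustrative assumptions, not the patented encoding:

```python
def prune(operations, root="s1", leaf="m0"):
    """Keep only operations that contribute to the last write of `root`.

    Walk backwards from the final `root`-producing operation, tracking
    which operands are still needed. Writes to an operand that were
    overwritten before being used never reach the prediction and are
    therefore redundant, so they are dropped."""
    needed = {root}
    kept = []
    for out, inputs in reversed(operations):
        if out in needed:
            kept.append((out, inputs))
            needed.discard(out)       # this write satisfies the need
            # the inputs must now be produced by some earlier operation,
            # except the alpha's input matrix m0, which is always valid
            needed.update(i for i in inputs if i != leaf)
    kept.reverse()
    return kept
```

In the example below, the first write to s1 is overwritten by the later one before it is ever read, so it is pruned, while the s4 operation survives because the final prediction depends on it.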
  • each alpha can be visualised as a sequence of operators (edges) that operate on operands (nodes) set out in a graph as shown in Figure 5.
  • the dark and light nodes are the redundant and necessary nodes, respectively, and solid nodes are m0.
  • the dashed edge is the operator with input operand calculated at the last time step.
  • a redundant operand is identified by checking if it is an output operand of a valid operation, a valid operation being an operation performed by a valid operator - redundant instructions/operators are circled in the listed instructions for each alpha 500, 502.
  • an operator may be categorised as redundant if one or both of an input operand and output operand of the operator is none of: an output of the alpha - s1; an input of the alpha - m0; and both an output of one operation and an input of another operation of respective operators of the alpha - s2, s3, s4, s5, s6.
  • the fourth operation of the Predict() component, with output s1, denoted by node s1(4), is redundant since it is overwritten by s1(8), which is used in the prediction. Node s1(4) is not an input or output of the alpha, nor is it an input of one operator and an output of another - i.e. it is redundant.
  • In the early stages of the evolutionary process, an alpha usually has more redundant operations than useful ones. These redundant operations can be pruned by the pruning techniques mentioned above. In the later stages of the evolutionary process, an alpha with no redundancy tends to be vulnerable to random mutations - e.g. deleting a random operation would invalidate - i.e. render redundant - the prediction. Consequently, in the later stages, some alphas may become redundant after random mutations and would thus be pruned.
  • Performance of the method 100 was tested, the testing being organized into three stages.
  • the method 100 was compared with existing algorithms under various settings.
  • the second stage involved ablation studies of the contribution of parameter-updating functions and optional knowledge injection.
  • studies were performed for the generated alphas.
  • each algorithm was run for five rounds and alphas having a correlation with existing alphas of larger than 15% (i.e. sample Pearson correlation) were discarded - to this end, what constitutes a "weak correlation" or "weakly-correlated" alphas may be determined by reference to a maximum threshold correlation such as 15%, 10%, 5%, 3% or another value. Constrained by the cut-off or cut-offs, the alpha with the highest Sharpe ratio and fitness score was selected from each round. It was evident that the search algorithm 100 had increasing difficulty discovering alphas with a high Sharpe ratio and a high fitness score in later rounds of evolution. The same cut-off process was applied to all algorithms under test.
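A minimal sketch of the correlation cut-off, assuming prediction series are plain lists of numbers and using the sample Pearson correlation; the function names and the default 15% threshold are illustrative:

```python
import math


def pearson(x, y):
    """Sample Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def weakly_correlated(candidate, existing, threshold=0.15):
    """True if the candidate alpha's predictions have |correlation| at or
    below the threshold with every existing alpha's predictions."""
    return all(abs(pearson(candidate, e)) <= threshold for e in existing)
```

A candidate whose predictions track an existing alpha too closely fails the check and would be discarded from the round.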
  • Table 1: mining weakly-correlated alphas given existing domain-expert-designed alphas
  • Figures 6a to 6e show the evolutionary trajectories for the best alphas in all rounds, with the x-axis being the number of candidate alphas evaluated while the y-axis is the fitness score on the test period.
  • the first four rounds show a decreasing trend in both Sharpe ratio and IC. This is expected since the accumulating cut-offs set a higher search difficulty for the same initializations.
  • This increasing difficulty is also shown in the evolutionary trajectories of the best-evolved alphas (Figures 6a to 6d): the ICs of the best-evolved alphas decrease and the fluctuations increase as the rounds progress. This trend changes when the previous best-evolved alphas are set as the initialized alphas in the last round. This shows that method 100 can still discover good alphas with all cut-offs in place, and thus has the potential to mine more good alphas.
  • the Sharpe ratio and the IC of the genetic algorithm decline very quickly, showing that it does not adapt well to the requirement of searching for weakly-correlated alphas. This is expected since the search space of the genetic algorithm is smaller and it suffers from premature convergence.
  • the method 100 can discover good alphas by leveraging a well-designed alpha. Alphas that were initialised fared better in general than those that were not initialised.
  • an evolved alpha is changed to a set of equations as shown in Figure 2 (see 218).
  • the equation set is then divided into three parts: M, P and U.
  • in the training stage, M and U serve as the predict function and the parameter-updating function, while in the inference stage the predict function comprises M and P.
  • M is used in both stages to pass parameters between stages.
  • An alpha may therefore be written as:
  • An operand is defined as a parameter for two reasons. Firstly, the operand does not change during the inference stage, so it is not an intermediate result like the other operands in the alpha during the inference stage. In some cases, the operand appears in a comparison operation and can be overwritten.
  • Such an alpha is adaptive to the data in the inference stage and convertible to the formulaic type upon overwrite. Secondly, since this operand is recursively calculated from all training samples in the training stage, it summarizes the historical information and serves as a system property (i.e. of the system of equations at inference time), and is thus a parameter.
  • the predict function in the alpha discovered by method 100 generates intermediate results to the parameter-updating function where the parameters are updated recursively.
  • parameters S4 t-2 and S2 t-2 updated by Equations 6 and 7 are passed to M as the initial values of S4 t-2 and S2 t-2 for the inference stage. These parameters affect the output operands S1t in Equation (2) and S1 t-1 in Equation (3) by initialising the input operands. Then parameter S4 t-2 is overwritten in Equation (4). The remaining parameter S2 t-2 is used in an upper bound for an expression of the temporal difference (i.e. trend) of the high prices - in particular, arcsin(S2 t-2 ) is used as a cap value for an expression of high prices (Equation 3).
  • the prediction is the fraction with a tangent of S3 t-1 as the numerator and a difference between the expressions of high prices at different time steps (i.e. features on a trend) as the denominator (Equation 2).
  • This shows that the alpha is for making trading decisions based on the trend of high prices and the historically summarized bound. This cap value will be overwritten by the recent trend of high prices (i.e. new values updated by Equations 4 and 5) once the cap value is less than the trend (Equation 3).
  • the model becomes a formulaic alpha - the formula reflected by Equations (2) to (9) for the alpha reflected in Figure 6a.
  • for the neural network alpha shown in Figure 6b, it becomes a complex formula using a relation rank operator and the high price:
  • Equation (11) shows that the alpha makes trading decisions based on the volatility of the historically updated features M2t- 2, the trend feature based on high prices S2 t-2 and the recent return S0 t-3 .
  • the parameter M1 t-2 is updated recursively with the expressions of the input features matrix (Equation 21 and Equation 22).
  • the prediction is based on the comparison between the inverse of close_price t-3 and the expression of MV30 t-4 (i.e. the moving average of the close prices over the last 30 days, calculated at t-4) and the standard deviation of M1 t-2 (Equation 20).
  • this alpha makes trading decisions based on the recent close price, the long-term trend of close prices and the volatility of the historical summarized features.
  • Method 100 avoids this issue by not setting such a strict structural assumption and thus has better performance than previous methods that assume a strict structure. Moreover, in some instances it was found that avoiding injection of domain knowledge led to better results.
  • FIG. 7 is a block diagram showing an exemplary computer device 700, in which embodiments of the invention may be practiced.
  • the computer device 700 may be a mobile computer device such as a smart phone, a wearable device, a palm-top computer, a multimedia Internet-enabled cellular telephone, an on-board computing system or any other computing system, a mobile device such as an iPhone™ manufactured by Apple™ Inc. or one manufactured by LG™, HTC™ or Samsung™, for example, or another device.
  • the mobile computer device 700 includes the following components in electronic communication via a bus 706:
  • random access memory (RAM) 708
  • transceiver component 712 that includes N transceivers
  • Although the components depicted in Figure 7 represent physical components, Figure 7 is not intended to be a hardware diagram. Thus, many of the components depicted in Figure 7 may be realized by common constructs or distributed among additional physical components. Moreover, it is certainly contemplated that other existing and yet-to-be developed physical components and architectures may be utilized to implement the functional components described with reference to Figure 7.
  • the display 702 generally operates to provide a presentation of content to a user, and may be realized by any of a variety of displays (e.g., CRT, LCD, HDMI, micro-projector and OLED displays).
  • non-volatile data storage 704 functions to store (e.g., persistently store) data and executable code.
  • the system architecture may be implemented in memory 704, or by instructions stored in memory 704 - e.g. memory 704 may be a computer readable storage medium for storing instructions that, when executed by processor(s) 710 cause the processor(s) 710 to perform the method 100 described with reference to Figure 1.
  • the non-volatile memory 704 includes bootloader code, modem software, operating system code, file system code, and code to facilitate the implementation of components well known to those of ordinary skill in the art, which are not depicted or described for simplicity.
  • the non-volatile memory 704 is realized by flash memory (e.g., NAND or ONENAND memory), but it is certainly contemplated that other memory types may be utilized as well. Although it may be possible to execute the code from the non-volatile memory 704, the executable code in the non-volatile memory 704 is typically loaded into RAM 708 and executed by one or more of the N processing components 710.
  • the N processing components 710 in connection with RAM 708 generally operate to execute the instructions stored in non-volatile memory 704.
  • the N processing components 710 may include a video processor, modem processor, DSP, graphics processing unit (GPU), and other processing components.
  • the transceiver component 712 includes N transceiver chains, which may be used for communicating with external devices via wireless networks.
  • Each of the N transceiver chains may represent a transceiver associated with a particular communication scheme.
  • each transceiver may correspond to protocols that are specific to local area networks, cellular networks (e.g., a CDMA network, a GPRS network, a UMTS networks), and other types of communication networks.
  • the system 700 of Figure 7 may be connected to any appliance 718, such as one or more cameras mounted to the vehicle, a speedometer, a weather service for updating local context, or an external database from which context can be acquired.
  • Non-transitory computer-readable medium 704 includes both computer storage medium and communication medium including any medium that facilitates transfer of a computer program from one place to another.
  • a storage medium may be any available medium that can be accessed by a computer.


Abstract

Disclosed is a method for generating prediction functions. The method includes generating a population of alphas based on a parent alpha and, for each alpha in the population, training the alpha on a training data set that includes a trend to generate a trained alpha, and determining a fitness score for the trained alpha. The above steps are repeated for a number of repetitions and, for each repetition except a first said repetition, the population is defined based on the trained alphas of a previous repetition, and for at least one repetition, the population is defined by replacing at least one trained alpha of a previous repetition with a mutated alpha generated from another said trained alpha of the previous repetition. After the last repetition, the alpha with the highest fitness score is selected.

Description

PREDICTION FUNCTION GENERATOR
Technical Field

The present invention relates, in general terms, to generating prediction functions. The present invention may be applied in various fields, from viral spread prediction to stock market analysis, by generating prediction functions having weak correlation.

Background
Prediction functions are used as models to determine how particular tasks will change over time. In this context, a "task" may be a virus and the prediction function may seek to predict the manner in which it will spread. Another common form of prediction function is used to predict the price movements of various stocks listed on a stock exchange. These prediction functions are referred to as "alphas".
Alphas relating to stocks are stock prediction models that capture trading signals in a stock market. A set of good alphas with high returns are weakly correlated to diversify risk. Existing alphas can be categorized into two classes:
(1) Formulaic alphas: these alphas are simple algebraic expressions and thus can generalize well and be mined into a weakly-correlated set. Such alphas model scalar features; and (2) Machine learning alphas: these are data-driven, being trained from data. Such alphas model high-dimensional features in vectors or matrices. However, machine learning alphas are too complex to be mined into a weakly-correlated set. Formulaic alphas are popular given the fine properties of their algebraic expressions: compactness, explicit interpretability, and good generalizability. Weak correlation is typically designed into the development of alphas, using domain knowledge.
Existing approaches to machine learning alpha discovery involve complex machine learning (ML) processes or a genetic algorithm to automatically mine a set of formulaic alphas. In the first approach, ML alphas are machine learning models generating trading signals. They can model high dimensional features and learn from training data to boost the generalization performance. However, they are complex by design and thus not easily mined into a weakly-correlated set. Further, ML alphas are designed with strong structural assumptions. For example, a model with the injection of relational domain knowledge (e.g. two stocks are in the same sector) assumes the returns of related stocks change similarly - e.g. stocks of the same industry sector. However, this assumption is less likely to hold for volatile stocks. In the genetic algorithm-based approach, discovering formulaic alphas is formulated as a symbolic regression problem. Genetic algorithms use random alphas as a starting point to avoid a specific gene dominating the population. Genetic algorithms search only formulaic alphas that utilise short-term features, and they fail to search alphas in a large space with vector and matrix operations. The search space is therefore limited to arithmetic operations, which can hinder improvement of a population of alphas initialised with a well-designed alpha. Thus, neither approach can generate a set of weakly-correlated alphas with high returns by the standard of investment.
Some other ML processes for alpha discovery are more flexible but are computationally expensive - sometimes taking weeks to process a two-layer neural network. This can be prohibitive in fast-moving stock markets or, in the viral case, for rapidly spreading viruses. However, even these ML processes cannot leverage domain knowledge since stocks, for example, are considered as independent tasks and thus cannot capture that rich source of information.
It would be desirable to provide a method and/or system for generating alphas that overcomes one or more of the abovementioned drawbacks of the prior art, or at least provides a useful alternative.
Summary
Disclosed herein is a new class of alphas that possess similar strengths as existing ML alphas. The new class of alphas can model scalar, vector, and matrix features.
The alphas are discovered through a mining framework that searches efficiently from a large space of operators operating on scalar, vector, or matrix operands. In some embodiments, relational domain knowledge is injected during the search, and the search is accelerated by early stopping of invalid - also referred to as redundant - alphas or parts of alphas.
The methods taught herein can evolve initial alphas (i.e. parent alphas) into novel alphas having a high Sharpe ratio and weak correlations. Unless context dictates otherwise, the terms "alpha" and "prediction function" will be used interchangeably herein.
Similarly, the term "temporal descriptor" and "stock price" will be used interchangeably except where context dictates an alternative meaning - e.g. where "temporal descriptor" refers to size of a population infected with a virus.
Accordingly, disclosed herein is a method for generating alphas, comprising:
(a) generating a population of alphas based on a parent alpha function;
(b) receiving a training data set comprising stock prices of one or more tasks, the stock prices for at least one task forming a trend;
(c) for each alpha in the population: training the alpha on the training data set to generate a trained alpha; and determining a fitness score for the trained alpha;
(d) performing step (c) for a predetermined number of repetitions, M, wherein: for each repetition except a first said repetition, the population is defined based on the trained alphas of a previous repetition; and for at least one repetition, the population is defined by replacing at least one trained alpha of a previous repetition with a mutated alpha generated from another said trained alpha of the previous repetition; and (e) selecting, after completion of an Mth said repetition, the alpha with highest fitness score.
Replacing at least one trained alpha may comprise replacing the trained alpha that is oldest.
The method may further comprise defining the parent alpha. The parent alpha may be one of a predefined-alpha, an expert-designed alpha, an empty alpha or random alpha.
Defining the parent alpha may comprise defining the parent alpha as a two-dimensional neural network.
Each alpha may comprise a sequence of operators that operate on operands, the method further comprising rejecting any said alpha, or part thereof, comprising a redundant operand. The method may further involve identifying the redundant operand by checking if it is an output operand of a valid operation, a valid operation being an operation performed by a valid operator. The method may further involve categorising an operator as redundant if one or both of an input operand and output operand of the operator is none of: an output of the alpha; an input of the alpha; and both an output of one operation and an input of another operation of respective operators of the alpha.
Also disclosed is a prediction function generator, comprising: memory; and at least one processor (processor(s)), wherein the memory stores instructions that, when executed by the processor(s), cause the system to:
(a) generate a population of alphas based on a parent alpha;
(b) receive a training data set comprising stock prices of one or more tasks, the stock prices for at least one task forming a trend;
(c) for each alpha in the population: train the alpha on the training data set to generate a trained alpha; and determine a fitness score for the trained alpha;
(d) perform step (c) for a predetermined number of repetitions, M, wherein: for each repetition except a first said repetition, the population is defined based on the trained alphas of a previous repetition; and for at least one repetition, the population is defined by replacing at least one trained alpha of a previous repetition with a mutated alpha generated from another said trained alpha of the previous repetition; and
(e) select, after completion of an Mth said repetition, the alpha with highest fitness score.
The processor(s) may replace the at least one trained alpha by replacing the trained alpha that is oldest.
For each trained alpha, the fitness score may be an information coefficient based on a Pearson Correlation between a stock price predicted by the alpha and a corresponding stock price from the training data set.
The instructions may also cause the processor(s) to define the parent alpha. The parent alpha may therefore be one of a predefined-alpha, an expert-designed alpha, an empty alpha or random alpha. The processor(s) may define the parent alpha by defining the parent alpha as a two-dimensional neural network. For each task for which the stock prices form a trend, the trend may be cyclical and the stock prices cover at least one complete cycle of the trend.
The training data set may comprise a plurality of tasks and each alpha may therefore comprise a sequence of operators. In such cases, the method may include, or the instructions may further result in the processor(s), modelling a dependency between two or more said tasks based on a statistical relationship between operands, for all of the two or more tasks, of an operator in the sequence. The statistical relationship may be determined at least in part by: applying the sequence of operators to each of two or more tasks of the plurality of tasks; and while applying the sequence of operators, inputting into at least one operator in the sequence other than a first operator, operands from a previous operator in the sequence from each of the two or more tasks.
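A hypothetical two-operator pipeline can illustrate the idea of inputting a previous operator's outputs from all tasks into a later operator; the per-task momentum feature and the cross-sectional rank operator below are assumptions for this sketch, not the claimed operator set:

```python
def rank_across_tasks(values):
    """Rank each task's intermediate operand against the operands of all
    tasks (0 = smallest), normalised to [0, 1]. This is the step that
    models a dependency between tasks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r / (len(values) - 1) if len(values) > 1 else 0.0
    return ranks


def predict_all(features_per_task):
    # operator 1, applied per task: a scalar feature (last price change)
    momenta = [f[-1] - f[-2] for f in features_per_task]
    # operator 2, cross-task: its inputs are operator 1's outputs for ALL
    # tasks, so each task's prediction depends on the other tasks
    return rank_across_tasks(momenta)
```

Because the rank operator consumes every task's intermediate operand at once, the resulting statistical relationship between tasks is determined while the operator sequence is being applied, rather than by a separate post-processing step.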
Each alpha may comprise a sequence of operators that operate on operands. The instructions may thus further causing the processor(s) to reject any said alpha, or part thereof, comprising a redundant operand. The instructions may further cause the processor(s) to identify the redundant operand by checking if it is an output operand of a valid operation, a valid operation being an operation performed by a valid operator. The instructions may further cause the processor(s) to categorise an operator as redundant if one or both of an input operand and output operand of the operator is none of: an output of the alpha; an input of the alpha; and both an output of one operation and an input of another operation of respective operators of the alpha.
For the first repetition of step (c) the population is generated based on a parent prediction function. For every other repetition, the population is generated based on the output of the immediately previous repetition - i.e. the repetition that has just completed. Also, "performing step (c) for a predetermined number of repetitions" is intended to include the first performance of step (c) as the first repetition.
The prediction function generator receives a training data set, the data set comprising temporal data of descriptors of one or more tasks (e.g. stock, event or virus). The definition of a "descriptor" is understood from the underlying task and data set. For example, where the task is a "stock" or index listed on a stock exchange, a descriptor may be the price of the stock on a particular day, the moving average of close prices of the stock over a particular period (e.g. 5, 10, 20 or 30 days) with the output being the predicted return (i.e. price increase in percentage points relative to the current price). Where the task is a virus, a descriptor may be the number of infections (detected or total) on a particular day, or the geographical radius of spread from the location of a patient zero.
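For illustration, descriptors such as a moving average of close prices and a percentage-return target might be computed as follows; the window lengths and the exact return definition are assumptions for this sketch:

```python
def moving_average(prices, window):
    """Moving average of close prices over a trailing window (e.g. 5, 10,
    20 or 30 days); one value per day once the window is full."""
    return [sum(prices[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(prices))]


def pct_return(prices, horizon=1):
    """Price increase over `horizon` days relative to the current price,
    in percentage points - a possible prediction target."""
    return [100.0 * (prices[i + horizon] - prices[i]) / prices[i]
            for i in range(len(prices) - horizon)]
```

The same shape of computation applies to other tasks: for a virus, `prices` would be replaced by, say, daily detected infection counts.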
In general, the alpha that is dropped between repetitions of step (c) will be the oldest alpha. This reduces the likelihood that genes of one alpha will propagate through the entire population of alphas of a later generation.
The term "at least in part" is used in this context to mean that the statistical relationship may be entirely determined by the above operations or may be at least one additional step or process.
Each alpha may comprise a sequence of operators that operate on operands, the method further comprising rejecting any said alpha, or part thereof, comprising an invalid, also referred to as redundant, operand.
Advantageously, the methods and systems described herein may generate alphas that are weakly-correlated and high dimensional.
Advantageously, the ML methods described herein, which may be based on AutoML, can mine (i.e. generate, including producing or discovering) weakly-correlated alphas. Moreover, the operators used on the operands can facilitate searching for new alphas while avoiding discovering complex machine learning alphas from scratch.
Advantageously, domain knowledge can be incorporated or injected without requiring strong structural assumptions in modelling an alpha.
Advantageously, alphas evolved using embodiments of the present method have several intriguing strengths: (1) they can model scalar features like a formulaic alpha or high dimensional features; (2) they can update parameters in a training stage to improve inference ability; and (3) as a simple model, they show good generalization ability with high risk-adjusted return and can be mined into a set of uncorrelated, or weakly-correlated alphas.
Advantageously, the present methods employ a pruning technique to boost search efficiency, by avoiding redundant calculations.
Brief description of the drawings
Embodiments of the present invention will now be described, by way of nonlimiting example, with reference to the drawings in which:
Figure 1 is a flow diagram of a method for generating prediction functions;
Figure 2 shows a further flow chart embodying the method of Figure 1;
Figure 3 depicts evolution and evaluation processes in generating (including producing) or mining prediction functions;
Figure 4 illustrates an operation in which outputs of a previous operation are passed between tasks - stocks in the example shown; Figure 5 illustrates validity checks performed on alphas;
Figures 6a to 6e illustrate evolution trajectories for fittest alphas in each round of step (c) of Figure 1; and
Figure 7 is an exemplary computer system for performing the method of Figure 1.
Detailed description
Described below is a form of evolutionary algorithm that expands the operator space from traditional evolutionary algorithms, to facilitate searching of a novel class of prediction functions (alphas). At the same time, the methods constrain the alpha search space by stopping redundant operations or alphas early rather than at the end of evolution.
The present methods simplify the difficult goal of discovering a complex ML alpha. This simplification significantly reduces the search time of good alphas. Effectively, the goal changes from discovering a ML alpha with fixed vector features to flexible alpha types as a formula or the novel alpha. Some embodiments leverage off an AutoML-Zero framework but, as a result of the change in goal, the learn function in the formulation of AutoML-Zero changes to a parameter-updating function.
An alpha evolved in accordance with present teachings can be a recursive model with optional scalar, vector, or matrix features and have a parameter-updating function. This is achieved by selecting a part of the input feature matrix as scalar and vector features during alpha searching. The alphas discovered using methods disclosed herein can predominantly model scalar features, and are thus simple to mine into a weakly-correlated set, yet can be high-dimensional data-driven models utilising long-term features. Such a method is shown in Figure 1. The method 100 for generating prediction functions (alphas), includes:
Step 102: generating a population of prediction functions (alphas) based on a parent prediction function (parent alpha);
Step 104: receiving a training data set;
Step 106: training each alpha based on the training set;
Step 108: determining or calculating a fitness score for each alpha;
Step 110: defining a population for the next iteration of steps 106 and 108; and
Step 112: selecting an alpha.
Step 104 involves receiving a training data set. The training data set comprises temporal descriptors of one or more tasks. A temporal descriptor will be evident from the nature of the task being analysed. For example, where each task is a specific type of virus, a temporal descriptor may be the geographic spread in distance of detected infections of the virus from a patient zero (i.e. the first person to have been detected with the virus) on a particular day, or the number of detected infections on a particular day. Where each task is a particular stock traded on a stock exchange, a temporal descriptor may be the close price of the stock on a particular day.
The present methods can also detect trends in the movement of a task. For example, the present methods may involve an alpha that takes the cyclic movement of the price of a stock into account. To that end, the data in the training data set for a particular task (i.e. stock) may include temporal descriptors for a full cycle of the cyclic trend, or at least sufficient temporal descriptors to enable the trend to be borne out of the data in the training data set.
Steps 106 and 108 are performed on each alpha in the population generated at step 102. Then, according to step 110, steps 106 and 108 are repeated for a predetermined number of repetitions, N. For each repetition of steps 106 and 108 a new population is defined. For the first repetition of steps 106 and 108 the population is generated from a parent alpha. In some cases, the parent alpha may be pre-loaded. In general, the parent alpha will need to be defined per step 114 - shown in broken lines as it may not occur in all embodiments. The parent alpha may be a predefined-alpha. In other cases the parent alpha will be an expert-designed alpha, an empty alpha, random alpha or a neural network - e.g. a two-dimensional or two-layer neural network.
For each subsequent repetition of steps 106 and 108 (i.e. each repetition other than the first repetition) the population is defined by the alphas trained by a previous repetition - generally the immediately preceding repetition. For example, for the fifth repetition the population is defined based on the trained alphas trained by the fourth repetition. To reduce the prevalence of a particular gene in the trained alphas, the population defined for at least one subsequent repetition will be based on the trained alphas of the immediately preceding repetition, but one of those trained alphas will be dropped and replaced with another, new alpha. The alpha that is dropped will typically be the oldest alpha. The new alpha will be generated by mutating one of the trained alphas from the immediately preceding repetition. While the alpha that is dropped may also serve as the basis for mutation to produce or generate the new alpha, the new alpha will typically be generated by mutating the alpha with the highest fitness score in the previous repetition.
After all of the N repetitions have completed, the alpha with the highest fitness score is selected per step 112.
Figure 2 shows an architecture overview 200 for implementing the method 100. Using this architecture, alphas are formulated as a sequence of operations, each of which comprises an operator (OP), input operand(s) and an output operand. Each alpha consists of a setup function Setup() to initialise operands, a predict function Predict() to generate a prediction, and a parameter-updating function Update() to update parameters if any parameters are generated during the evolution. In Predict(), the new class of alphas improves the formulaic alphas by extracting features from vectors and matrices. Additionally, by including scalars in operations the new alphas are less complex than ML alphas. In Update(), the operands that are updated in the training stage (i.e. features) and passed to the inference stage as parameters are defined. Unlike intermediate calculation results that are only useful for a specific prediction, these operands, as features of long-term training data, improve the alpha's inference ability.
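As an illustrative sketch only (the operation encoding and op names are assumptions, not the patent's data layout), the three-component formulation might be represented as ordered operation lists:

```python
# Each operation: (op_name, output_operand, input_operands).
# An alpha is three ordered operation lists: Setup(), Predict(), Update().
alpha = {
    "setup":   [("const", "s2", (0.5,))],        # initialise operands
    "predict": [("sub",   "s1", ("s2", "s3"))],  # last write to s1 is the prediction
    "update":  [("mean",  "s3", ("m0",))],       # operands carried over as parameters
}

def operation_count(a):
    """Total operations across all three components of an alpha."""
    return sum(len(ops) for ops in a.values())
```

Capping the total operation count is one way to limit alpha complexity.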
For any particular alpha, a maximum number of operations and/or operands can be specified, to limit complexity.
As mentioned above, an initial alpha may be selected or be otherwise defined randomly or input or received in any other manner. In the embodiment reflected by Figure 2, a well-designed alpha 202 (e.g. expert designed) is selected from a database 204 of alphas and is transformed (206) into an evolvable formulation (208) as described below - for each repetition of steps 106 and 108, the population may be stored in database 204. During subsequent evolution 210 (i.e. repetitions of steps 106 and 108 and the various population definition processes), validity checks are performed 212 to retain valid alphas, or valid parts of alphas - step 118 of Figure 1.
The alphas are then hashed into a database 214 to enable comparison to future alphas to avoid repetition - the validity checking and de-duplication processes can dramatically reduce the search space, not by simplification but by removing alphas, or parts thereof, that provide no benefit over the alphas that are retained. Note: an alpha from which an invalid, or redundant, part has been removed can still validly be referred to as an alpha, rather than a "partial alpha" or similar.
After the training process runs, mutated (i.e. trained) alphas are evaluated and eliminated if they are correlated - 216. As a result, a new alpha 218 can be discovered with weak-correlation when compared with other alphas, and high yield - in the context of a stock, the yield will be the change in price of the stock and in the context of a bacterium, the yield may be the number of new infections or the increase in geographical spread from patient zero.
The new alpha 218 is then stored in the database 204.
As mentioned with reference to Figure 2, the alpha is converted into a formulation that can be evolved. There are different versions of alpha definitions. For embodiments illustrated herein, an alpha is a combination of mathematical expressions, computer source code, and configuration parameters that can be used, in combination with historical data, to make predictions about future movements of stocks. While the discussion will largely centre around the application of the method 100 to the development of alphas for stock price prediction, as mentioned herein the present methods may also be used for the prediction of movements in other tasks.
Within the above definition, a search is conducted for any alpha that fits in the search space. An alpha is a sequence of operations, each of which is formed by an operator (i.e. OP), input, and output operands. The operator operates on the input operand to produce the output operand. For example, "OP(s1, s2)" may be "s1 + s2" for a summation operator.
The sequence is sequentially divided into three components as mentioned above: a setup function to initialize operands, a predict function to generate a prediction, and a parameter-updating function to update parameters if any parameters are formulated in an alpha during searching. We aim to search for a best alpha a* from a search space A, which is the universe of: all possible sequences of operations, that can be combined to form new alphas; a hyperparameter for constraining the sequences by controlling the number of operations in each component - e.g. the number of times a particular operator or type of operator can be used in an alpha; a set of available Ops for each component; and a set of available operands for each data type (scalar, vector, and matrix).
Various operators can be used for each component. For each operand, s, v and m denote a scalar, vector and matrix respectively. Taking the Predict() component of an alpha X, prior to evolving, the fifth operation s5 = s6 - s9 has an output operand s5, input operands s6 and s9, and a subtraction OP; s5, s6 and s9 are the sixth, seventh and tenth scalar operands, respectively. Each operand is marked with a letter to represent the data type and a number starting from 0 and less than the maximum number of operands of the data type. Special operands are set before search: input feature matrix m0, output label s0 and prediction s1. An operand can be overwritten. Thus, only the last prediction s1 in the predict function is the final prediction. The search algorithm will select OPs and operands at the mutation step.
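A minimal interpreter for such an operation sequence might look like the following sketch, assuming a dictionary of named operands and a small hypothetical op set; real alphas also operate on vectors and matrices:

```python
def run_predict(operations, memory):
    """Execute a Predict() sequence over a dict of named operands.
    Operands may be overwritten; the final value of s1 is the prediction."""
    ops = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
    }
    for op, out, ins in operations:
        memory[out] = ops[op](*(memory[i] for i in ins))
    return memory["s1"]

# s5 = s6 - s9, then the prediction s1 = s5 + s6.
mem = {"s6": 3.0, "s9": 1.0}
seq = [("sub", "s5", ("s6", "s9")), ("add", "s1", ("s5", "s6"))]
pred = run_predict(seq, mem)
```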
An alpha is evaluated over a set of tasks FK, where K is the number of tasks (stocks). Each task is a regression task for a stock, mapping an input feature matrix X ∈ R^(f×w) to a scalar label of return y, where f is the number of types of features and w is the time window in days we consider as input. For a cyclically varying task, the time window w may be a full cycle - e.g. three years for a task that has a three year cycle. The pair of X and y defines a sample. All samples S are split into a training set Str, a validation set Sv, and a test set Ste.
The framework 200 is based on evolutionary search. It performs an iterative selection process for the best alpha under a time budget. Figure 3 sets out the process 300. In the first iteration the process is initialised by a starting parent alpha, A0. This starting parent alpha could be a predefined alpha such as a two-layer neural network (equivalent to the predict and parameter-updating functions in our searched alpha), an expert-designed alpha (only equivalent to the predict function), or simply an empty or random alpha - e.g. the alpha may be defined at step 114. The search starts by generating a population P0 based on the starting parent alpha - step 102. In the present case, P0 is generated by mutating the parent alpha. Two types of mutations are performed on the parent alpha to generate a child alpha: randomising operands or OP(s) in all operations; and inserting a random operation or removing an operation at a random location of the alpha. Thus, to generate child alphas when forming a new population P1: an argument or OP(s) may be replaced in all operations or in a single operation of a component function; or a single operation in a component function may be added or dropped.
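The two mutation types might be sketched as below; the operation encoding, the operand pool and the inserted op are illustrative assumptions rather than the patent's implementation:

```python
import random

def mutate(operations, operands, rng):
    """Apply one of the two mutation types to an alpha's operation list:
    (1) randomise the input operands of every operation, or
    (2) insert a random operation at, or remove the operation at,
        a random position.
    Each operation is an (op_name, output, inputs) triple."""
    ops = list(operations)
    if rng.random() < 0.5:                       # type 1: randomise operands
        ops = [(op, out, tuple(rng.choice(operands) for _ in ins))
               for op, out, ins in ops]
    else:                                        # type 2: insert or remove
        pos = rng.randrange(len(ops))
        if rng.random() < 0.5 and len(ops) > 1:
            del ops[pos]
        else:
            ops.insert(pos, ("add", rng.choice(operands),
                             (rng.choice(operands), rng.choice(operands))))
    return ops

rng = random.Random(1)
parent = [("sub", "s5", ("s6", "s9")), ("add", "s1", ("s5", "s6"))]
child = mutate(parent, ["s1", "s2", "s5", "s6", "s9"], rng)
```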
Each alpha of P0 is trained and evaluated on tasks FK. The evaluation yields a fitness score as shown in the right side of Figure 3. The fitness score may be calculated using any appropriate method. Presently, for each trained alpha, the fitness score is an information coefficient (IC) based on a Pearson Correlation between a temporal descriptor predicted by the alpha and a corresponding temporal descriptor from the training data set. Presently, the Information Coefficient (IC) used as the fitness score is, for alpha i, calculated according to:

IC_i = (1/N) Σ_t corr(ŷ_i,t , y_t) (1)

where ŷ_i,t is the value vector of predictions of an alpha i on date t, y_t is the corresponding value vector of labels, corr is the sample Pearson Correlation, and N is the number of samples in Sv.
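Using only standard Python, the fitness score of equation (1) can be sketched as follows; the prediction and label vectors here are invented for illustration:

```python
def pearson(x, y):
    """Sample Pearson Correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def information_coefficient(predictions, labels):
    """Mean per-date Pearson correlation between the prediction vector and
    the label vector - equation (1), averaged over the N dates in Sv."""
    ics = [pearson(p, y) for p, y in zip(predictions, labels)]
    return sum(ics) / len(ics)

# Two dates, three stocks each: predictions are perfectly correlated with
# the labels on date 1 and perfectly anti-correlated on date 2.
preds  = [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]]
labels = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
ic = information_coefficient(preds, labels)
```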
After an iteration or repetition of training, the population is updated for the next iteration or repetition. This may simply involve selecting all of the trained alphas from the previous, completed iteration or repetition. However, presently, after each repetition or iteration of evolution an alpha is selected and used as a parent alpha to generate child alphas by mutations to form a new population, P1, for the next iteration of evolution and so on. The alpha may be selected on any basis - for example, the alpha with the highest fitness score may be selected. In other cases, the alpha with the highest fitness score in a randomly selected set of fixed size, called the tournament, is selected as a new parent alpha - e.g. alpha A3 has the highest fitness score in a tournament of P0. Similarly, at least one (and generally only one - e.g. the oldest) alpha is dropped from the set of alphas trained by the previous iteration or repetition of evolution. In this case, P1 is generated by adding A5 mutated from A3 and eliminating A1 from P0 in Figure 3. The above training and evaluation and population recreation steps are repeated until a training budget is exhausted - e.g. N repetitions. To illustrate, search steps 302 show alphas in population P0 arranged from oldest A1 on the left to youngest A4 on the right. This convention is maintained for further populations P1 to P100. To progress from one population, P0, to the next, P1, the alpha with the highest fitness score, presently A3, is mutated to generate a child 306, A5. When progressing from population P1 to population P2 the new alpha A5 is not considered for mutation. Instead, the alpha with the highest fitness score that survived from the previous transition from one repetition of steps 106 and 108 to the next is used for mutation. Presently, that alpha, A2, is also the oldest alpha. Therefore it is mutated to generate a new alpha, A6, but is then dropped from the next population P2.
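Tournament selection of the parent alpha can be sketched as below, with an illustrative (name, fitness) representation and a seeded random number generator:

```python
import random

def tournament_select(population, k, rng):
    """Pick k alphas uniformly at random and return the fittest of them.
    `population` is a list of (name, fitness) pairs."""
    contenders = rng.sample(population, k)
    return max(contenders, key=lambda a: a[1])

rng = random.Random(0)
pop = [("A1", 0.2), ("A2", 0.4), ("A3", 0.9), ("A4", 0.1)]
parent = tournament_select(pop, k=2, rng=rng)
# The winner is the fittest of a random size-2 subset of the population.
```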
To determine which alphas to use for mutation in each iteration of training, the alphas are evaluated (304) to determine their fitness scores.
Step 118 involves search space optimisation. To do this, step 118 rejects redundant alphas, or parts thereof. It may do this between successive repetitions of steps 106, 108 and 110 and/or for the first performance of step 106.
The present tasks FK are interconnected. This may be due to each task representing a stock that is on the stock market, or in a particular industry, or bacteria from the same family - e.g. coronaviruses. Therefore, the present method 100 needs to take into account the interrelationship or interconnectedness between tasks in order to identify weakly-correlated alphas. In the embodiment illustrated in Figure 4, the training data set comprises a plurality of tasks 400 and each alpha comprises a sequence of operators for moving from one step 402 to the next step 404, to step 406 and so on. A dependency between two or more of the tasks 400 is modelled based on a statistical relationship between operands of an operator of the alpha. One of the mechanisms to model task inter-dependency will be referred to as edge operations (RelationOps). Each RelationOp calculates statistics based on input scalar operands from current task and other related tasks.
For the execution of an operation with a RelationOp on a sample s(a) ∈ S where a ∈ FK, the input operands passed into the operation of a are the outputs from their corresponding operations on samples sF ⊂ S, where F is a set of related tasks. Based on the type of RelationOp, the operation calculates an output. For the example shown in Figure 4, execution of s2 = rank(s3) (404) has its input calculated as s3 = norm(m0) (402) on the samples of the same time step from all related tasks, where norm calculates the Frobenius norm of a matrix. The output operand and F are determined by the types of RelationOps.
There are multiple types of RelationOps including: RankOp outputs the ranking of the input operand calculated on s(a) among those calculated on sFK; RelationRankOp outputs the ranking of the input operand calculated on s(a) among those calculated on sFi, where Fi ⊂ FK are tasks in the same sector; RelationDemeanOp calculates the difference between the input operand calculated on s(a) and the mean of those calculated on sFi. An example of a RelationOp is demonstrated in Figure 4.
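The three RelationOps might be sketched over per-task scalar operands as follows; the task names, sector grouping and values are invented for illustration:

```python
def rank_op(value, all_values):
    """Ranking of the current task's operand among all tasks' operands."""
    return sorted(all_values).index(value) + 1

def relation_rank_op(value, sector_values):
    """Same ranking, but restricted to tasks in the same sector."""
    return sorted(sector_values).index(value) + 1

def relation_demean_op(value, sector_values):
    """Difference between the operand and the sector mean."""
    return value - sum(sector_values) / len(sector_values)

# Scalar operand s3 computed per task; stock "b" sits in sector {"a", "b"}.
s3 = {"a": 1.0, "b": 3.0, "c": 2.0, "d": 4.0}
sector = [s3["a"], s3["b"]]
r_all    = rank_op(s3["b"], list(s3.values()))  # rank among all tasks
r_sector = relation_rank_op(s3["b"], sector)    # rank within the sector
demeaned = relation_demean_op(s3["b"], sector)  # distance from sector mean
```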
OPs extracting a scalar feature from X and OPs extracting a vector feature from X are defined as GetScalarOps and GetVectorOps, respectively. These are examples of ExtractionOps. ExtractionOps facilitate searching for the new class of alphas and avoid discovering a complex machine learning alpha from scratch. Specifically, in the evolutionary process, the fitness score of an alpha augmented with the extracted scalar inputs is usually high among a population, and thus this alpha is more likely to survive into the next population as a parent alpha. Once an ExtractionOp is selected in a mutation step, X serves as a pool for selecting a scalar or a vector; otherwise, it is a direct input into an operation for matrix calculation. Thus, the actual input of an evolved alpha can be X (i.e. a matrix) or just a scalar, column, or row feature vector of X. This flexibility of input forms is enabled by our designed OPs GetScalarOp and GetVectorOp, which further expand the search space of an alpha. Since a formulaic alpha is a special case in the formulation of alpha 202 and the search space is defined with the formulation, the expanded search space incorporates the search space of the search algorithm of formulaic alphas.
Iteratively, this process guides the evolution towards the new alpha instead of a machine learning alpha that does not allow scalar inputs.
Searching for the best alpha in a large space requires efficiency. In some cases, efficiency can be increased by early stopping of repeated evaluations of an alpha having the same prediction as another alpha. However, this may still result in evaluation of all alphas regardless of whether or not they are valid. In addition, all operations may be performed in a valid alpha regardless of whether or not any particular operation is valid. These two issues are demonstrated on the left of Figure 5, which illustrates validity checks on two examples using pruning techniques disclosed herein. The dark, light, and solid nodes are redundant operands, valid or necessary operands, and the input, respectively. The edges are operators of the alpha. The dashed operators connect to an operand calculated in the previous sample - e.g. s3. In this illustration, alphas are first transformed into graphs (e.g. trees). Then connectivity between the prediction and a valid input is checked.
Evaluating all alphas and all operators regardless of whether or not they are valid negatively impacts the efficiency of alpha search because the number of tasks K is larger than in the original problem. Moreover, K cannot be reduced by approximating using a small subset of tasks. This is due to stock prediction tasks being difficult given the noisy nature of stock price data. So fingerprinting and checking fingerprints, which are post-execution methods since fingerprinting requires execution of an alpha over samples, are inefficient processes.
Therefore, the method 100 involves pruning redundant operations and alphas as well as fingerprinting without evaluation. Specifically, the fingerprint of an alpha is built by pruning redundant operations and alphas before evaluation and transforming the strings of the remaining operations of an alpha into numbers. If the fingerprint is not matched in the cache, the alpha is evaluated to get its fitness score and is then hashed into the cache; if it is matched, the cached fitness score is reused.
The method 100 then involves performing redundancy pruning. Redundancy pruning removes duplicate alphas, invalid or redundant alphas or parts of alphas. In so doing, the alphas and parts thereof that are ultimately evaluated are only those that are valid - in effect, this generates a minimum, complete set of valid alphas, each of which is formed from only valid components.
To reduce the search space, a pruning operation is performed - this involves checking if particular operators/instructions in an alpha are redundant or if the alpha is redundant as a whole. If only part of the alpha is redundant, that part is deleted from the alpha. The reason is that a redundant part of an alpha may still be able to be calculated but will have no bearing on the prediction from the alpha. Similarly, a redundant alpha cannot provide a prediction. Therefore, training and predicting using all alphas and redundant parts will yield nothing more than training and predicting using only valid alphas containing only valid operators. Thus, the two are functionally equivalent.
To perform the pruning operation, an alpha is represented as a graph, with operators as edges and operands as nodes. The last s1 prediction operand in the predict function is used as the root node; s1 can be assumed to be valid since the previous s1 operands are overwritten. Starting from the root node (a valid operation with the last s1 as output operand), we iteratively find the validity of each of the input operands and its operation by checking if it is an output operand from a valid operation. This iteration ends and returns true if we find a leaf operand that is an input m0. Finally, the operation(s) with a redundant output operand are pruned.
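The backward reachability check described above can be sketched as a traversal from the final s1 towards m0; the (output, inputs) encoding of operations is an assumption for illustration, and cross-time-step edges (shown dashed in Figure 5) are not modelled in this sketch:

```python
def prune(operations):
    """Keep only operations on a path from input m0 to the final s1.

    `operations` is an ordered list of (output, inputs) pairs; a later
    write to the same operand overwrites an earlier one."""
    producer = {}
    for idx, (out, ins) in enumerate(operations):
        producer[out] = idx            # last write to each operand wins
    needed, stack = set(), ["s1"]
    while stack:
        operand = stack.pop()
        if operand == "m0" or operand not in producer:
            continue                   # reached the input or an external operand
        idx = producer[operand]
        if idx in needed:
            continue
        needed.add(idx)
        stack.extend(operations[idx][1])
    return [op for i, op in enumerate(operations) if i in needed]

# s1 is first written via s8 but overwritten by the final s1 <- s2 <- m0,
# so the s8 chain is redundant and is pruned away.
ops = [
    ("s8", ("m0",)),   # redundant: feeds only the overwritten s1
    ("s1", ("s8",)),   # overwritten below
    ("s2", ("m0",)),
    ("s1", ("s2",)),   # final prediction
]
kept = prune(ops)
```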
Thus, each alpha can be visualised as a sequence of operators (edges) that operate on operands (nodes) set out in a graph as shown in Figure 5. The dark and light nodes are the redundant and necessary nodes, respectively, and solid nodes are m0. The dashed edge is the operator with an input operand calculated at the last time step. A redundant operand is identified by checking if it is an output operand of a valid operation, a valid operation being an operation performed by a valid operator - redundant instructions/operators are circled in the listed instructions for each alpha 500, 502. With reference to example 500, an operator may be categorised as redundant if one or both of an input operand and output operand of the operator is none of: an output of the alpha - s1; an input of the alpha - m0; and both an output of one operation and an input of another operation of respective operators of the alpha - s2, s3, s4, s5, s6. The fourth operation of the Predict() component, with output s1, denoted by node s1(4), is redundant since it is overwritten by s1(8), which is used in the prediction. Node s1(4) is not an input or output of the alpha, nor is it an input of one operator and an output of another - i.e. it is redundant as it is not required on the path from input operand m0 to prediction s1. The same applies to s8. The operation with output s8 is redundant since s8 does not contribute to the calculation of s1(8). Thus, the operators for s1(4) and s8 are redundant and can be omitted from the alpha 500 without changing its functionality.
For example 502 there is no path between prediction s1 and input m0. As a consequence, m0 is not used in the prediction and the alpha is redundant. Once a valid alpha is identified, the strings of valid operators/operations are transformed into numbers and used as a fingerprint for the alpha. The fingerprint is then hashed into the cache for use in further pruning operations in later searches of other alphas. Thus, the set of alphas generated by the method 100 can be de-duplicated by comparing hashes.
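Fingerprinting the remaining valid operations and de-duplicating alphas via a cache might be sketched as follows; the string encoding of operations is an illustrative assumption:

```python
def fingerprint(operations):
    """Stable fingerprint of an alpha: hash of its valid operation strings."""
    return hash(tuple("{}={}({})".format(out, op, ",".join(ins))
                      for op, out, ins in operations))

class FitnessCache:
    """Maps alpha fingerprints to fitness scores so that duplicate alphas
    are never re-evaluated."""
    def __init__(self):
        self._scores = {}

    def get_or_evaluate(self, operations, evaluate):
        fp = fingerprint(operations)
        if fp not in self._scores:            # cache miss: evaluate and store
            self._scores[fp] = evaluate(operations)
        return self._scores[fp]               # cache hit: reuse the score

cache = FitnessCache()
calls = []
ops = [("sub", "s1", ("s2", "s3"))]
evaluate = lambda o: calls.append(1) or 0.42   # stand-in for real evaluation
a = cache.get_or_evaluate(ops, evaluate)
b = cache.get_or_evaluate(list(ops), evaluate)  # duplicate alpha: cache hit
```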
In the early stages of the evolutionary process, an alpha usually has more redundant operations than useful ones. These redundant operations can be pruned by the pruning techniques mentioned above. In the later stages of the evolutionary process, an alpha with no redundancy tends to be vulnerable to random mutations - e.g. deleting a random operation would invalidate - i.e. render redundant - the prediction. Consequently, in the later stages, some alphas may become redundant after random mutations and would thus be pruned.
Experiments
Performance of the method 100 was tested, the testing being organized into three stages.
In the first stage, the method 100 was compared with existing algorithms under various settings. The second stage involved ablation studies of the contribution of parameter-updating functions and optional knowledge injection. In the third stage case studies were performed for the generated alphas.
In pre-processing for experiments, tasks (i.e. stocks) with insufficient samples were excluded, as were stocks that reached too low a value, since these would not be traded and therefore any alpha applying to them is unlikely to be of practical use. Each type of feature in the remaining data was normalized by its maximum value across all time steps for each stock. Various algorithms were compared, including the genetic algorithm as the baseline, the present method without initialized alphas, and the present method with initialised alphas including a domain-expert-designed alpha (see Figure 2), a randomly designed alpha with a bad initial performance, and a two-layer neural network alpha searched by AutoML-Zero. To mine a set of weakly-correlated alphas, each algorithm was run for five rounds and alphas having a correlation with existing alphas of larger than 15% (i.e. sample Pearson Correlation) were discarded - to this end, that which constitutes a "weak correlation" or alphas that are "weakly-correlated" may be determined by reference to a maximum threshold correlation such as 15%, 10%, 5%, 3% or another value. Constrained by a cut-off or cut-offs, the alpha with the highest Sharpe ratio and fitness score was selected from each round. It was evident that the search algorithm 100 had increasing difficulty discovering alphas with a high Sharpe ratio and high fitness score in later rounds of evolution. The same cut-off process was applied to all algorithms under test.
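The correlation cut-off might be sketched as a filter over prediction series, using the sample Pearson Correlation and the 15% threshold; the series values are invented for illustration:

```python
def pearson(x, y):
    """Sample Pearson Correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def is_weakly_correlated(candidate, accepted, threshold=0.15):
    """Accept a candidate alpha's prediction series only if its absolute
    correlation with every previously accepted alpha is <= threshold."""
    return all(abs(pearson(candidate, prior)) <= threshold
               for prior in accepted)

accepted = [[1.0, 2.0, 3.0, 4.0]]
ok  = is_weakly_correlated([4.0, 3.0, 2.0, 1.0], accepted)  # anti-correlated
ok2 = is_weakly_correlated([1.0, 2.0, 2.0, 1.0], accepted)  # uncorrelated
```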
The final repetition (i.e. performance of steps 106 and 108 of method 100 in Figure 1) used the previous best alphas as initialisation. Apart from the IC, the Sharpe ratio was used for stock prediction experiments, to measure risk-adjusted returns of a portfolio built based on an alpha. The result of this experimentation was the creation of alphas with nearly no correlation to a starting expert-designed alpha, as shown in Table 1.
Alpha                                           Sharpe ratio   IC (fitness)   Correlation with existing alpha
Expert-designed alpha                           4.111784       0.013159       NA
Best evolved alpha from AlphaEvolve with
initialisation as expert-designed alpha         21.323797      0.067358       0.030301
Best generated alpha by genetic algorithm       13.034052      0.048853       -0.103120

Table 1: mining weakly-correlated alpha with existing domain-expert-designed alpha
The training process proceeded for five rounds. Figures 6a to 6e show the evolutionary trajectories for the best alphas in all rounds, with the x-axis being the number of candidate alphas evaluated and the y-axis the fitness score on the test period. The first four rounds show a decreasing trend in both Sharpe ratio and IC. This is expected since the accumulative cut-offs set a higher search difficulty for the same initializations. This increasing difficulty is also shown in the evolutionary trajectories of the best-evolved alphas (Figures 6a to 6d): the ICs of the best-evolved alphas decrease and the fluctuations increase as the rounds progress. This trend changes when the previous best-evolved alphas are set as the initialized alphas in the last round, which shows that the method 100 can still discover good alphas under all cut-offs and thus demonstrates the potential of the method 100 for mining more good alphas.
When compared with the present method, the Sharpe ratio and the IC of the genetic algorithm reduce very quickly, showing that it does not adapt well to the requirement of searching for weakly-correlated alphas. This is expected since the search space of the genetic algorithm is smaller and it suffers from the premature convergence problem.
Regarding the performance of evolving an initialised alpha (e.g. a domain-expert-designed alpha, a randomly designed alpha, or a two-layer neural network) when compared with no initialisation, the method 100 can discover good alphas by leveraging a well-designed alpha. Alphas that were initialised fared better in general than those that were not initialised.
After conducting this study, an ablation study was performed on the parameter-updating function to determine its effectiveness. It was found that an evolved alpha often takes the form of a recursive model (i.e. a special case of a system of equations which can be solved sequentially). The recursion proceeds along the time steps t.
To ease readability, the original form of an evolved alpha is changed to a set of equations as shown in Figure 2 (see 218). For the study, the equation set is then divided into three parts: M, P and U. In the training stage, M and U (if any) are the predict function and the parameter-updating function, while in the inference stage the predict function comprises M and P. M is used in both stages to pass parameters between stages. An alpha may therefore be written as:
[Equation image not reproduced in the text extraction: the alpha written in terms of the components M, P and U.]
An operand is defined as a parameter for two reasons. Firstly, the operand does not change during the inference stage, so it is not an intermediate result like the other operands in the alpha during the inference stage. In some cases, the operand appears in a comparison operation and can be overwritten; such an alpha is adaptive to the data in the inference stage and convertible to the formulaic type upon overwrite. Secondly, since this operand is recursively calculated based on all training samples in the training stage, it summarizes the historical information and serves as a system property (i.e. of the system of equations at inference time) and thus a parameter. Note that in the training stage, as opposed to the predict function and learn function in a machine learning alpha, where the prediction is used in a loss function to learn the parameters, the predict function in the alpha discovered by method 100 generates intermediate results for the parameter-updating function, where the parameters are updated recursively.
For an alpha (Equations from 2 to 9), after the training stage, parameters S4t-2 and S2t-2 updated by Equations 6 and 7 are passed to M as the initial values of S4t-2 and S2t-2 for the inference stage. These parameters affect the output operands S1t in Equation (2) and S1t-1 in Equation (3) by initialising the input operands. Then parameter S4t-2 is overwritten in Equation (4). The remaining parameter S2t-2 is used in an upper bound for an expression of the temporal difference (i.e. trend) of the high prices - in particular, arcsin(S2t-2) is used as a cap value for an expression of high prices (Equation 3). The prediction is the fraction with a tangent of S3t-1 as the numerator and a difference between the expressions of high prices at different time steps (i.e. features on a trend) as the denominator (Equation 2). This shows that the alpha is for making trading decisions based on the trend of high prices and the historically summarized bound. This cap value will be overwritten by recent trend of high prices (i.e. new values updated by Equations 4 and 5) once the cap value is less than the trend (Equation 3). At this point, the model becomes a formulaic alpha - the formula reflected by Equations (2) to (9) for the alpha reflected in Figure 6a. For a neural network alpha shown in Figure 6b, it becomes a complex formula using a relation rank operator and high price:
[Equation image not reproduced in the text extraction: Equation (10), the formula for the neural network alpha of Figure 6b, using a relation rank operator and high price.]
For the alpha shown in Figure 6c (Equations from 11 to 16), the parameter M2t-2 is updated recursively with the input feature matrix (Equations 14 and 15). S2t-2 is a trend feature based on the comparison between a high price high_price_t-4 and a recursively compared feature of high price high_price_t-5 (Equations 12 and 13). Thus, in the inference stage, Equation (11) shows that the alpha makes trading decisions based on the volatility of the historically updated features M2t-2, the trend feature based on high prices S2t-2 and the recent return S0t-3.
[Equation image not reproduced in the text extraction: Equations (11) to (16) for the alpha of Figure 6c.]
For the alpha reflected in Figure 6d (Equations from 17 to 19), a lower bound of the transpose of the input feature matrix is set for an expression of the input feature matrix to recursively update the parameter M1t-2 (Equation 19). At the end of the training stage M1t-2 is passed to M and P as initial matrices (Equations 17 and 18). Then M1t-3 recursively compares with (an expression of) the input feature matrix (Equation 18). Finally, the prediction is the standard deviation of another comparison result between M1t-2 and an expression of the input feature matrix (Equation 17), showing that this alpha trades based on the volatility of the current market features capped by the historical features. Note that once M1t-2 is larger than heaviside(M0t-2, 1) (Equation 18), the whole model becomes a formula without parameters.
[Equation image not reproduced in the text extraction: Equations (17) to (19) for the alpha of Figure 6d.]
For the alpha reflected in Figure 6e (Equations from 20 to 22), the parameter M1t-2 is updated recursively with the expressions of the input feature matrix (Equations 21 and 22). The prediction is based on the comparison between the inverse of close_price_t-3 and the expression of MV30_t-4 (i.e. the moving average of the close prices over the last 30 days calculated at t-4) and the standard deviation of M1t-2 (Equation 20). Thus this alpha makes trading decisions based on the recent close price, the long-term trend of close prices and the volatility of the historically summarized features.
[Equation image not reproduced in the text extraction: Equations (20) to (22) for the alpha of Figure 6e.]
Notably, all parameter-updating functions increase the ICs in the inference stage. This proves the effectiveness of the parameter-updating functions, since the fitness score used in searching alphas in method 100 is the IC. The Sharpe ratios, however, do not in general change with the increasing ICs. This is expected because the Sharpe ratio of a portfolio depends on the top and bottom stock rankings while the IC measures rankings of all stocks. Thus, better ICs do not always lead to better Sharpe ratios.
It was found that a noisy stock market affected by rapidly-changing information is not suitable for modelling with static relational knowledge. Method 100 avoids this issue by not setting such a strict structural assumption and thus has better performance than previous methods that assume a strict structure. Moreover, in some instances it was found that avoiding injection of domain knowledge led to better results.
Figure 7 is a block diagram showing an exemplary computer device 700, in which embodiments of the invention may be practiced. The computer device 700 may be a mobile computer device such as a smart phone, a wearable device, a palm-top computer, a multimedia Internet-enabled cellular telephone, an on-board computing system or any other computing system, or a mobile device such as an iPhone™ manufactured by Apple™, Inc. or one manufactured by LG™, HTC™ or Samsung™, for example, or another device. As shown, the mobile computer device 700 includes the following components in electronic communication via a bus 706:
(a) a display 702;
(b) non-volatile (non-transitory) memory 704;
(c) random access memory ("RAM") 708;
(d) N processing components 710;
(e) a transceiver component 712 that includes N transceivers; and
(f) user controls 714.
Although the components depicted in Figure 7 represent physical components, Figure 7 is not intended to be a hardware diagram. Thus, many of the components depicted in Figure 7 may be realized by common constructs or distributed among additional physical components. Moreover, it is certainly contemplated that other existing and yet-to-be developed physical components and architectures may be utilized to implement the functional components described with reference to Figure 7.
The display 702 generally operates to provide a presentation of content to a user, and may be realized by any of a variety of displays (e.g., CRT, LCD, HDMI, micro-projector and OLED displays).
In general, the non-volatile data storage 704 (also referred to as non-volatile memory) functions to store (e.g., persistently store) data and executable code. The system architecture may be implemented in memory 704, or by instructions stored in memory 704 - e.g. memory 704 may be a computer readable storage medium for storing instructions that, when executed by processor(s) 710, cause the processor(s) 710 to perform the method 100 described with reference to Figure 1.
In some embodiments for example, the non-volatile memory 704 includes bootloader code, modem software, operating system code, file system code, and code to facilitate the implementation components, well known to those of ordinary skill in the art, which are not depicted nor described for simplicity.
In many implementations, the non-volatile memory 704 is realized by flash memory (e.g., NAND or ONENAND memory), but it is certainly contemplated that other memory types may be utilized as well. Although it may be possible to execute the code from the non-volatile memory 704, the executable code in the non-volatile memory 704 is typically loaded into RAM 708 and executed by one or more of the N processing components 710.
The N processing components 710 in connection with RAM 708 generally operate to execute the instructions stored in non-volatile memory 704. As one of ordinary skill in the art will appreciate, the N processing components 710 may include a video processor, modem processor, DSP, graphics processing unit (GPU), and other processing components.
The transceiver component 712 includes N transceiver chains, which may be used for communicating with external devices via wireless networks. Each of the N transceiver chains may represent a transceiver associated with a particular communication scheme. For example, each transceiver may correspond to protocols that are specific to local area networks, cellular networks (e.g., a CDMA network, a GPRS network, a UMTS network), and other types of communication networks.
The system 700 of Figure 7 may be connected to any appliance 718, such as one or more cameras mounted to the vehicle, a speedometer, a weather service for updating local context, or an external database from which context can be acquired.
It should be recognized that Figure 7 is merely exemplary and in one or more exemplary embodiments, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code encoded on a non-transitory computer-readable medium 704. Non-transitory computer-readable medium 704 includes both computer storage medium and communication medium including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer.
It will be appreciated that many further modifications and permutations of various aspects of the described embodiments are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.
Throughout this specification and the claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.

Claims

1. A method for generating alphas, comprising:
(a) generating a population of alphas based on a parent alpha function;
(b) receiving a training data set comprising stock prices of one or more tasks, the stock prices for at least one task forming a trend;
(c) for each alpha in the population: training the alpha on the training data set to generate a trained alpha; and determining a fitness score for the trained alpha;
(d) performing step (c) for a predetermined number of repetitions, M, wherein: for each repetition except a first said repetition, the population is defined based on the trained alphas of a previous repetition; and for at least one repetition, the population is defined by replacing at least one trained alpha of a previous repetition with a mutated alpha generated from another said trained alpha of the previous repetition; and
(e) selecting, after completion of an Mth said repetition, the alpha with highest fitness score.
2. The method of claim 1, wherein replacing at least one trained alpha comprises replacing the trained alpha that is oldest.
3. The method of claim 1 or 2, wherein, for each trained alpha, the fitness score is an information coefficient based on a Pearson Correlation between a stock price predicted by the alpha and a corresponding stock price from the training data set.
4. The method of any one of claims 1 to 3, further comprising defining the parent alpha.
5. The method of claim 4, wherein the parent alpha is one of a predefined alpha, an expert-designed alpha, an empty alpha or a random alpha.
6. The method of claim 4 or 5, wherein defining the parent alpha comprises defining the parent alpha as a two-dimensional neural network.
7. The method of any one of claims 1 to 6, wherein, for each task for which the stock prices form a trend, the trend is cyclical and the stock prices cover at least one complete cycle of the trend.
8. The method of any one of claims 1 to 7, wherein the training data set comprises a plurality of tasks and each alpha comprises a sequence of operators, the method further comprising modelling a dependency between two or more said tasks based on a statistical relationship between operands, for all of the two or more tasks, of an operator in the sequence.
9. The method of claim 8, wherein the statistical relationship is determined at least in part by: applying the sequence of operators to each of two or more tasks of the plurality of tasks; and while applying the sequence of operators, inputting into at least one operator in the sequence other than a first operator, operands from a previous operator in the sequence from each of the two or more tasks.
10. The method of any one of claims 1 to 9, wherein each alpha comprises a sequence of operators that operate on operands, the method further comprising rejecting any said alpha, or part thereof, comprising a redundant operand.
11. The method of claim 10, further comprising identifying the redundant operand by checking if it is an output operand of a valid operation, a valid operation being an operation performed by a valid operator.
12. The method of claim 11, further comprising categorising an operator as redundant if one or both of an input operand and output operand of the operator is none of: an output of the alpha; an input of the alpha; and both an output of one operation and an input of another operation of respective operators of the alpha.
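Claims 10 to 12 prune alphas that contain redundant operands. One possible reading of that check, sketched under the assumption that an alpha is a list of (operator, input operands, output operand) triples (all names and the representation are hypothetical, not from the specification):

```python
def redundant_operands(operations, alpha_inputs, alpha_outputs):
    """Return operands that are none of: an input of the alpha, an output
    of the alpha, or both the output of one operation and an input of
    another operation (i.e. a useful intermediate)."""
    produced = {out for _, _, out in operations}
    consumed = {o for _, ins, _ in operations for o in ins}
    useful = set(alpha_inputs) | set(alpha_outputs) | (produced & consumed)
    return (produced | consumed) - useful
```

An alpha (or part thereof) whose operations yield any redundant operand would be rejected during the search, shrinking the space of candidate programs that must be trained and scored.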
13. A prediction function generator, comprising: memory; and at least one processor (processor(s)), wherein the memory stores instructions that, when executed by the processor(s), cause the system to:
(a) generate a population of alphas based on a parent alpha;
(b) receive a training data set comprising stock prices of one or more tasks, the stock prices for at least one task forming a trend;
(c) for each alpha in the population: train the alpha on the training data set to generate a trained alpha; and determine a fitness score for the trained alpha;
(d) perform step (c) for a predetermined number of repetitions, M, wherein: for each repetition except a first said repetition, the population is defined based on the trained alphas of a previous repetition; and for at least one repetition, the population is defined by replacing at least one trained alpha of a previous repetition with a mutated alpha generated from another said trained alpha of the previous repetition; and
(e) select, after completion of an Mth said repetition, the alpha with highest fitness score.
14.The prediction function generator of claim 13, wherein the processor(s) replaces the at least one trained alpha by replacing the trained alpha that is oldest.
15.The prediction function generator of claim 13 or 14, wherein, for each trained alpha, the fitness score is an information coefficient based on a Pearson Correlation between a stock price predicted by the alpha and a corresponding stock price from the training data set.
16. The prediction function generator of any one of claims 13 to 15, wherein the instructions further cause the processor(s) to define the parent alpha.
17. The prediction function generator of claim 16, wherein the parent alpha is one of a predefined alpha, an expert-designed alpha, an empty alpha or a random alpha.
18. The prediction function generator of claim 16 or 17, wherein the processor(s) define the parent alpha by defining the parent alpha as a two-dimensional neural network.
19.The prediction function generator of any one of claims 13 to 18, wherein, for each task for which the stock prices form a trend, the trend is cyclical and the stock prices for the respective task cover at least one complete cycle of the trend.
20. The prediction function generator of any one of claims 13 to 19, wherein the training data set comprises a plurality of tasks and each alpha comprises a sequence of operators, the instructions further cause the processor(s) to model a dependency between two or more said tasks based on a statistical relationship between operands, for all of the two or more tasks, of an operator in the sequence.
21. The prediction function generator of claim 20, wherein the processor(s) determine the statistical relationship at least in part by: applying the sequence of operators to each of two or more tasks of the plurality of tasks; and while applying the sequence of operators, inputting into at least one operator in the sequence other than a first operator, operands from a previous operator in the sequence from each of the two or more tasks.
22. The prediction function generator of any one of claims 13 to 21, wherein each alpha comprises a sequence of operators that operate on operands, the instructions further causing the processor(s) to reject any said alpha, or part thereof, comprising a redundant operand.
23. The prediction function generator of claim 22, wherein the instructions further cause the processor(s) to identify the redundant operand by checking if it is an output operand of a valid operation, a valid operation being an operation performed by a valid operator.
24. The prediction function generator of claim 23, wherein the instructions further cause the processor(s) to categorise an operator as redundant if one or both of an input operand and output operand of the operator is none of: an output of the alpha; an input of the alpha; and both an output of one operation and an input of another operation of respective operators of the alpha.
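The search loop recited in the method and apparatus claims above can be sketched as a regularised-evolution-style procedure: seed a population from a parent alpha, score every member, for M repetitions replace the oldest member with a mutation of another member, and after the Mth repetition select the member with the highest fitness score. This is a toy illustration only; the fitness, mutation, and alpha representation below are placeholders for the trained alphas, IC fitness score, and mutations described in the specification.

```python
import random

def evolve(parent, fitness, mutate, population_size=10, repetitions=5, seed=0):
    """Toy evolutionary search: each repetition drops the oldest member
    and appends a mutation of the current best; after the final
    repetition the highest-scoring member is returned."""
    rng = random.Random(seed)
    # Initial population derived from the parent alpha.
    population = [mutate(parent, rng) for _ in range(population_size)]
    for _ in range(repetitions):
        best = max(population, key=fitness)
        population.pop(0)                     # replace the oldest trained alpha
        population.append(mutate(best, rng))  # with a mutated alpha
    return max(population, key=fitness)
```

As a toy usage, evolving a scalar "alpha" toward the maximum of a quadratic fitness: `evolve(0.0, lambda a: -(a - 3.0) ** 2, lambda a, rng: a + rng.gauss(0, 0.5))`. In the claimed method the fitness would instead be the IC of each trained alpha on the training data set.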
PCT/SG2022/050085 2021-02-25 2022-02-22 Prediction function generator WO2022182291A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10202101948X 2021-02-25
SG10202101948X 2021-02-25

Publications (1)

Publication Number Publication Date
WO2022182291A1 true WO2022182291A1 (en) 2022-09-01

Family

ID=83050168

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2022/050085 WO2022182291A1 (en) 2021-02-25 2022-02-22 Prediction function generator

Country Status (1)

Country Link
WO (1) WO2022182291A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030004903A1 (en) * 2001-01-19 2003-01-02 Matthias Kehder Process and system for developing a predictive model
US20200104737A1 (en) * 2018-09-28 2020-04-02 B.yond, Inc. Self-intelligent improvement in predictive data models
CN112381273A (en) * 2020-10-30 2021-02-19 贵州大学 Multi-target job shop energy-saving optimization method based on U-NSGA-III algorithm


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
REAL, E. ET AL.: "AutoML-Zero: Evolving Machine Learning Algorithms From Scratch", PROCEEDINGS OF THE 37TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING, vol. 119, 13 July 2020 (2020-07-13), pages 8007 - 8019, XP081616908, arXiv:2003.03384 [retrieved on 2022-03-31] *


Legal Events

Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (ref document number: 22760152; country of ref document: EP; kind code of ref document: A1)
NENP Non-entry into the national phase (ref country code: DE)
122 Ep: pct application non-entry in european phase (ref document number: 22760152; country of ref document: EP; kind code of ref document: A1)