CN111028086A - Enhanced index tracking method based on clustering and LSTM network - Google Patents


Info

Publication number
CN111028086A
CN111028086A (application CN201911169339.3A)
Authority
CN
China
Prior art keywords: data, LSTM network, long, stock, component
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911169339.3A
Other languages
Chinese (zh)
Inventor
鲍亮
张晶
宋金秋
任笑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority: CN201911169339.3A
Publication: CN111028086A
Legal status: Pending


Classifications

    • G06Q40/06 Asset management; Financial planning or analysis
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/23213 Non-hierarchical clustering using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis


Abstract

The invention discloses an enhanced index tracking method based on clustering and a long short-term memory (LSTM) network, comprising the following steps: (1) data preprocessing, including data acquisition, data cleaning, normalization, and dimensionality reduction; (2) generating a training sample set for the LSTM network; (3) constructing the LSTM network model; (4) training the LSTM network model; (5) calculating the weights of the stocks in the test set. The method overcomes the defects of overly complex models and large tracking errors in prior enhanced index tracking: the model it adopts is simple, the weights can be adjusted dynamically, and the tracking error is small.

Description

Enhanced index tracking method based on clustering and LSTM network
Technical Field
The invention belongs to the technical field of computers, and further relates to an enhanced index tracking method based on clustering and a long short-term memory (LSTM) network in the technical field of data processing. The invention can be used for enhanced index tracking.
Background
Index enhancement adds active investment on top of passive index tracking, adjusting the portfolio appropriately to pursue positive excess returns while controlling risk. An index enhancement strategy does not completely replicate the constituent stocks of the tracked index; instead, it increases the weight of stocks viewed favorably and reduces, or entirely removes, the weight of stocks viewed unfavorably. The investment goal is to obtain returns above the benchmark while still closely tracking the benchmark index and controlling active risk.
Currently used enhanced index tracking methods fall into three types. The first is rule-based: problems are solved with domain expertise and mathematical models, which requires accurate data and massive computation and is limited by factors such as non-positive-definite matrices. The second is based on heuristic algorithms, which search a space for an optimal solution; in high-dimensional spaces the search easily falls into local optima, which degrades performance. The third is learning-based, solving the problem with machine learning models such as neural network models and reinforcement learning models.
An enhanced index tracking method based on a deep attention network and reinforcement learning is disclosed in the patent document "Portfolio selection method based on deep attention network and reinforcement learning" (application number: 201910390018X, application date: 2019.05.10, application publication number: CN110223180A) filed by Beihang University. The method introduces a neural network model with an attention mechanism into the financial field, takes the Sharpe ratio as the reward function, and trains the model in a reinforcement learning framework to balance return and risk when generating portfolio selections. It also proposes a new cross-asset attention mechanism to model the correlation among different assets, and explores model interpretability in depth. Its disadvantage is that when the Sharpe ratio used as the reward function is negative, learning of the reinforcement learning model is hindered and the model becomes unstable; as a result, the error between the benchmark index and the portfolio constructed from the output weights is too large.
A stock index tracking and prediction method and system based on social network clustering is disclosed in the patent document filed by Nanjing University (application number: 2017101004662, application date: 2017.02.23, application publication number: CN106897797A). The method first collects index and constituent-stock data for the previous and current months from a third-party database and cleans the data to obtain in-sample and out-of-sample data usable for research. It then computes a distance measure from the correlation coefficients among the constituent stocks, constructs a social network among them, clusters the network with an adaptive affinity propagation clustering algorithm, and extracts the cluster center of each cluster to form a stock pool; optimal tracking of the index by the stocks in the pool determines the optimal index tracking weights. Finally, the stock pool and optimal weights obtained from in-sample training are applied to out-of-sample index tracking to obtain a predicted index. The invention also provides a stock index tracking and prediction system; the constructed stock pool has low correlation, small tracking error, and good replication stability, and tracks the index accurately. Its disadvantage is that the weights obtained from in-sample data are used directly on out-of-sample data without dynamic adjustment, so the tracking error relative to the benchmark index is too large.
Disclosure of Invention
The invention aims to provide an enhanced index tracking method based on clustering and an LSTM network to address the defects of the prior art, solving the problems of complex models, large computation, and excessive tracking error.
The idea for realizing this purpose is: preprocess the data, screen the stocks with a clustering method, construct a training data set for the long short-term memory (LSTM) network with a sliding window, train the LSTM network on this training set, and finally input the test data set into the trained LSTM network to compute the weight of each stock.
The technical scheme of the invention comprises the following steps:
(1) data preprocessing:
(1a) collecting, from a third-party database, the index point data of each trading day over 10 years and the original constituent-stock data contained in the index, with time span (1, T+L); the in-sample data lies in (1, T) and (T+1, T+L) is the out-of-sample data. The index point data has dimension (1, T+L) and the original constituent-stock data has dimension (N, P, T+L), where N is the total number of constituent stocks contained in the index, P is the total number of features of each constituent stock, P > 3, T+L is the total number of trading days in the 10 years, and ⌊·⌋ denotes the round-down (floor) operation;
(1b) traversing all constituent stocks in the original constituent-stock data, removing those that do not cover the full time length T+L, and forming the remaining stocks into constituent-stock data of dimension (M, P, T+L), where M is the total number of constituent stocks contained in that data;
(1c) normalizing all features in the constituent-stock data;
(1d) reducing the dimensionality of all features in the normalized constituent-stock data with principal component analysis (PCA), obtaining reduced data of dimension (M, 3, T+L);
(2) generating a training sample set for the long short-term memory LSTM network:
(2a) forming an initial training sample set from the reduced in-sample data and a test sample set from the reduced out-of-sample data;
(2b) taking the data of the last 120 days from the reduced initial training sample set and performing K-means clustering on each day's data to obtain constituent-stock data of dimension (Q, 3, T+L), where Q is the number of stocks occurring most frequently over the 120 days;
(2c) sliding a window of length R over the time dimension of the (Q, 3, T)-dimensional training data, obtaining data of dimension (Q, 3, R) each time, for T−R+1 groups in total, which yields the training sample set D_train of dimension (Q, 3, R, T−R+1) required for network training, where 2 < R < T;
(3) constructing the long short-term memory LSTM network model:
(3a) building a three-layer long short-term memory LSTM network whose structure is, in order: input layer, hidden layer, output layer;
(3b) setting the batch size of the LSTM network to 1 and the number of input-layer nodes to Y, with Y = Q × 3 × S + S, where × denotes multiplication and S is the number of time steps over which the LSTM network propagates forward, 1 < S < R; the output dimension of the LSTM network equals Q;
(3c) setting the activation function of the LSTM network to the hyperbolic tangent;
(3d) setting the loss function of the LSTM network model as:

f = Σ_a [ ln( Σ_b p_a^b · w_a^b / Σ_b p_{a−1}^b · w_{a−1}^b ) − ln( l_a / l_{a−1} ) ]² − Σ_a [ ln( Σ_b p_{a+1}^b · w_a^b / Σ_b p_a^b · w_a^b ) − ln( l_{a+1} / l_a ) ]

where f is the loss function, Σ is the summation operation, ln is the logarithm with base e, a indexes the days obtained in step (2c) by sliding a window of length R over the time dimension of the (Q, 3, T)-dimensional training data, b indexes the stocks in the constituent-stock data obtained in step (2b), p_a^b is the price of the b-th stock on day a, w_a^b is its weight on day a, and l_a is the index point level on day a (p_{a−1}^b, w_{a−1}^b, l_{a−1} and p_{a+1}^b, l_{a+1} are the corresponding quantities on days a−1 and a+1);
(3e) setting the optimization algorithm of the LSTM network to Adam, based on adaptive moment estimation;
(4) training the long short-term memory LSTM network model:
inputting the training sample set D_train into the LSTM network; with the parameters of step (3b) and the activation function of step (3c), propagate the LSTM network forward, and with the loss function of step (3d) and the optimization algorithm of step (3e), propagate the error backward until the loss function converges, obtaining the trained LSTM network model;
(5) calculating the weights of the stocks in the test set:
selecting, from the test sample set obtained in step (2a), the data of the same stocks as the constituent-stock data obtained in step (2b) to form the test data D_test; inputting D_test into the LSTM network with the parameters of step (3b), the activation function of step (3c), and the trained model parameters obtained in step (4); propagating the LSTM network forward and outputting the weight of each stock.
Compared with the prior art, the invention has the following advantages:
First, the invention uses principal component analysis (PCA) and K-means clustering when constructing the network's training sample set, overcoming the complex and expensive computation of prior-art models, so the method computes quickly.
Second, because the invention uses a long short-term memory LSTM network model and a sliding window when computing stock weights, it overcomes the prior art's inability to adjust weights dynamically, so the index tracking error is smaller.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The specific steps of the present invention will be further described with reference to fig. 1.
Step 1, data preprocessing.
First, collect from a third-party database the index point data of each trading day over 10 years and the original constituent-stock data contained in the index, with time span (1, T+L); the in-sample data lies in (1, T) and (T+1, T+L) is the out-of-sample data. The index point data has dimension (1, T+L) and the original constituent-stock data has dimension (N, P, T+L), where N is the total number of constituent stocks contained in the index, P is the total number of features of each constituent stock, P > 3, T+L is the total number of trading days in the 10 years, and ⌊·⌋ denotes the round-down (floor) operation.
Second, traverse all constituent stocks in the constituent-stock data, remove those that do not cover the full time length T+L, and form the remaining stocks into constituent-stock data of dimension (M, P, T+L), where M is the total number of constituent stocks contained in that data.
Third, normalize all features in the constituent-stock data according to the following formula:

x̂_i = (x_i − x_i^min) / (x_i^max − x_i^min)

where x̂_i is the normalized value of the i-th feature of all constituent stocks in the data, x_i is the value of the i-th feature before normalization, and x_i^min and x_i^max are the minimum and maximum values of the i-th feature over all constituent stocks before normalization.
Fourth, reduce the dimensionality of all features in the normalized constituent-stock data with principal component analysis (PCA) to obtain reduced data of dimension (M, 3, T+L).
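By way of illustration only (not part of the claimed invention), the normalization and PCA steps above can be sketched with NumPy and scikit-learn. The array sizes and random input are placeholders, and normalizing each feature over all stocks and days is an assumption about the normalization axis:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder dimensions: M stocks, P features, T_total trading days.
M, P, T_total = 111, 18, 2431
rng = np.random.default_rng(0)
data = rng.random((M, P, T_total))  # stand-in for cleaned constituent-stock data

# Min-max normalization per feature, computed over all stocks and days.
flat = data.transpose(1, 0, 2).reshape(P, -1)          # (P, M*T_total)
mins, maxs = flat.min(axis=1), flat.max(axis=1)
norm = (data - mins[None, :, None]) / (maxs - mins)[None, :, None]

# PCA: reduce the P features of each (stock, day) sample to 3 components.
samples = norm.transpose(0, 2, 1).reshape(-1, P)       # (M*T_total, P)
reduced = PCA(n_components=3).fit_transform(samples)
reduced = reduced.reshape(M, T_total, 3).transpose(0, 2, 1)  # (M, 3, T_total)
print(reduced.shape)  # (111, 3, 2431)
```

The reduced array can then be split along the time axis into the in-sample and out-of-sample portions described in step 2.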
And 2, generating a training sample set for memorizing the LSTM network at long time and short time.
First, form an initial training sample set from the reduced in-sample data and a test sample set from the reduced out-of-sample data.
Second, take the data of the last 120 days from the reduced initial training sample set and perform K-means clustering on each day's data to obtain constituent-stock data of dimension (Q, 3, T+L). Here K-means clustering means: for each of the selected 120 days, cluster that day's data into K clusters and select the stock closest to each of the K cluster centers, giving K candidate stocks per day; from these daily candidates, select the Q stocks occurring most frequently over the 120 days to obtain the (Q, 3, T+L)-dimensional constituent-stock data, where Q equals K.
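The daily clustering and stock selection described in this step can be sketched as follows (illustrative only; the sizes and random stand-in data are placeholders, not part of the invention):

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.default_rng(1)
M, K, DAYS = 111, 10, 120
daily = rng.random((DAYS, M, 3))  # stand-in for the last 120 days of (M, 3) data

counter = Counter()
for day in daily:
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(day)
    # The stock closest to each of the K cluster centers is a daily candidate.
    nearest = pairwise_distances_argmin(km.cluster_centers_, day)
    counter.update(nearest.tolist())

# The Q stocks appearing most often over the 120 days form the portfolio (Q = K).
Q = K
selected = [stock for stock, _ in counter.most_common(Q)]
print(len(selected))  # 10
```

The selected stock indices are then used to slice the (M, 3, T) training data down to the (Q, 3, T) data used in the next step.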
Third, slide a window of length R over the time dimension of the (Q, 3, T)-dimensional training data, obtaining data of dimension (Q, 3, R) each time, for T−R+1 groups in total, which yields the training sample set D_train of dimension (Q, 3, R, T−R+1) required for network training, where 2 < R < T.
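The sliding-window construction above can be sketched in a few lines of NumPy (illustrative only; the dimensions follow the embodiment, and the stand-in data is a placeholder):

```python
import numpy as np

# Q stocks, T in-sample days, window length R (values from the embodiment).
Q, T, R = 10, 2382, 50
train = np.arange(Q * 3 * T, dtype=float).reshape(Q, 3, T)  # stand-in training data

# Slide a window of length R along the time axis: T - R + 1 windows in total.
windows = np.stack([train[:, :, s:s + R] for s in range(T - R + 1)], axis=-1)
print(windows.shape)  # (10, 3, 50, 2333)
```

Each (Q, 3, R) slice is one training sample; stacking them gives the (Q, 3, R, T−R+1) training set D_train.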
And 3, constructing a long-term memory LSTM network model.
First, build a three-layer long short-term memory LSTM network whose structure is, in order: input layer, hidden layer, output layer.
Second, set the batch size of the LSTM network to 1 and the number of input-layer nodes to Y, with Y = Q × 3 × S + S, where × denotes multiplication and S is the number of time steps over which the LSTM network propagates forward, 1 < S < R; the output dimension of the LSTM network equals Q.
Third, set the activation function of the LSTM network to the hyperbolic tangent.
Fourth, set the loss function of the LSTM network model as:

f = Σ_a [ ln( Σ_b p_a^b · w_a^b / Σ_b p_{a−1}^b · w_{a−1}^b ) − ln( l_a / l_{a−1} ) ]² − Σ_a [ ln( Σ_b p_{a+1}^b · w_a^b / Σ_b p_a^b · w_a^b ) − ln( l_{a+1} / l_a ) ]

where f is the loss function, Σ is the summation operation, ln is the logarithm with base e, a indexes the days obtained in the third step of step 2 by sliding a window of length R over the time dimension of the (Q, 3, T)-dimensional training data, b indexes the stocks in the constituent-stock data obtained in the second step of step 2, p_a^b is the price of the b-th stock on day a, w_a^b is its weight on day a, and l_a is the index point level on day a (p_{a−1}^b, w_{a−1}^b, l_{a−1} and p_{a+1}^b, l_{a+1} are the corresponding quantities on days a−1 and a+1).
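The loss formula in the original is reproduced only as an image, so the following NumPy sketch is a hypothetical reading consistent with the symbol definitions in this step (prices p, weights w, index points l): a squared log-return tracking-error term minus a next-day excess log-return term. The exact functional form is an assumption, as are the stand-in data sizes:

```python
import numpy as np

def tracking_loss(prices, weights, index):
    """Hypothetical enhanced-index-tracking loss: squared log-return tracking
    error minus the next-day excess log-return of the current portfolio.

    prices:  (A, B) price of stock b on day a
    weights: (A, B) portfolio weight of stock b on day a
    index:   (A,)   index point level on day a
    """
    port = (prices * weights).sum(axis=1)      # portfolio value per day
    port_ret = np.log(port[1:] / port[:-1])    # portfolio log-returns
    idx_ret = np.log(index[1:] / index[:-1])   # index log-returns
    te = ((port_ret - idx_ret) ** 2).sum()     # tracking-error term (minimize)
    # Next-day return of today's weights, relative to the index (maximize).
    next_port = (prices[1:] * weights[:-1]).sum(axis=1)
    excess = (np.log(next_port / port[:-1]) - idx_ret).sum()
    return te - excess

rng = np.random.default_rng(2)
A, B = 6, 4
prices = rng.uniform(10, 20, (A, B))
weights = rng.dirichlet(np.ones(B), size=A)
index = rng.uniform(3000, 3100, A)
print(float(tracking_loss(prices, weights, index)))
```

Under this reading, a portfolio that replicates the index exactly with fixed weights yields a loss of zero, which is the behavior an enhanced tracking objective of this shape would be expected to have.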
Fifth, set the optimization algorithm of the LSTM network to Adam, based on adaptive moment estimation.
step 4, training the long-time memory LSTM network model, and collecting the training sample set DtrainInputting the parameters into the long-short time memory LSTM network, performing forward propagation on the long-short time memory LSTM network by using the parameters in the second step in the step 3 and the activation function in the third step in the step 3, and performing backward propagation on the error of the long-short time memory LSTM network by using the loss function in the fourth step in the step 3 and the optimization algorithm in the fifth step in the step 3 until the loss function converges to obtain a trained long-short time memory LSTM network model.
And 5, calculating the weight of the stocks in the test set.
Select, from the test sample set obtained in the first step of step 2, the data of the same stocks as the constituent-stock data obtained in the second step of step 2 to form the test data D_test; input D_test into the LSTM network with the parameters of the second step of step 3, the activation function of the third step of step 3, and the trained model parameters obtained in step 4; propagate the network forward and output the weight of each stock.
The invention will now be further described with reference to the following embodiment.
Step 1, data preprocessing.
In the first step, the method uses Wind software to obtain the constituent-stock data (containing 18 features) and index point data of the SSE 180 index for 2009-2018. The SSE 180 index (also called the SSE constituent index) was obtained by adjusting and renaming the original SSE 30 index of the Shanghai Stock Exchange; its sample consists of the 180 stocks most representative of the market among all A-shares. The SSE 180 contains 180 constituent stocks, 2009-2018 comprises 2431 trading days, and each stock has 18 features; after cleaning (see the second step below), 111 stocks remain, so the constituent-stock data has dimension (111, 18, 2431), the in-sample data dimension is (111, 18, 2382), and the out-of-sample data dimension is (111, 18, 49). The 18 features are listed in Table 1:
Table 1: the 18 features
Opening price | Closing price | Highest price | Lowest price | Price change | Amplitude
Trading volume | Turnover rate | P/E ratio | P/B ratio | P/S ratio | Total market value
P/CF ratio | A-share market value | Total share capital | A-share circulating capital | Average price | Turnover amount
In the second step, remove from the original constituent-stock data all stocks whose time range does not cover 2009-2018, and fill missing data with the mean; 111 stocks are finally retained, forming constituent-stock data of dimension (111, 18, 2431).
In the third step, perform principal component analysis (PCA) dimensionality reduction on the constituent-stock data to obtain data of dimension (111, 3, 2431), of which the in-sample data is (111, 3, 2382)-dimensional and the out-of-sample data is (111, 3, 49)-dimensional.
Step 2, generate a training sample set for the long short-term memory LSTM network.
In the first step, form an initial training sample set from the reduced in-sample data obtained in the third step of step 1, and a test sample set from the reduced out-of-sample data.
In the second step, select the data of the last 120 days of the initial training sample set obtained in the first step, of dimension (111, 3, 120), each day's data being (111, 3)-dimensional. Cluster each day's data with 10 clusters, select the 10 stocks closest to the cluster centers in each day's data as candidates, take the 10 stocks occurring most frequently over the 120 days as the stocks in the portfolio, and extract the selected stocks' data from the initial training sample set to obtain a training data set of dimension (10, 3, 2382).
In the third step, slide a window of length 50 over the last (time) dimension of the (10, 3, 2382)-dimensional training data set obtained in the second step, obtaining data of dimension (10, 3, 50) each time, for 2333 groups in total; this (10, 3, 50, 2333)-dimensional data is the training sample set used to train the network.
Step 3, construct the long short-term memory LSTM network model.
In the first step, build a three-layer long short-term memory LSTM network whose structure is, in order: input layer, hidden layer, output layer.
In the second step, set the batch size of the LSTM network to 1, the number of input-layer nodes to 620 (Y = Q × 3 × S + S with Q = 10 and S = 20), and the output dimension of the LSTM network to 10.
In the third step, set the activation function of the LSTM network to the hyperbolic tangent.
In the fourth step, set the loss function of the LSTM network model as:

f = Σ_a [ ln( Σ_b p_a^b · w_a^b / Σ_b p_{a−1}^b · w_{a−1}^b ) − ln( l_a / l_{a−1} ) ]² − Σ_a [ ln( Σ_b p_{a+1}^b · w_a^b / Σ_b p_a^b · w_a^b ) − ln( l_{a+1} / l_a ) ]

where f is the loss function, Σ is the summation operation, ln is the logarithm with base e, a indexes the days obtained in the third step of step 2 by sliding a window of length 50 over the time dimension of the (10, 3, 2382)-dimensional training data set, b indexes the stocks in the constituent-stock data obtained in the second step of step 2, p_a^b is the price of the b-th stock on day a, w_a^b is its weight on day a, and l_a is the index point level on day a (p_{a−1}^b, w_{a−1}^b, l_{a−1} and p_{a+1}^b, l_{a+1} are the corresponding quantities on days a−1 and a+1).
Fifth, the optimization algorithm of the LSTM network is set to Adam, the optimizer based on adaptive moment estimation.
Step 4, training the LSTM network model. The training sample set obtained in step 2 is input into the LSTM network constructed in step 3; the network is forward-propagated with the parameters of the second step of step 3 and the activation function of the third step of step 3, and its error is back-propagated with the loss function of the fourth step of step 3 and the optimization algorithm of the fifth step of step 3, until the loss function converges, yielding the trained LSTM network model.
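The train-until-convergence loop of step 4 can be sketched as below. The hidden size, the toy data, and the mean-squared-error stand-in for the patent's tracking loss (which needs the price series to evaluate) are all illustrative assumptions; the batch size of 1 and the Adam optimizer follow the text.

```python
import numpy as np
import tensorflow as tf

S, FEATS, Q = 20, 31, 10
inputs = tf.keras.Input(shape=(S, FEATS))
hidden = tf.keras.layers.LSTM(16, activation="tanh")(inputs)       # small for the demo
outputs = tf.keras.layers.Dense(Q, activation="softmax")(hidden)
model = tf.keras.Model(inputs, outputs)

# Adam, as in the patent; MSE against toy target weights stands in for the
# tracking loss, which couples the output weights with the price series.
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")

rng = np.random.default_rng(0)
x = rng.random((8, S, FEATS)).astype("float32")   # toy training windows
y = np.zeros((8, Q), dtype="float32")
y[:, 0] = 1.0                                     # toy target: all weight on stock 0

history = model.fit(x, y, batch_size=1, epochs=20, verbose=0)
losses = history.history["loss"]                  # per-epoch loss, for convergence checks
```

In practice one would monitor `losses` and stop once the change between epochs falls below a tolerance, which is the "until the loss function converges" criterion of step 4.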
Step 5, computing the weights of the stocks in the test set.
From the test sample set obtained in the first step of step 2, the data of the same stocks as the component-stock data obtained in the second step of step 2 are selected to form test data of dimension (10, 3, 49); the test data are input into the LSTM network, which is forward-propagated with the trained LSTM network model parameters obtained in step 4 to obtain the weight of each stock.
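How a (10, 3, 49)-dimensional test cube is turned into the 620-value network input is not spelled out; one consistent reading, assumed here, is that each input window stacks S = 20 consecutive days, each day contributing the 3 features of each of the 10 stocks plus that day's index point.

```python
import numpy as np

Q, S = 10, 20
feats = np.random.rand(Q, 3, 49)      # (stocks, PCA features, days) test cube
index = np.random.rand(49) + 1.0      # index point per day (illustrative)

def make_window(feats, index, t, S=20):
    """Flatten days t-S+1..t into one network input of S x (3*Q + 1) = 620
    values: per day, the 3 features of each of the Q stocks, then the index
    point.  Returns shape (S, 3*Q + 1), ready for an LSTM with S time steps."""
    days = range(t - S + 1, t + 1)
    rows = [np.concatenate([feats[:, :, d].ravel(), [index[d]]]) for d in days]
    return np.stack(rows)

x = make_window(feats, index, t=48)   # window ending on the last test day
```

Feeding `x[np.newaxis]` to the trained network then yields the 10 portfolio weights for the day after `t`.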
The effect of the invention is further illustrated by the following simulation experiment.
1. Simulation conditions:
Hardware platform: Intel(R) Core(TM) i7-8700 CPU, 3.2 GHz main frequency, 16 GB memory.
Software platform: Windows 10 operating system, Python 3.7.3, TensorFlow 1.13.
Data: all data of the SSE 180 index of the Shanghai Stock Exchange from 2009 to 2018.
2. Simulation content and result analysis:
The simulation applies the invention and five prior-art methods (deterministic policy gradient DPG, recurrent reinforcement learning RRL, deep deterministic policy gradient DDPG, genetic algorithm with recurrent reinforcement learning GA-RRL, and heuristic genetic algorithm HGA) to the SSE 180 data, i.e., the index is tracked with the SSE 180 component stocks to obtain the investment weight of each stock.
The prior-art methods used in the simulation are:
The DPG method refers to the index tracking approach based on the deterministic policy gradient proposed by Z. Jiang et al. in "A deep reinforcement learning framework for the financial portfolio management problem" (arXiv preprint arXiv:1706.10059, 2017), DPG for short.
The RRL method refers to the recurrent reinforcement learning index tracking approach proposed by D. W. Lu in "Agent inspired trading using recurrent reinforcement learning and LSTM neural networks" (arXiv preprint arXiv:1707.07338, 2017), RRL for short.
The DDPG method refers to the deep deterministic policy gradient index tracking approach proposed by Z. Liang et al. in "Adversarial deep reinforcement learning in portfolio management" (arXiv preprint arXiv:1808.09940, 2018), DDPG for short.
The GA-RRL method refers to the index tracking approach combining a genetic algorithm with recurrent reinforcement learning proposed by J. Zhang et al. in "Using a genetic algorithm to improve recurrent reinforcement learning for equity trading" (Computational Economics, vol. 47, no. 4, pp. 551-567, 2016), GA-RRL for short.
The HGA method refers to the index tracking approach based on a heuristic genetic algorithm proposed by J. E. Beasley et al. in "An evolutionary heuristic for the index tracking problem" (European Journal of Operational Research, vol. 148, no. 3, pp. 621-643, 2003), HGA for short.
The results of the six methods are evaluated with two metrics: tracking error (TE) and excess return (ER).
The TE and ER of each simulation experiment are computed with the following formulas, and all results are collected in Table 2.
$$\mathrm{TE}=\frac{1}{A}\sum_a\left|\ln\frac{\sum_b w_a^b\,p_a^b}{\sum_b w_{a-1}^b\,p_{a-1}^b}-\ln\frac{l_a}{l_{a-1}}\right|$$

$$\mathrm{ER}=\frac{1}{A}\sum_a\left(\ln\frac{\sum_b w_a^b\,p_{a+1}^b}{\sum_b w_a^b\,p_a^b}-\ln\frac{l_{a+1}}{l_a}\right)$$

where Σ denotes summation, ln is the logarithm to the natural base e, A is the number of days summed over, a is the day index, b is the stock index, $p_a^b$ and $w_a^b$ are the price and weight of the b-th stock on day a, and $l_a$ is the index point on day a (with the quantities for days a−1 and a+1 defined analogously).
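The two metrics can be computed directly from the daily weights, prices, and index points. The NumPy sketch below follows the TE/ER definitions as read above; the array shapes are assumptions.

```python
import numpy as np

def te_er(p, w, l):
    """p, w: (A, Q) daily prices and weights; l: (A,) index points.
    TE: mean absolute deviation of the portfolio log-return from the index
    log-return; ER: mean next-day log-return in excess of the index."""
    v = (w * p).sum(axis=1)                                # portfolio value per day
    diff = np.log(v[1:-1] / v[:-2]) - np.log(l[1:-1] / l[:-2])
    te = float(np.abs(diff).mean())
    nxt = (w[1:-1] * p[2:]).sum(axis=1)                    # next-day portfolio value
    er = float((np.log(nxt / v[1:-1]) - np.log(l[2:] / l[1:-1])).mean())
    return te, er

# sanity check: a portfolio that is literally the index (one asset, full
# weight, price equal to the index point) scores zero on both metrics
l = np.linspace(100.0, 120.0, 30)
te, er = te_er(l[:, None], np.ones((30, 1)), l)
```

As the surrounding text notes, a smaller TE means the portfolio tracks the index more closely (lower risk) and a larger ER means more return above the index.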
Table 2. Quantitative comparison of the invention and the prior-art methods in the simulation experiment
(Table 2 lists the tracking error TE and excess return ER of each of the six methods; the numerical values appear only in the original drawing.)
A smaller tracking error TE indicates lower portfolio risk, and a larger excess return ER indicates higher portfolio profit. Table 2 shows that, in the simulation experiment, the TE of the proposed enhanced index tracking method is smaller, and its ER larger, than those of the five prior-art methods, demonstrating that the proposed method is superior to the prior art.

Claims (3)

1. An enhanced index tracking method based on clustering and an LSTM network, characterized in that a training sample set is generated and a long short-term memory (LSTM) network model is constructed, the method comprising the following steps:
(1) data preprocessing:
(1a) collecting, from a third-party database, the index point data of every trading day over 10 years and the original component-stock data contained in the index, the time span being (1, T+L), with the in-sample data in (1, T) and the out-of-sample data in (T+1, T+L); the index point data has dimension (1, T+L) and the original component-stock data has dimension (N, P, T+L), where N is the total number of component stocks contained in the index, P is the total number of features of each component stock with P > 3, T is obtained by a round-down (floor) operation splitting the T+L days into the in-sample and out-of-sample parts, and T+L is the total number of trading days in the 10 years;
(1b) traversing all component stocks in the original component-stock data, removing those that do not cover the full time length T+L, and forming the remaining stocks into component-stock data of dimension (M, P, T+L), where M is the total number of component stocks retained;
(1c) normalizing all features in the component-stock data;
(1d) reducing the dimensionality of all features in the normalized component-stock data with principal component analysis (PCA), obtaining reduced data of dimension (M, 3, T+L);
(2) generating the training sample set for the LSTM network:
(2a) forming an initial training sample set from the reduced in-sample data and a test sample set from the reduced out-of-sample data;
(2b) taking the last 120 days of data from the reduced initial training sample set and performing K-means clustering on each day's data to obtain component-stock data of dimension (Q, 3, T+L), where Q is the number of stocks appearing most frequently over the 120 days;
(2c) sliding a window of length R over the time dimension of the (Q, 3, T)-dimensional training data, obtaining (Q, 3, R)-dimensional data each time, T−R+1 groups in total, to form the (Q, 3, R, T−R+1)-dimensional training sample set $D_{train}$ required for network training, where 2 < R < T;
(3) constructing the LSTM network model:
(3a) building a three-layer LSTM network structured in order as: input layer, hidden layer, output layer;
(3b) setting the batch size of the LSTM network to 1 and the number of input-layer nodes to Y = Q×3×S + S, where × denotes multiplication and S is the number of time steps the LSTM network propagates forward, with 1 < S < R; the output dimension of the LSTM network equals Q;
(3c) setting the activation function of the LSTM network to the hyperbolic tangent;
(3d) setting the loss function of the LSTM network model as:

$$f=\sum_a\left(\ln\frac{\sum_b w_a^b\,p_a^b}{\sum_b w_{a-1}^b\,p_{a-1}^b}-\ln\frac{l_a}{l_{a-1}}\right)^{2}-\sum_a\left(\ln\frac{\sum_b w_a^b\,p_{a+1}^b}{\sum_b w_a^b\,p_a^b}-\ln\frac{l_{a+1}}{l_a}\right)$$

where f is the loss function, Σ denotes summation, ln is the logarithm to the natural base e, a runs over the days obtained in step (2c) by sliding the window of length R over the time dimension of the (Q, 3, T)-dimensional data, b runs over the stocks in the component-stock data obtained in step (2b), $p_a^b$ and $w_a^b$ are the price and weight on day a of the b-th stock of the component-stock data obtained in step (2b), and $l_a$ is the index point on day a (with $p_{a-1}^b$, $w_{a-1}^b$, $l_{a-1}$ and $p_{a+1}^b$, $l_{a+1}$ defined analogously for days a−1 and a+1);
(3e) setting the optimization algorithm of the LSTM network to Adam, the optimizer based on adaptive moment estimation;
(4) training the LSTM network model:
inputting the training sample set $D_{train}$ into the LSTM network, forward-propagating the network with the parameters of step (3b) and the activation function of step (3c), and back-propagating the network error with the loss function of step (3d) and the optimization algorithm of step (3e) until the loss function converges, obtaining the trained LSTM network model;
(5) computing the weights of the stocks in the test set:
selecting, from the test sample set obtained in step (2a), the data of the same stocks as the component-stock data obtained in step (2b) to form the test data $D_{test}$; inputting $D_{test}$ into the LSTM network, forward-propagating the network with the parameters of step (3b), the activation function of step (3c), and the trained LSTM network model parameters obtained in step (4), and outputting the weight of each stock.
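Of the steps claimed above, the window construction of step (2c) is mechanical enough to state in code. The NumPy sketch below slides a length-R window over the time axis of the (Q, 3, T) training cube; Q = 10, T = 200, and R = 50 are illustrative values, not fixed by the claim.

```python
import numpy as np

def build_train_set(data, R):
    """data: (Q, 3, T) reduced component-stock cube.  Returns the
    (Q, 3, R, T-R+1) training set of step (2c): every length-R slice of
    the time axis, T-R+1 windows in total."""
    Q, F, T = data.shape
    windows = [data[:, :, i:i + R] for i in range(T - R + 1)]
    return np.stack(windows, axis=-1)

data = np.random.rand(10, 3, 200)     # illustrative Q = 10, T = 200
d_train = build_train_set(data, R=50)
```

The first window (index 0 on the last axis) is exactly the first R days of the cube, which is the sanity check one would run after this step.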
2. The method of claim 1, wherein the normalization of all features in the component-stock data in step (1c) is performed according to the following formula:

$$\hat{x}_i=\frac{x_i-x_i^{\min}}{x_i^{\max}-x_i^{\min}}$$

where $\hat{x}_i$ is the normalized value of the i-th feature over all component stocks in the component-stock data, $x_i$ is the value of the i-th feature of all component stocks before normalization, and $x_i^{\min}$ and $x_i^{\max}$ are the minimum and maximum of the i-th feature over all component stocks before normalization.
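Claim 2's per-feature min-max normalization can be sketched in NumPy as below; taking each feature's minimum and maximum over all stocks and all days is our reading of "all component stocks", and the data shape is illustrative.

```python
import numpy as np

def minmax_normalize(data):
    """data: (M, P, T) component-stock cube (stocks, features, days).
    Each feature i is rescaled to [0, 1] by its minimum and maximum over
    all stocks and days: (x_i - x_i_min) / (x_i_max - x_i_min)."""
    mn = data.min(axis=(0, 2), keepdims=True)   # per-feature minimum
    mx = data.max(axis=(0, 2), keepdims=True)   # per-feature maximum
    return (data - mn) / (mx - mn)

norm = minmax_normalize(np.random.rand(30, 5, 100) * 50.0)
```

Features on very different scales (prices versus volumes, say) end up in the common range [0, 1], which is the point of this step before PCA.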
3. The enhanced index tracking method based on clustering and an LSTM network of claim 1, wherein the K-means clustering in step (2b) performs, for each of the selected 120 days, K-means clustering of that day's data, selects the stock nearest each of the K cluster centres, P stocks per day, and then selects, from the stocks picked each day, the Q stocks appearing most frequently over the 120 days, obtaining the (Q, 3, T+L)-dimensional component-stock data, where P and Q both equal K.
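The selection rule of claim 3 — cluster each day's stocks, keep the stock nearest each centre, then keep the most frequent picks over the window — can be sketched as follows. The tiny K-means here is a stand-in (the claim does not fix an implementation), and the data dimensions are illustrative.

```python
import numpy as np
from collections import Counter

def kmeans(X, k, iters=50, seed=0):
    """Minimal K-means: X is (n_points, n_features); returns k centres."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(0)   # recompute non-empty centres
    return C

def select_stocks(daily_feats, k):
    """daily_feats: (days, M, 3) features of the M candidate stocks per day.
    Per day: cluster into k groups and keep the stock nearest each centre;
    overall: keep the (at most) k stocks that appear most often."""
    votes = Counter()
    for X in daily_feats:
        C = kmeans(X, k)
        nearest = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(0)
        votes.update(nearest.tolist())
    return [stock for stock, _ in votes.most_common(k)]

rng = np.random.default_rng(1)
selected = select_stocks(rng.random((10, 30, 3)), k=5)   # 10 days, 30 stocks
```

In the patent's setting the per-day window is 120 days and k = Q = 10; the frequency vote makes the final constituent list stable against day-to-day clustering noise.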
CN201911169339.3A 2019-11-26 2019-11-26 Enhanced index tracking method based on clustering and LSTM network Pending CN111028086A (en)


Publications (1)

Publication Number Publication Date
CN111028086A true CN111028086A (en) 2020-04-17


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950704A (en) * 2020-08-07 2020-11-17 Harbin Institute of Technology Atmospheric temperature data generation method based on merging long-time and short-time memory networks
CN111950704B (en) * 2020-08-07 2022-11-29 Harbin Institute of Technology Atmospheric temperature data generation method based on merging long-time and short-time memory networks
CN112884576A (en) * 2021-02-02 2021-06-01 Shanghai Kafang Information Technology Co., Ltd. Stock trading method based on reinforcement learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200417