CN111028086A - Enhanced index tracking method based on clustering and LSTM network - Google Patents
Enhanced index tracking method based on clustering and LSTM network
- Publication number: CN111028086A
- Application number: CN201911169339.3A
- Authority: CN (China)
- Legal status: Pending (the status listed is an assumption and is not a legal conclusion)
Classifications
- G06Q40/06—Asset management; financial planning or analysis
- G06F18/213—Feature extraction, e.g. by transforming the feature space
- G06F18/214—Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/23213—Non-hierarchical clustering techniques with a fixed number of clusters, e.g. K-means clustering
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
Abstract
The invention discloses an enhanced index tracking method based on clustering and an LSTM network, comprising the following steps: (1) data preprocessing, including data acquisition, data cleaning, normalization, and dimensionality reduction; (2) generating a training sample set for the long short-term memory (LSTM) network; (3) constructing the LSTM network model; (4) training the LSTM network model; (5) calculating the weights of the stocks in the test set. The method overcomes the overly complex models and large tracking errors of prior-art enhanced index tracking: the adopted model is simple, the weights can be adjusted dynamically, and the tracking error is small.
Description
Technical Field
The invention belongs to the field of computer technology, and further relates to an enhanced index tracking method, based on clustering and a long short-term memory (LSTM) network, in the field of data processing. The invention can be used for enhanced index tracking.
Background
Index enhancement adds active investment on top of passive index tracking: the portfolio is adjusted moderately, striving for positive excess returns while controlling risk. An index enhancement strategy does not fully replicate the constituent stocks of the tracked index; instead it overweights favored stocks and underweights, or entirely removes, unfavored ones. Overall, the investment goal is to achieve excess returns and control active risk, obtaining returns above the benchmark while closely tracking the benchmark index.
Currently used enhanced index tracking methods fall into three categories. The first is rule-based: problems are solved with expert knowledge and various mathematical models, which require accurate data and massive computation and are constrained by factors such as non-positive-definite matrices. The second is based on heuristic algorithms, which search a space for an optimal solution; in high-dimensional spaces the search easily falls into local optima, which degrades performance to some extent. The third is learning-based, solving the problem with machine learning models such as various network models and reinforcement learning models.
An enhanced index tracking method based on a deep attention network and reinforcement learning is disclosed in the patent document "Portfolio selection method based on deep attention network and reinforcement learning" (application number: 201910390018X, application date: 2019.05.10, publication number: CN110223180A) filed by Beihang University. The method introduces a neural network model incorporating an attention mechanism into the financial field, takes the Sharpe ratio as the reward function, and trains the model in a reinforcement learning framework to balance return and risk when generating portfolio selections. It also proposes a novel cross-asset attention mechanism to model the correlations among different assets, and explores model interpretability in depth. Its disadvantage is that when the Sharpe ratio is negative, using it as the reward function hinders learning and makes the reinforcement learning model unstable, so the error between the benchmark index and the portfolio constructed from the weights output by the model becomes too large.
A stock index tracking and prediction method and system based on social network clustering is disclosed in a patent document filed by Nanjing University (application number: 2017101004662, application date: 2017.02.23, publication number: CN106897797A). The method first collects index and constituent stock data for the previous and current months from a third-party database and cleans the data to obtain in-sample and out-of-sample data usable for research. It then computes a distance measure from the correlation coefficients among the constituent stocks, builds a social network over them, clusters the network with an adaptive affinity propagation clustering algorithm, and extracts the center of each cluster to form a stock pool; optimal tracking of the index by the stocks in the pool determines the optimal index-tracking weights. Finally, the stock pool and optimal weights trained on the in-sample data are applied to index tracking on the out-of-sample data to obtain the predicted index. The patent also provides a stock index tracking and prediction system; the constructed stock pool has low correlation, small tracking error, and stable replication results, and tracks the index accurately. Its disadvantage is that the weights obtained from the in-sample data are applied directly to the out-of-sample data without dynamic adjustment, so the tracking error relative to the benchmark index is too large.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide an enhanced index tracking method based on clustering and an LSTM network, to solve the problems of complex models, heavy computation, and excessive tracking error.
The idea for realizing this aim is as follows: the data is preprocessed and then screened with a clustering method; a training data set for the long short-term memory (LSTM) network is constructed with a sliding window; the LSTM network is trained on that data set; finally, the test data set is input into the trained LSTM network to compute the weight of each stock.
The technical scheme of the invention comprises the following steps:
(1) data preprocessing:
(1a) collecting, from a third-party database, the index point data of each trading day over 10 years and the original constituent stock data contained in the index. The time span is (1, T + L), with the in-sample data in (1, T) and the out-of-sample data in (T + 1, T + L). The index point data has dimension (1, T + L) and the original constituent stock data has dimension (N, P, T + L), where N is the total number of constituent stocks contained in the index, P is the total number of features of each stock, P > 3, T + L is the total number of trading days in the 10 years, and ⌊·⌋ denotes the floor (rounding-down) operation;
(1b) traversing all stocks in the original constituent stock data, removing those that do not cover the full time length T + L, and forming the remaining stocks into constituent stock data of dimension (M, P, T + L), where M is the total number of stocks contained in that data;
(1c) normalizing all features in the constituent stock data;
(1d) reducing the dimensionality of all features in the normalized constituent stock data with principal component analysis (PCA), obtaining reduced data of dimension (M, 3, T + L);
(2) generating a training sample set for the long short-term memory LSTM network:
(2a) forming an initial training sample set from the reduced in-sample data, and a test sample set from the reduced out-of-sample data;
(2b) taking the data of the last 120 days from the reduced initial training sample set and performing K-means clustering on each day's data, obtaining constituent stock data of dimension (Q, 3, T + L), where Q is the number of stocks selected most frequently over the 120 days;
(2c) sliding a window of length R over the time dimension of the (Q, 3, T)-dimensional training data, obtaining (Q, 3, R)-dimensional data each time, for T − R + 1 groups in total, yielding the training sample set D_train of dimension (Q, 3, R, T − R + 1) required for network training, where 2 < R < T;
(3) constructing the long short-term memory LSTM network model:
(3a) building a three-layer long short-term memory LSTM network whose structure is, in order: an input layer, a hidden layer, an output layer;
(3b) setting the batch size of the LSTM network to 1 and the number of input-layer nodes to Y, where Y = Q × 3 × S + S, × denotes multiplication, S is the number of delay time steps of forward propagation of the LSTM network, 1 < S < R, and the output dimension of the LSTM network equals Q;
(3c) setting the activation function of the LSTM network to the hyperbolic tangent activation function;
(3d) setting the loss function f of the LSTM network model over the following quantities: Σ denotes summation and ln the logarithm to the natural base e; a indexes the days obtained in step (2c) by sliding a window of length R over the time dimension of the (Q, 3, T)-dimensional training data, and b indexes the stocks in the constituent stock data obtained in step (2b); p_a^b and w_a^b denote the price and weight of stock b on day a, and l_a the index point on day a; likewise p_{a-1}^b, w_{a-1}^b and l_{a-1} refer to day a − 1, and p_{a+1}^b and l_{a+1} to day a + 1;
(3e) setting the optimization algorithm of the LSTM network to Adam, the optimizer based on adaptive moment estimation;
(4) training the long short-term memory LSTM network model:
The training sample set D_train is input into the LSTM network; forward propagation is performed with the parameters of step (3b) and the activation function of step (3c), and the error of the network is back-propagated with the loss function of step (3d) and the optimization algorithm of step (3e) until the loss function converges, yielding the trained LSTM network model;
(5) calculating the weight of the stock in the test set:
selecting the data of the stocks with the same composition stock data as the data of the composition stock obtained in the step (2b) from the test sample set obtained in the step (2a) to form test data DtestTest data DtestInputting the parameters in the step (3b), the activation function in the step (3c) and the trained long-short memory LSTM network model parameters obtained in the step (4) into a long-short memory LSTM network, carrying out forward propagation on the long-short memory LSTM network, and outputting the weight of each stock.
Compared with the prior art, the invention has the following advantages:
First, the invention uses principal component analysis (PCA) and K-means clustering when constructing the network's training sample set, overcoming the complex and computation-heavy models of the prior art, so the method computes quickly.
Second, because the invention uses a long short-term memory LSTM network model and a sliding window when computing the stock weights, it overcomes the prior art's inability to adjust weights dynamically, so the index tracking error is smaller.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The specific steps of the present invention will be further described with reference to fig. 1.
Step 1, data preprocessing.
In the first step, the index point data of each trading day over 10 years and the original constituent stock data contained in the index are collected from a third-party database. The time span is (1, T + L), with the in-sample data in (1, T) and the out-of-sample data in (T + 1, T + L). The index point data has dimension (1, T + L) and the original constituent stock data has dimension (N, P, T + L), where N is the total number of constituent stocks contained in the index, P is the total number of features of each stock, P > 3, T + L is the total number of trading days in the 10 years, and ⌊·⌋ denotes the floor (rounding-down) operation.
In the second step, all stocks in the constituent stock data are traversed, those that do not cover the full time length T + L are removed, and the remaining stocks form constituent stock data of dimension (M, P, T + L), where M is the total number of stocks contained in that data.
Third, all features in the constituent stock data are normalized according to the min-max formula

x̂_i = (x_i − x_i^min) / (x_i^max − x_i^min)

where x̂_i is the normalized value of the i-th feature of all constituent stocks, x_i is the value of the i-th feature before normalization, x_i^min is the minimum of the i-th feature over all constituent stocks before normalization, and x_i^max is the corresponding maximum.
Fourth, the dimensionality of all features in the normalized constituent stock data is reduced with principal component analysis (PCA), obtaining reduced data of dimension (M, 3, T + L).
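A minimal sketch of the normalization and PCA steps above, on random placeholder data with toy array sizes (the patent's real data are the SSE 180 features); PCA is implemented directly via SVD rather than through any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)
M, P, TL = 5, 18, 100                      # toy sizes: stocks, features, trading days
data = rng.random((M, P, TL))              # placeholder for (M, P, T + L) stock data

# Step 1(c): min-max normalization of each feature over all stocks and days
mins = data.min(axis=(0, 2), keepdims=True)
maxs = data.max(axis=(0, 2), keepdims=True)
norm = (data - mins) / (maxs - mins)

# Step 1(d): PCA via SVD, treating each (stock, day) pair as one P-feature sample
flat = norm.transpose(0, 2, 1).reshape(-1, P)      # (M*TL, P)
centered = flat - flat.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
reduced = (centered @ Vt[:3].T).reshape(M, TL, 3).transpose(0, 2, 1)
print(reduced.shape)                               # (5, 3, 100), i.e. (M, 3, T + L)
```

The projection onto the top three right-singular vectors is equivalent to keeping the first three principal components.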
Step 2, generating a training sample set for the long short-term memory LSTM network.
First, the reduced in-sample data form an initial training sample set, and the reduced out-of-sample data form a test sample set.
Second, the data of the last 120 days are taken from the reduced initial training sample set, and K-means clustering is performed on each day's data, yielding constituent stock data of dimension (Q, 3, T + L), where Q is the number of stocks selected most frequently over the 120 days. Concretely, K-means is run on each of the 120 selected days; for each day, the stock closest to each of the K cluster centers is selected, giving K candidate stocks per day; from the candidates selected each day, the Q stocks appearing most frequently over the 120 days are retained, with Q equal to K, which yields the (Q, 3, T + L)-dimensional constituent stock data.
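The per-day clustering and voting procedure can be sketched as follows; the sizes are toy values (the patent uses M = 111 stocks, 120 days, and K = Q = 10), and a minimal hand-rolled Lloyd's-algorithm K-means stands in for a library implementation:

```python
import numpy as np
from collections import Counter

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm; returns the k cluster centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centres[None], axis=2)  # (M, k)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():            # guard against empty clusters
                centres[j] = X[labels == j].mean(axis=0)
    return centres

rng = np.random.default_rng(1)
M, days, K, Q = 30, 10, 4, 4                   # toy sizes
data = rng.random((M, 3, days))                # (stocks, 3 PCA features, days)

votes = Counter()
for t in range(days):
    X = data[:, :, t]                          # each stock's features on day t
    centres = kmeans(X, K, seed=t)
    dists = np.linalg.norm(X[:, None] - centres[None], axis=2)
    votes.update(dists.argmin(axis=0).tolist())  # stock nearest each centre gets a vote

selected = [s for s, _ in votes.most_common(Q)]
print(sorted(selected))                        # indices of the Q most-voted stocks
```

The `selected` indices then pick out the (Q, 3, T + L)-dimensional slice of the constituent stock data.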
Third, a window of length R is slid over the time dimension of the (Q, 3, T)-dimensional training data, obtaining (Q, 3, R)-dimensional data each time, for T − R + 1 groups in total, yielding the training sample set D_train of dimension (Q, 3, R, T − R + 1) required for network training, where 2 < R < T.
Step 3, constructing the long short-term memory LSTM network model.
First, a three-layer long short-term memory LSTM network is built, whose structure is, in order: input layer, hidden layer, output layer.
Second, the batch size of the LSTM network is set to 1 and the number of input-layer nodes to Y, where Y = Q × 3 × S + S, × denotes multiplication, S is the number of delay time steps of forward propagation of the LSTM network, 1 < S < R, and the output dimension of the LSTM network equals Q.
Third, the activation function of the LSTM network is set to the hyperbolic tangent activation function.
Fourth, the loss function f of the LSTM network model is set over the following quantities: Σ denotes summation and ln the logarithm to the natural base e; a indexes the days obtained in the third step of step 2 by sliding a window of length R over the time dimension of the (Q, 3, T)-dimensional training data, and b indexes the stocks in the constituent stock data obtained in the second step of step 2; p_a^b and w_a^b denote the price and weight of stock b on day a, and l_a the index point on day a; likewise p_{a-1}^b, w_{a-1}^b and l_{a-1} refer to day a − 1, and p_{a+1}^b and l_{a+1} to day a + 1.
Fifth, the optimization algorithm of the LSTM network is set to Adam, the optimizer based on adaptive moment estimation.
Step 4, training the long short-term memory LSTM network model: the training sample set D_train is input into the LSTM network; forward propagation is performed with the parameters of the second step of step 3 and the activation function of the third step of step 3, and the error of the network is back-propagated with the loss function of the fourth step of step 3 and the optimization algorithm of the fifth step of step 3, until the loss function converges, yielding the trained LSTM network model.
Step 5, calculating the weights of the stocks in the test set.
From the test sample set obtained in the first step of step 2, the data of the stocks that appear in the constituent stock data of the second step of step 2 are selected to form the test data D_test. D_test is input into the LSTM network together with the parameters of the second step of step 3, the activation function of the third step of step 3, and the trained model parameters obtained in step 4; forward propagation is performed, and the weight of each stock is output.
The invention will now be further described with reference to the following example.
Step 1, data preprocessing.
In the first step, the stock data (containing 18 features) and index point data of the SSE 180 Index for 2009–2018 are obtained with Wind software. The SSE 180 Index (also called the SSE Constituent Index) was obtained by adjusting and renaming the original SSE 30 Index of the Shanghai Stock Exchange; its sample comprises the 180 A-share stocks most representative of the market. The SSE 180 contains 180 constituent stocks, 2009–2018 comprises 2431 trading days in total, and each stock has 18 features, so the original constituent stock data has dimension (180, 18, 2431); after the screening of the second step below, the in-sample data has dimension (111, 18, 2382) and the out-of-sample data (111, 18, 49). The 18 features are listed in Table 1:
table 118 list of features
Price of opening dish | Closing price | Highest price | Lowest price | Rise and fall | Amplitude of fluctuation |
Volume of business | Rate of hand change | Market profit rate | Net rate of market | Market rate | Total market value |
Market rate | Value of A stock market | Total stock book | A stock through book | Average price | Amount of transaction |
In the second step, all stocks whose time range does not cover 2009–2018 are removed from the original constituent stock data, missing data are filled in with averages, and finally 111 stocks are retained, forming constituent stock data of dimension (111, 18, 2431).
In the third step, PCA dimensionality reduction is applied to the constituent stock data, obtaining data of dimension (111, 3, 2431), of which the in-sample data has dimension (111, 3, 2382) and the out-of-sample data (111, 3, 49).
Step 2, generating a training sample set for the long short-term memory LSTM network.
In the first step, the reduced in-sample data obtained in the third step of step 1 form the initial training sample set, and the reduced out-of-sample data form the test sample set.
In the second step, the data of the last 120 days in the initial training sample set obtained in the first step are selected, with dimension (111, 3, 120); each day's data has dimension (111, 3). Each day's data is clustered into 10 clusters, the 10 stocks closest to the cluster centers are selected as that day's candidates, and the 10 stocks appearing most frequently over the 120 days are taken as the stocks contained in the portfolio; their data are extracted from the initial training sample set, giving a training data set of dimension (10, 3, 2382).
In the third step, a window of length 50 is slid along the last (time) dimension of the (10, 3, 2382)-dimensional training data set obtained in the second step, obtaining data of dimension (10, 3, 50) each time, for 2333 groups in total; the result, of dimension (10, 3, 50, 2333), is the training sample set used to train the network.
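The window construction above can be checked with NumPy's `sliding_window_view` on random placeholder data of the same shape (the real data are the clustered SSE 180 features):

```python
import numpy as np

Q, T, R = 10, 2382, 50
train = np.random.default_rng(0).random((Q, 3, T))   # placeholder training data

# Slide a length-R window along the time axis; the window axis is appended last,
# giving (Q, 3, T - R + 1, R); transpose if (Q, 3, R, T - R + 1) order is wanted.
windows = np.lib.stride_tricks.sliding_window_view(train, R, axis=2)
print(windows.shape)   # (10, 3, 2333, 50): T - R + 1 = 2333 windows of (10, 3, 50)
```

Each `windows[:, :, k, :]` slice is one (10, 3, 50) training sample, so no data are copied until a window is used.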
Step 3, constructing the long short-term memory LSTM network model.
First, a three-layer long short-term memory LSTM network is built, whose structure is, in order: input layer, hidden layer, output layer.
In the second step, the batch size of the LSTM network is set to 1, the number of input-layer nodes is set to 620 (Y = Q × 3 × S + S with Q = 10 and S = 20), and the output dimension of the LSTM network is set to 10.
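The input-layer size can be verified directly from the formula in the text:

```python
# Input-layer node count: Y = Q*3*S + S, with Q = 10 selected stocks
# (each carrying 3 PCA features over S delay steps) plus S index points.
Q, S = 10, 20
Y = Q * 3 * S + S
print(Y)  # 620
```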
And thirdly, setting the activation function of the LSTM network to the hyperbolic tangent activation function.
Fourthly, setting the loss function in the LSTM network model as:

f = Σ_a [ ln( Σ_b p_b^a · w_b^{a-1} / Σ_b p_b^{a-1} · w_b^{a-1} ) − ln( l_a / l_{a-1} ) ]² − Σ_a [ ln( Σ_b p_b^{a+1} · w_b^a / Σ_b p_b^a · w_b^a ) − ln( l_{a+1} / l_a ) ]

where f is the loss function, Σ is the summation operation, ln is the logarithm with natural base e, a is the serial number over the days obtained by sliding the (10, 3, 2382)-dimensional training data set by length 50 along the time dimension in the third part of step 2, b is the serial number over the stocks in the constituent stock data obtained in the second part of step 2, p_b^a is the price of the b-th stock in the constituent stock data obtained in the second part of step 2 on day a, w_b^a is the weight of that stock on day a, l_a is the index point on day a, p_b^{a-1} and w_b^{a-1} are the price and weight of the b-th stock on day a−1, l_{a-1} is the index point on day a−1, p_b^{a+1} is the price of the b-th stock on day a+1, and l_{a+1} is the index point on day a+1.
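One plausible numeric reading of this loss, a squared tracking-deviation penalty minus an excess-return reward, can be sketched in NumPy. The array layout and the exact combination of the two terms are assumptions of the sketch, not guaranteed by the patent text.

```python
import numpy as np

def tracking_loss(prices, weights, index):
    """Squared tracking deviation minus excess return (a hedged reading).

    prices:  (n_days, n_stocks) constituent prices  p_b^a
    weights: (n_days, n_stocks) portfolio weights   w_b^a
    index:   (n_days,)          index points        l_a
    """
    # portfolio log-return realised by the previous day's weights
    rp = np.log((prices[1:] * weights[:-1]).sum(1)
                / (prices[:-1] * weights[:-1]).sum(1))
    ri = np.log(index[1:] / index[:-1])                 # index log-return
    diff = rp - ri
    return (diff ** 2).sum() - diff.sum()               # penalty minus reward
```

A portfolio that exactly replicates the index drives both terms to zero, which is the sanity check for any implementation of this loss.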
Fifthly, setting the optimization algorithm of the LSTM network to Adam, the optimization algorithm based on adaptive moment estimation;
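Under the dimensions above (Q = 10 stocks and S = 20 time steps, so 3Q + 1 = 31 features per step and Y = S × (3Q + 1) = 620 input nodes in total), a network of this shape can be sketched with the modern tf.keras API rather than the TensorFlow 1.13 graph API used in the experiments. The hidden-layer width of 64 and the softmax output that turns the Q outputs into portfolio weights are assumptions of the sketch, and MSE stands in for the custom loss above.

```python
import tensorflow as tf

Q, S = 10, 20                      # portfolio size and time steps, from the text
FEATURES = 3 * Q + 1               # 3 PCA features per stock plus the index point

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(S, FEATURES)),      # S x 31 = 620 input nodes
    tf.keras.layers.LSTM(64, activation="tanh"),     # hidden layer (width assumed)
    tf.keras.layers.Dense(Q, activation="softmax"),  # one weight per stock
])
# Adam optimiser as specified; "mse" is a placeholder for the custom loss
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
```

With batch size 1, each training example is one S-step slice of the sliding window, and the softmax output already sums to one across the Q stocks.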
And 4, training the LSTM network model. The training sample set obtained in step 2 is input into the LSTM network constructed in step 3; the LSTM network propagates forward with the parameters from the second part of step 3 and the activation function from the third part of step 3, and its error is back-propagated with the loss function from the fourth part of step 3 and the optimization algorithm from the fifth part of step 3 until the loss function converges, yielding the trained LSTM network model.
And 5, calculating the weights of the stocks in the test set.
From the test sample set obtained in the first part of step 2, the data of the same stocks as the constituent stock data obtained in the second part of step 2 are selected to form test data of dimension (10, 3, 49). The test data are input into the LSTM network, which propagates forward with the trained LSTM network model parameters obtained in step 4 to obtain the weight of each stock.
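If the network's raw outputs are not already normalised, they must be mapped to non-negative weights summing to one before they can serve as portfolio weights. A softmax mapping is one common choice; the patent does not state the normalisation used, so this is an assumption of the sketch.

```python
import numpy as np

def to_weights(raw):
    """Map raw network outputs to portfolio weights via softmax."""
    e = np.exp(raw - raw.max())        # shift for numerical stability
    return e / e.sum()
```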
The effect of the present invention is further explained below in combination with a simulation experiment:
1. Simulation experiment conditions:
The hardware platform of the simulation experiment of the present invention: an Intel(R) Core(TM) i7-8700 CPU with a base frequency of 3.2 GHz and 16 GB of memory.
The software platform of the simulation experiment of the present invention: Windows 10 operating system, Python 3.7.3, TensorFlow 1.13.
The data used in the simulation experiment of the present invention are all data of the SSE 180 index of the Shanghai Stock Exchange from 2009 to 2018.
2. Simulation content and analysis of results:
The simulation experiment of the present invention applies the present invention and five prior-art methods (deterministic policy gradient DPG, recurrent reinforcement learning RRL, deep deterministic policy gradient DDPG, genetic algorithm with recurrent reinforcement learning GA-RRL, and heuristic genetic algorithm HGA) to the SSE 180 data, i.e., the index is tracked using the constituent stocks of the SSE 180 to obtain the investment weight of each stock.
In the simulation experiment, the adopted prior-art methods are as follows:
the DPG method of deterministic behavior strategy in the prior art refers to a deterministic behavior strategy index tracking method proposed by z.jiang et al in the published article "a deterministic learning frame for the deterministic behavior management protocol" (arXiv preprintic xiv:1706.10059,2017), which is called deterministic behavior strategy DPG method for short.
The prior-art recurrent reinforcement learning RRL method refers to the index tracking method proposed by D. W. Lu in the paper "Agent inspired trading using recurrent reinforcement learning and LSTM neural networks" (arXiv preprint arXiv:1707.07338, 2017), called the recurrent reinforcement learning RRL method for short.
The prior-art deep deterministic policy gradient DDPG method refers to the index tracking method proposed by Z. Liang et al. in the paper "Adversarial deep reinforcement learning in portfolio management" (arXiv preprint arXiv:1808.09940, 2018), called the deep deterministic policy gradient DDPG method for short.
The prior-art genetic algorithm with recurrent reinforcement learning GA-RRL refers to the index tracking method combining a genetic algorithm with recurrent reinforcement learning proposed by J. Zhang et al. in the paper "Using a genetic algorithm to improve recurrent reinforcement learning for equity trading" (Computational Economics, vol. 47, no. 4, pp. 551-567, 2016), called the GA-RRL method for short.
The prior-art heuristic genetic algorithm HGA refers to the index tracking method based on a heuristic genetic algorithm proposed by J. E. Beasley et al. in the paper "An evolutionary heuristic for the index tracking problem" (European Journal of Operational Research, vol. 148, no. 3, pp. 621-643, 2003), called the heuristic genetic algorithm HGA method for short.
The results of the six methods in the simulation experiment of the present invention are evaluated with two indices: tracking error TE and excess return ER.
The tracking error TE and the excess return ER of each simulation experiment are calculated with the following formulas, and all results are collected in Table 2:

TE = sqrt( Σ_a [ ln( Σ_b p_b^a · w_b^{a-1} / Σ_b p_b^{a-1} · w_b^{a-1} ) − ln( l_a / l_{a-1} ) ]² )

ER = Σ_a [ ln( Σ_b p_b^{a+1} · w_b^a / Σ_b p_b^a · w_b^a ) − ln( l_{a+1} / l_a ) ]

where Σ is the summation operation, ln is the logarithm with natural base e, a is the serial number of the day, b is the serial number of the stock, p_b^a is the price of the b-th stock on day a, w_b^a is the weight of the b-th stock on day a, l_a is the index point on day a, p_b^{a-1} and w_b^{a-1} are the price and weight of the b-th stock on day a−1, l_{a-1} is the index point on day a−1, p_b^{a+1} is the price of the b-th stock on day a+1, and l_{a+1} is the index point on day a+1.
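The two metrics, a squared log-return tracking deviation and a forward excess log-return, can be computed as sketched below. Whether the patent sums or averages over days is an assumption of the sketch (the sketch averages).

```python
import numpy as np

def tracking_metrics(prices, weights, index):
    """Return (TE, ER) for daily prices, weights and index points.

    prices, weights: (n_days, n_stocks); index: (n_days,).
    """
    rp = np.log((prices[1:] * weights[:-1]).sum(1)
                / (prices[:-1] * weights[:-1]).sum(1))  # portfolio log-returns
    ri = np.log(index[1:] / index[:-1])                 # index log-returns
    te = float(np.sqrt(((rp - ri) ** 2).mean()))        # RMS tracking deviation
    er = float((rp - ri).mean())                        # mean excess log-return
    return te, er
```

For a portfolio that exactly replicates the index, both metrics are zero; a good enhanced tracker keeps TE small while pushing ER above zero.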
Table 2. Quantitative analysis of the results of the present invention and the prior-art methods in the simulation experiment
A smaller tracking error TE indicates lower risk of the portfolio, and a larger excess return ER indicates higher profit of the portfolio. As can be seen from Table 2, in the simulation experiment the tracking error TE of the present enhanced index tracking method is smaller than those of the other five prior-art methods, and its excess return ER is larger than those of the other five prior-art methods, which proves that the present enhanced index tracking method is superior to the prior art.
Claims (3)
1. An enhanced index tracking method based on clustering and an LSTM network, characterized in that a training sample set is generated and a long short-term memory (LSTM) network model is constructed, the method comprising the following steps:
(1) data preprocessing:
(1a) collecting, from a third-party database, the index point data of each trading day over 10 years and the original constituent stock data contained in the index, the time span being (1, T+L), where the data in (1, T) are in-sample data and the data in (T+1, T+L) are out-of-sample data; the index point data has dimension (1, T+L) and the original constituent stock data has dimension (N, P, T+L), where N is the total number of constituent stocks contained in the index, P is the total number of features of each constituent stock with P > 3, ⌊·⌋ denotes the round-down operation, and T+L is the total number of trading days in the 10 years;
(1b) traversing all constituent stocks in the original constituent stock data, removing the constituent stocks that do not cover the full time length T+L, and forming the remaining constituent stocks into constituent stock data of dimension (M, P, T+L), where M is the total number of constituent stocks contained in the constituent stock data;
(1c) normalizing all features in the constituent stock data;
(1d) reducing the dimension of all features in the normalized constituent stock data with the principal component analysis (PCA) method to obtain reduced data of dimension (M, 3, T+L);
(2) generating a training sample set for the long short-term memory LSTM network:
(2a) forming an initial training sample set from the dimension-reduced in-sample data and a test sample set from the dimension-reduced out-of-sample data;
(2b) taking the data of the last 120 days from the dimension-reduced initial training sample set and performing K-means clustering on the data of each day to obtain constituent stock data of dimension (Q, 3, T+L), where Q is the number of stocks appearing most frequently over the 120 days;
(2c) sliding a window of length R along the time dimension of the (Q, 3, T)-dimensional training data set, obtaining data of dimension (Q, 3, R) each time, for T−R+1 groups in total, thus obtaining the training sample set Dtrain of dimension (Q, 3, R, T−R+1) required for network training, where 2 < R < T;
(3) constructing the long short-term memory LSTM network model:
(3a) building a three-layer LSTM network whose structure is, in order: an input layer, a hidden layer, an output layer;
(3b) setting the batch size of the LSTM network to 1, setting the number of input-layer nodes of the LSTM network to Y, where Y = Q × 3 × S + S, × denotes multiplication, S is the number of delayed time steps in the forward propagation of the LSTM network with 1 < S < R, and the output dimension of the LSTM network equals Q;
(3c) setting the activation function of the LSTM network to the hyperbolic tangent activation function;
(3d) setting the loss function in the LSTM network model as:

f = Σ_a [ ln( Σ_b p_b^a · w_b^{a-1} / Σ_b p_b^{a-1} · w_b^{a-1} ) − ln( l_a / l_{a-1} ) ]² − Σ_a [ ln( Σ_b p_b^{a+1} · w_b^a / Σ_b p_b^a · w_b^a ) − ln( l_{a+1} / l_a ) ]

where f is the loss function, Σ is the summation operation, ln is the logarithm with natural base e, a is the serial number over the days obtained by sliding the (Q, 3, T)-dimensional training data set by length R along the time dimension in step (2c), b is the serial number over the stocks in the constituent stock data obtained in step (2b), p_b^a is the price of the b-th stock in the constituent stock data obtained in step (2b) on day a, w_b^a is the weight of that stock on day a, l_a is the index point on day a, p_b^{a-1} and w_b^{a-1} are the price and weight of the b-th stock on day a−1, l_{a-1} is the index point on day a−1, p_b^{a+1} is the price of the b-th stock on day a+1, and l_{a+1} is the index point on day a+1;
(3e) setting the optimization algorithm of the LSTM network to Adam, the optimization algorithm based on adaptive moment estimation;
(4) training the long short-term memory LSTM network model:
inputting the training sample set Dtrain into the LSTM network, propagating forward through the LSTM network with the parameters of step (3b) and the activation function of step (3c), and back-propagating the error of the LSTM network with the loss function of step (3d) and the optimization algorithm of step (3e) until the loss function converges, obtaining the trained LSTM network model;
(5) calculating the weights of the stocks in the test set:
selecting, from the test sample set obtained in step (2a), the data of the same stocks as the constituent stock data obtained in step (2b) to form test data Dtest, inputting the test data Dtest into the LSTM network together with the parameters of step (3b), the activation function of step (3c), and the trained LSTM network model parameters obtained in step (4), propagating forward through the LSTM network, and outputting the weight of each stock.
2. The method according to claim 1, characterized in that the normalization of all features in the constituent stock data in step (1c) is performed according to the following formula:

x̄_i = ( x_i − min(x_i) ) / ( max(x_i) − min(x_i) )

where x̄_i is the normalized value of the i-th feature of all constituent stocks in the constituent stock data, x_i is the value of the i-th feature of all constituent stocks before normalization, min(x_i) is the minimum value of the i-th feature of all constituent stocks before normalization, and max(x_i) is the maximum value of the i-th feature of all constituent stocks before normalization.
3. The enhanced index tracking method based on clustering and an LSTM network according to claim 1, characterized in that the K-means clustering in step (2b) performs K-means clustering on the data of each of the selected 120 days, selects the stock nearest to each of the K cluster centers so that P stocks are selected each day, and selects, from the stocks chosen each day, the Q stocks appearing most frequently over the 120 days to obtain the constituent stock data of dimension (Q, 3, T+L), where P and Q have the same value as K.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911169339.3A CN111028086A (en) | 2019-11-26 | 2019-11-26 | Enhanced index tracking method based on clustering and LSTM network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111028086A true CN111028086A (en) | 2020-04-17 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111950704A (en) * | 2020-08-07 | 2020-11-17 | 哈尔滨工业大学 | Atmospheric temperature data generation method based on merging long-time and short-time memory networks |
CN112884576A (en) * | 2021-02-02 | 2021-06-01 | 上海卡方信息科技有限公司 | Stock trading method based on reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20200417 |