CN111028086A - Enhanced index tracking method based on clustering and LSTM network - Google Patents


Info

Publication number
CN111028086A
CN111028086A (application CN201911169339.3A)
Authority
CN
China
Prior art keywords: data, LSTM network, long, stock, component
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911169339.3A
Other languages
Chinese (zh)
Inventor
鲍亮
张晶
宋金秋
任笑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority: CN201911169339.3A
Publication: CN111028086A
Legal status: Pending


Classifications

    • G06Q40/06 Asset management; Financial planning or analysis
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/23213 Non-hierarchical clustering using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis


Abstract

The invention discloses an enhanced index tracking method based on clustering and a long short-term memory (LSTM) network, comprising the following steps: (1) data preprocessing, including data acquisition, data cleaning, normalization, and dimensionality reduction; (2) generating a training sample set for the LSTM network; (3) constructing the LSTM network model; (4) training the LSTM network model; (5) calculating the weights of the stocks in the test set. The method overcomes the defects of overly complex models and large tracking errors in prior enhanced index tracking: the model it adopts is simple, the weights can be adjusted dynamically, and the tracking error is small.

Description

Enhanced index tracking method based on clustering and LSTM network
Technical Field
The invention belongs to the technical field of computers, and further relates to an enhanced index tracking method based on clustering and a long short-term memory (LSTM) network in the technical field of data processing. The invention can be used for enhanced index tracking.
Background
Index enhancement adds active investment on top of passive index tracking, adjusting the portfolio appropriately to pursue positive excess returns while controlling risk. An index enhancement strategy does not completely replicate the constituent stocks of the tracked index; instead, it increases the weight of stocks viewed favorably and reduces, or entirely removes, the weight of stocks viewed unfavorably. The investment goal is to obtain returns above the benchmark while still closely tracking the benchmark index and controlling active risk.
Currently used enhanced index tracking methods fall into three types. The first is rule-based: problems are solved with domain expertise and mathematical models, which requires accurate data and massive computation and is limited by factors such as non-positive-definite matrices. The second is based on heuristic algorithms, which search a space for an optimal solution; in high-dimensional spaces the search easily falls into local optima, which degrades performance. The third is learning-based, solving the problem with machine learning models such as neural network models and reinforcement learning models.
An enhanced index tracking method based on a deep attention network and reinforcement learning is disclosed in the patent document "Portfolio selection method based on deep attention network and reinforcement learning" (application number: 201910390018X, application date: 2019.05.10, application publication number: CN110223180A) filed by Beihang University. The method introduces a neural network model with an attention mechanism into the financial field, takes the Sharpe ratio as the reward function, and trains the model in a reinforcement learning framework to balance return and risk when generating portfolio selections. It also proposes a new cross-asset attention mechanism to model the correlation among different assets, and explores model interpretability in depth. Its disadvantage is that when the Sharpe ratio used as the reward function is negative, learning of the reinforcement learning model is hindered and the model becomes unstable; as a result, the error between the benchmark index and the portfolio constructed from the output weights is too large.
A stock index tracking and prediction method and system based on social network clustering is disclosed in the patent document filed by Nanjing University (application number: 2017101004662, application date: 2017.02.23, application publication number: CN106897797A). The method first collects index and constituent-stock data for the previous and current months from a third-party database and cleans the data to obtain in-sample and out-of-sample data usable for research. It then computes a distance measure from the correlation coefficients among the constituent stocks, constructs a social network among them, clusters the network with an adaptive affinity propagation clustering algorithm, and extracts the cluster center of each cluster to form a stock pool; optimal tracking of the index by the stocks in the pool determines the optimal index tracking weights. Finally, the stock pool and optimal weights obtained from in-sample training are applied to out-of-sample index tracking to obtain a predicted index. The invention also provides a stock index tracking and prediction system; the constructed stock pool has low correlation, small tracking error, and good replication stability, and tracks the index accurately. Its disadvantage is that the weights obtained from in-sample data are used directly on out-of-sample data without dynamic adjustment, so the tracking error relative to the benchmark index is too large.
Disclosure of Invention
The invention aims to provide an enhanced index tracking method based on clustering and an LSTM network to address the defects of the prior art, solving the problems of complex models, large computation, and excessive tracking error.
The idea for realizing this purpose is: preprocess the data, screen the stocks with a clustering method, construct a training data set for the long short-term memory (LSTM) network with a sliding window, train the LSTM network on this training set, and finally input the test data set into the trained LSTM network to compute the weight of each stock.
The technical scheme of the invention comprises the following steps:
(1) data preprocessing:
(1a) collecting, from a third-party database, the index point data of each trading day over 10 years and the original constituent-stock data contained in the index, with time span (1, T+L); the in-sample data lies in (1, T) and (T+1, T+L) is the out-of-sample data. The index point data has dimension (1, T+L) and the original constituent-stock data has dimension (N, P, T+L), where N is the total number of constituent stocks contained in the index, P is the total number of features of each constituent stock, P > 3, T+L is the total number of trading days in the 10 years, and ⌊·⌋ denotes the round-down (floor) operation;
(1b) traversing all constituent stocks in the original constituent-stock data, removing those that do not cover the full time length T+L, and forming the remaining stocks into constituent-stock data of dimension (M, P, T+L), where M is the total number of constituent stocks contained in that data;
(1c) normalizing all features in the constituent-stock data;
(1d) reducing the dimensionality of all features in the normalized constituent-stock data with principal component analysis (PCA), obtaining reduced data of dimension (M, 3, T+L);
(2) generating a training sample set for the long short-term memory LSTM network:
(2a) forming an initial training sample set from the reduced in-sample data and a test sample set from the reduced out-of-sample data;
(2b) taking the data of the last 120 days from the reduced initial training sample set and performing K-means clustering on each day's data to obtain constituent-stock data of dimension (Q, 3, T+L), where Q is the number of stocks occurring most frequently over the 120 days;
(2c) sliding a window of length R over the time dimension of the (Q, 3, T)-dimensional training data, obtaining data of dimension (Q, 3, R) each time, for T−R+1 groups in total, which yields the training sample set D_train of dimension (Q, 3, R, T−R+1) required for network training, where 2 < R < T;
(3) constructing the long short-term memory LSTM network model:
(3a) building a three-layer long short-term memory LSTM network whose structure is, in order: input layer, hidden layer, output layer;
(3b) setting the batch size of the LSTM network to 1 and the number of input-layer nodes to Y, with Y = Q × 3 × S + S, where × denotes multiplication and S is the number of time steps over which the LSTM network propagates forward, 1 < S < R; the output dimension of the LSTM network equals Q;
(3c) setting the activation function of the LSTM network to the hyperbolic tangent;
(3d) setting the loss function of the LSTM network model as:

f = Σ_a [ ln( Σ_b p_a^b · w_a^b / Σ_b p_{a−1}^b · w_{a−1}^b ) − ln( l_a / l_{a−1} ) ]² − Σ_a [ ln( Σ_b p_{a+1}^b · w_a^b / Σ_b p_a^b · w_a^b ) − ln( l_{a+1} / l_a ) ]

where f is the loss function, Σ is the summation operation, ln is the logarithm with base e, a indexes the days obtained in step (2c) by sliding a window of length R over the time dimension of the (Q, 3, T)-dimensional training data, b indexes the stocks in the constituent-stock data obtained in step (2b), p_a^b is the price of the b-th stock on day a, w_a^b is its weight on day a, and l_a is the index point level on day a (p_{a−1}^b, w_{a−1}^b, l_{a−1} and p_{a+1}^b, l_{a+1} are the corresponding quantities on days a−1 and a+1);
(3e) setting the optimization algorithm of the LSTM network to Adam, based on adaptive moment estimation;
(4) training the long short-term memory LSTM network model:
inputting the training sample set D_train into the LSTM network; with the parameters of step (3b) and the activation function of step (3c), propagate the LSTM network forward, and with the loss function of step (3d) and the optimization algorithm of step (3e), propagate the error backward until the loss function converges, obtaining the trained LSTM network model;
(5) calculating the weights of the stocks in the test set:
selecting, from the test sample set obtained in step (2a), the data of the same stocks as the constituent-stock data obtained in step (2b) to form the test data D_test; inputting D_test into the LSTM network with the parameters of step (3b), the activation function of step (3c), and the trained model parameters obtained in step (4); propagating the LSTM network forward and outputting the weight of each stock.
Compared with the prior art, the invention has the following advantages:
First, the invention uses principal component analysis (PCA) and K-means clustering when constructing the network's training sample set, overcoming the complex and expensive computation of prior-art models, so the method computes quickly.
Second, because the invention uses a long short-term memory LSTM network model and a sliding window when computing stock weights, it overcomes the prior art's inability to adjust weights dynamically, so the index tracking error is smaller.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The specific steps of the present invention will be further described with reference to fig. 1.
Step 1, data preprocessing.
First, collect from a third-party database the index point data of each trading day over 10 years and the original constituent-stock data contained in the index, with time span (1, T+L); the in-sample data lies in (1, T) and (T+1, T+L) is the out-of-sample data. The index point data has dimension (1, T+L) and the original constituent-stock data has dimension (N, P, T+L), where N is the total number of constituent stocks contained in the index, P is the total number of features of each constituent stock, P > 3, T+L is the total number of trading days in the 10 years, and ⌊·⌋ denotes the round-down (floor) operation.
Second, traverse all constituent stocks in the constituent-stock data, remove those that do not cover the full time length T+L, and form the remaining stocks into constituent-stock data of dimension (M, P, T+L), where M is the total number of constituent stocks contained in that data.
Third, normalize all features in the constituent-stock data according to the following formula:

x̂_i = (x_i − x_i^min) / (x_i^max − x_i^min)

where x̂_i is the normalized value of the i-th feature of all constituent stocks in the data, x_i is the value of the i-th feature before normalization, and x_i^min and x_i^max are the minimum and maximum values of the i-th feature over all constituent stocks before normalization.
Fourth, reduce the dimensionality of all features in the normalized constituent-stock data with principal component analysis (PCA) to obtain reduced data of dimension (M, 3, T+L).
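By way of illustration only (not part of the claimed invention), the normalization and PCA steps above can be sketched with NumPy and scikit-learn. The array sizes and random input are placeholders, and normalizing each feature over all stocks and days is an assumption about the normalization axis:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder dimensions: M stocks, P features, T_total trading days.
M, P, T_total = 111, 18, 2431
rng = np.random.default_rng(0)
data = rng.random((M, P, T_total))  # stand-in for cleaned constituent-stock data

# Min-max normalization per feature, computed over all stocks and days.
flat = data.transpose(1, 0, 2).reshape(P, -1)          # (P, M*T_total)
mins, maxs = flat.min(axis=1), flat.max(axis=1)
norm = (data - mins[None, :, None]) / (maxs - mins)[None, :, None]

# PCA: reduce the P features of each (stock, day) sample to 3 components.
samples = norm.transpose(0, 2, 1).reshape(-1, P)       # (M*T_total, P)
reduced = PCA(n_components=3).fit_transform(samples)
reduced = reduced.reshape(M, T_total, 3).transpose(0, 2, 1)  # (M, 3, T_total)
print(reduced.shape)  # (111, 3, 2431)
```

The reduced array can then be split along the time axis into the in-sample and out-of-sample portions described in step 2.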
And 2, generating a training sample set for memorizing the LSTM network at long time and short time.
First, form an initial training sample set from the reduced in-sample data and a test sample set from the reduced out-of-sample data.
Second, take the data of the last 120 days from the reduced initial training sample set and perform K-means clustering on each day's data to obtain constituent-stock data of dimension (Q, 3, T+L). Here K-means clustering means: for each of the selected 120 days, cluster that day's data into K clusters and select the stock closest to each of the K cluster centers, giving K candidate stocks per day; from these daily candidates, select the Q stocks occurring most frequently over the 120 days to obtain the (Q, 3, T+L)-dimensional constituent-stock data, where Q equals K.
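The daily clustering and stock selection described in this step can be sketched as follows (illustrative only; the sizes and random stand-in data are placeholders, not part of the invention):

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.default_rng(1)
M, K, DAYS = 111, 10, 120
daily = rng.random((DAYS, M, 3))  # stand-in for the last 120 days of (M, 3) data

counter = Counter()
for day in daily:
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(day)
    # The stock closest to each of the K cluster centers is a daily candidate.
    nearest = pairwise_distances_argmin(km.cluster_centers_, day)
    counter.update(nearest.tolist())

# The Q stocks appearing most often over the 120 days form the portfolio (Q = K).
Q = K
selected = [stock for stock, _ in counter.most_common(Q)]
print(len(selected))  # 10
```

The selected stock indices are then used to slice the (M, 3, T) training data down to the (Q, 3, T) data used in the next step.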
Third, slide a window of length R over the time dimension of the (Q, 3, T)-dimensional training data, obtaining data of dimension (Q, 3, R) each time, for T−R+1 groups in total, which yields the training sample set D_train of dimension (Q, 3, R, T−R+1) required for network training, where 2 < R < T.
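The sliding-window construction above can be sketched in a few lines of NumPy (illustrative only; the dimensions follow the embodiment, and the stand-in data is a placeholder):

```python
import numpy as np

# Q stocks, T in-sample days, window length R (values from the embodiment).
Q, T, R = 10, 2382, 50
train = np.arange(Q * 3 * T, dtype=float).reshape(Q, 3, T)  # stand-in training data

# Slide a window of length R along the time axis: T - R + 1 windows in total.
windows = np.stack([train[:, :, s:s + R] for s in range(T - R + 1)], axis=-1)
print(windows.shape)  # (10, 3, 50, 2333)
```

Each (Q, 3, R) slice is one training sample; stacking them gives the (Q, 3, R, T−R+1) training set D_train.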
And 3, constructing a long-term memory LSTM network model.
First, build a three-layer long short-term memory LSTM network whose structure is, in order: input layer, hidden layer, output layer.
Second, set the batch size of the LSTM network to 1 and the number of input-layer nodes to Y, with Y = Q × 3 × S + S, where × denotes multiplication and S is the number of time steps over which the LSTM network propagates forward, 1 < S < R; the output dimension of the LSTM network equals Q.
Third, set the activation function of the LSTM network to the hyperbolic tangent.
Fourth, set the loss function of the LSTM network model as:

f = Σ_a [ ln( Σ_b p_a^b · w_a^b / Σ_b p_{a−1}^b · w_{a−1}^b ) − ln( l_a / l_{a−1} ) ]² − Σ_a [ ln( Σ_b p_{a+1}^b · w_a^b / Σ_b p_a^b · w_a^b ) − ln( l_{a+1} / l_a ) ]

where f is the loss function, Σ is the summation operation, ln is the logarithm with base e, a indexes the days obtained in the third step of step 2 by sliding a window of length R over the time dimension of the (Q, 3, T)-dimensional training data, b indexes the stocks in the constituent-stock data obtained in the second step of step 2, p_a^b is the price of the b-th stock on day a, w_a^b is its weight on day a, and l_a is the index point level on day a (p_{a−1}^b, w_{a−1}^b, l_{a−1} and p_{a+1}^b, l_{a+1} are the corresponding quantities on days a−1 and a+1).
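The loss formula in the original is reproduced only as an image, so the following NumPy sketch is a hypothetical reading consistent with the symbol definitions in this step (prices p, weights w, index points l): a squared log-return tracking-error term minus a next-day excess log-return term. The exact functional form is an assumption, as are the stand-in data sizes:

```python
import numpy as np

def tracking_loss(prices, weights, index):
    """Hypothetical enhanced-index-tracking loss: squared log-return tracking
    error minus the next-day excess log-return of the current portfolio.

    prices:  (A, B) price of stock b on day a
    weights: (A, B) portfolio weight of stock b on day a
    index:   (A,)   index point level on day a
    """
    port = (prices * weights).sum(axis=1)      # portfolio value per day
    port_ret = np.log(port[1:] / port[:-1])    # portfolio log-returns
    idx_ret = np.log(index[1:] / index[:-1])   # index log-returns
    te = ((port_ret - idx_ret) ** 2).sum()     # tracking-error term (minimize)
    # Next-day return of today's weights, relative to the index (maximize).
    next_port = (prices[1:] * weights[:-1]).sum(axis=1)
    excess = (np.log(next_port / port[:-1]) - idx_ret).sum()
    return te - excess

rng = np.random.default_rng(2)
A, B = 6, 4
prices = rng.uniform(10, 20, (A, B))
weights = rng.dirichlet(np.ones(B), size=A)
index = rng.uniform(3000, 3100, A)
print(float(tracking_loss(prices, weights, index)))
```

Under this reading, a portfolio that replicates the index exactly with fixed weights yields a loss of zero, which is the behavior an enhanced tracking objective of this shape would be expected to have.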
Fifth, set the optimization algorithm of the LSTM network to Adam, based on adaptive moment estimation.
step 4, training the long-time memory LSTM network model, and collecting the training sample set DtrainInputting the parameters into the long-short time memory LSTM network, performing forward propagation on the long-short time memory LSTM network by using the parameters in the second step in the step 3 and the activation function in the third step in the step 3, and performing backward propagation on the error of the long-short time memory LSTM network by using the loss function in the fourth step in the step 3 and the optimization algorithm in the fifth step in the step 3 until the loss function converges to obtain a trained long-short time memory LSTM network model.
And 5, calculating the weight of the stocks in the test set.
Select, from the test sample set obtained in the first step of step 2, the data of the same stocks as the constituent-stock data obtained in the second step of step 2 to form the test data D_test; input D_test into the LSTM network with the parameters of the second step of step 3, the activation function of the third step of step 3, and the trained model parameters obtained in step 4; propagate the network forward and output the weight of each stock.
The invention will now be further described with reference to the following embodiment.
Step 1, data preprocessing.
In the first step, the method uses Wind software to obtain the constituent-stock data (containing 18 features) and index point data of the SSE 180 index for 2009-2018. The SSE 180 index (also called the SSE constituent index) was obtained by adjusting and renaming the original SSE 30 index of the Shanghai Stock Exchange; its sample consists of the 180 stocks most representative of the market among all A-shares. The SSE 180 contains 180 constituent stocks, 2009-2018 comprises 2431 trading days, and each stock has 18 features; after cleaning (see the second step below), 111 stocks remain, so the constituent-stock data has dimension (111, 18, 2431), the in-sample data dimension is (111, 18, 2382), and the out-of-sample data dimension is (111, 18, 49). The 18 features are listed in Table 1:
Table 1: the 18 features
Opening price | Closing price | Highest price | Lowest price | Price change | Amplitude
Trading volume | Turnover rate | P/E ratio | P/B ratio | P/S ratio | Total market value
P/CF ratio | A-share market value | Total share capital | A-share circulating capital | Average price | Turnover amount
In the second step, remove from the original constituent-stock data all stocks whose time range does not cover 2009-2018, and fill missing data with the mean; 111 stocks are finally retained, forming constituent-stock data of dimension (111, 18, 2431).
In the third step, perform principal component analysis (PCA) dimensionality reduction on the constituent-stock data to obtain data of dimension (111, 3, 2431), of which the in-sample data is (111, 3, 2382)-dimensional and the out-of-sample data is (111, 3, 49)-dimensional.
Step 2, generate a training sample set for the long short-term memory LSTM network.
In the first step, form an initial training sample set from the reduced in-sample data obtained in the third step of step 1, and a test sample set from the reduced out-of-sample data.
In the second step, select the data of the last 120 days of the initial training sample set obtained in the first step, of dimension (111, 3, 120), each day's data being (111, 3)-dimensional. Cluster each day's data with 10 clusters, select the 10 stocks closest to the cluster centers in each day's data as candidates, take the 10 stocks occurring most frequently over the 120 days as the stocks in the portfolio, and extract the selected stocks' data from the initial training sample set to obtain a training data set of dimension (10, 3, 2382).
In the third step, slide a window of length 50 over the last (time) dimension of the (10, 3, 2382)-dimensional training data set obtained in the second step, obtaining data of dimension (10, 3, 50) each time, for 2333 groups in total; this (10, 3, 50, 2333)-dimensional data is the training sample set used to train the network.
Step 3, construct the long short-term memory LSTM network model.
In the first step, build a three-layer long short-term memory LSTM network whose structure is, in order: input layer, hidden layer, output layer.
In the second step, set the batch size of the LSTM network to 1, the number of input-layer nodes to 620 (Y = Q × 3 × S + S with Q = 10 and S = 20), and the output dimension of the LSTM network to 10.
In the third step, set the activation function of the LSTM network to the hyperbolic tangent.
In the fourth step, set the loss function of the LSTM network model as:

f = Σ_a [ ln( Σ_b p_a^b · w_a^b / Σ_b p_{a−1}^b · w_{a−1}^b ) − ln( l_a / l_{a−1} ) ]² − Σ_a [ ln( Σ_b p_{a+1}^b · w_a^b / Σ_b p_a^b · w_a^b ) − ln( l_{a+1} / l_a ) ]

where f is the loss function, Σ is the summation operation, ln is the logarithm with base e, a indexes the days obtained in the third step of step 2 by sliding a window of length 50 over the time dimension of the (10, 3, 2382)-dimensional training data set, b indexes the stocks in the constituent-stock data obtained in the second step of step 2, p_a^b is the price of the b-th stock on day a, w_a^b is its weight on day a, and l_a is the index point level on day a (p_{a−1}^b, w_{a−1}^b, l_{a−1} and p_{a+1}^b, l_{a+1} are the corresponding quantities on days a−1 and a+1).
Fifth, the optimization algorithm of the LSTM network is set to Adam, the optimizer based on adaptive moment estimation.
Step 4, training the LSTM network model. The training sample set obtained in step 2 is input into the LSTM network constructed in step 3; the network is forward-propagated with the parameters of the second step of step 3 and the activation function of the third step of step 3, and its error is back-propagated with the loss function of the fourth step of step 3 and the optimization algorithm of the fifth step of step 3, until the loss function converges, yielding the trained LSTM network model.
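The train-until-convergence loop of step 4 can be sketched as below. The hidden size, the toy data, and the mean-squared-error stand-in for the patent's tracking loss (which needs the price series to evaluate) are all illustrative assumptions; the batch size of 1 and the Adam optimizer follow the text.

```python
import numpy as np
import tensorflow as tf

S, FEATS, Q = 20, 31, 10
inputs = tf.keras.Input(shape=(S, FEATS))
hidden = tf.keras.layers.LSTM(16, activation="tanh")(inputs)       # small for the demo
outputs = tf.keras.layers.Dense(Q, activation="softmax")(hidden)
model = tf.keras.Model(inputs, outputs)

# Adam, as in the patent; MSE against toy target weights stands in for the
# tracking loss, which couples the output weights with the price series.
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")

rng = np.random.default_rng(0)
x = rng.random((8, S, FEATS)).astype("float32")   # toy training windows
y = np.zeros((8, Q), dtype="float32")
y[:, 0] = 1.0                                     # toy target: all weight on stock 0

history = model.fit(x, y, batch_size=1, epochs=20, verbose=0)
losses = history.history["loss"]                  # per-epoch loss, for convergence checks
```

In practice one would monitor `losses` and stop once the change between epochs falls below a tolerance, which is the "until the loss function converges" criterion of step 4.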
Step 5, computing the weights of the stocks in the test set.
From the test sample set obtained in the first step of step 2, the data of the same stocks as the component-stock data obtained in the second step of step 2 are selected to form test data of dimension (10, 3, 49); the test data are input into the LSTM network, which is forward-propagated with the trained LSTM network model parameters obtained in step 4 to obtain the weight of each stock.
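How a (10, 3, 49)-dimensional test cube is turned into the 620-value network input is not spelled out; one consistent reading, assumed here, is that each input window stacks S = 20 consecutive days, each day contributing the 3 features of each of the 10 stocks plus that day's index point.

```python
import numpy as np

Q, S = 10, 20
feats = np.random.rand(Q, 3, 49)      # (stocks, PCA features, days) test cube
index = np.random.rand(49) + 1.0      # index point per day (illustrative)

def make_window(feats, index, t, S=20):
    """Flatten days t-S+1..t into one network input of S x (3*Q + 1) = 620
    values: per day, the 3 features of each of the Q stocks, then the index
    point.  Returns shape (S, 3*Q + 1), ready for an LSTM with S time steps."""
    days = range(t - S + 1, t + 1)
    rows = [np.concatenate([feats[:, :, d].ravel(), [index[d]]]) for d in days]
    return np.stack(rows)

x = make_window(feats, index, t=48)   # window ending on the last test day
```

Feeding `x[np.newaxis]` to the trained network then yields the 10 portfolio weights for the day after `t`.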
The effect of the invention is further illustrated by the following simulation experiment.
1. Simulation conditions:
Hardware platform: Intel(R) Core(TM) i7-8700 CPU, 3.2 GHz main frequency, 16 GB memory.
Software platform: Windows 10 operating system, Python 3.7.3, TensorFlow 1.13.
Data: all data of the SSE 180 index of the Shanghai Stock Exchange from 2009 to 2018.
2. Simulation content and result analysis:
The simulation applies the invention and five prior-art methods (deterministic policy gradient DPG, recurrent reinforcement learning RRL, deep deterministic policy gradient DDPG, genetic algorithm with recurrent reinforcement learning GA-RRL, and heuristic genetic algorithm HGA) to the SSE 180 data, i.e., the index is tracked with the SSE 180 component stocks to obtain the investment weight of each stock.
The prior-art methods used in the simulation are:
The DPG method refers to the index tracking approach based on the deterministic policy gradient proposed by Z. Jiang et al. in "A deep reinforcement learning framework for the financial portfolio management problem" (arXiv preprint arXiv:1706.10059, 2017), DPG for short.
The RRL method refers to the recurrent reinforcement learning index tracking approach proposed by D. W. Lu in "Agent inspired trading using recurrent reinforcement learning and LSTM neural networks" (arXiv preprint arXiv:1707.07338, 2017), RRL for short.
The DDPG method refers to the deep deterministic policy gradient index tracking approach proposed by Z. Liang et al. in "Adversarial deep reinforcement learning in portfolio management" (arXiv preprint arXiv:1808.09940, 2018), DDPG for short.
The GA-RRL method refers to the index tracking approach combining a genetic algorithm with recurrent reinforcement learning proposed by J. Zhang et al. in "Using a genetic algorithm to improve recurrent reinforcement learning for equity trading" (Computational Economics, vol. 47, no. 4, pp. 551-567, 2016), GA-RRL for short.
The HGA method refers to the index tracking approach based on a heuristic genetic algorithm proposed by J. E. Beasley et al. in "An evolutionary heuristic for the index tracking problem" (European Journal of Operational Research, vol. 148, no. 3, pp. 621-643, 2003), HGA for short.
The results of the six methods are evaluated with two metrics: tracking error (TE) and excess return (ER).
The TE and ER of each simulation experiment are computed with the following formulas, and all results are collected in Table 2.
$$\mathrm{TE}=\frac{1}{A}\sum_a\left|\ln\frac{\sum_b w_a^b\,p_a^b}{\sum_b w_{a-1}^b\,p_{a-1}^b}-\ln\frac{l_a}{l_{a-1}}\right|$$

$$\mathrm{ER}=\frac{1}{A}\sum_a\left(\ln\frac{\sum_b w_a^b\,p_{a+1}^b}{\sum_b w_a^b\,p_a^b}-\ln\frac{l_{a+1}}{l_a}\right)$$

where Σ denotes summation, ln is the logarithm to the natural base e, A is the number of days summed over, a is the day index, b is the stock index, $p_a^b$ and $w_a^b$ are the price and weight of the b-th stock on day a, and $l_a$ is the index point on day a (with the quantities for days a−1 and a+1 defined analogously).
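The two metrics can be computed directly from the daily weights, prices, and index points. The NumPy sketch below follows the TE/ER definitions as read above; the array shapes are assumptions.

```python
import numpy as np

def te_er(p, w, l):
    """p, w: (A, Q) daily prices and weights; l: (A,) index points.
    TE: mean absolute deviation of the portfolio log-return from the index
    log-return; ER: mean next-day log-return in excess of the index."""
    v = (w * p).sum(axis=1)                                # portfolio value per day
    diff = np.log(v[1:-1] / v[:-2]) - np.log(l[1:-1] / l[:-2])
    te = float(np.abs(diff).mean())
    nxt = (w[1:-1] * p[2:]).sum(axis=1)                    # next-day portfolio value
    er = float((np.log(nxt / v[1:-1]) - np.log(l[2:] / l[1:-1])).mean())
    return te, er

# sanity check: a portfolio that is literally the index (one asset, full
# weight, price equal to the index point) scores zero on both metrics
l = np.linspace(100.0, 120.0, 30)
te, er = te_er(l[:, None], np.ones((30, 1)), l)
```

As the surrounding text notes, a smaller TE means the portfolio tracks the index more closely (lower risk) and a larger ER means more return above the index.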
Table 2. Quantitative comparison of the invention and the prior-art methods in the simulation experiment
(Table 2 lists the tracking error TE and excess return ER of each of the six methods; the numerical values appear only in the original drawing.)
A smaller tracking error TE indicates lower portfolio risk, and a larger excess return ER indicates higher portfolio profit. Table 2 shows that, in the simulation experiment, the TE of the proposed enhanced index tracking method is smaller, and its ER larger, than those of the five prior-art methods, demonstrating that the proposed method is superior to the prior art.

Claims (3)

1. An enhanced index tracking method based on clustering and an LSTM network, characterized in that a training sample set is generated and a long short-term memory (LSTM) network model is constructed, the method comprising the following steps:
(1) data preprocessing:
(1a) collecting, from a third-party database, the index point data of every trading day over 10 years and the original component-stock data contained in the index, the time span being (1, T+L), with the in-sample data in (1, T) and the out-of-sample data in (T+1, T+L); the index point data has dimension (1, T+L) and the original component-stock data has dimension (N, P, T+L), where N is the total number of component stocks contained in the index, P is the total number of features of each component stock with P > 3, T is obtained by a round-down (floor) operation splitting the T+L days into the in-sample and out-of-sample parts, and T+L is the total number of trading days in the 10 years;
(1b) traversing all component stocks in the original component-stock data, removing those that do not cover the full time length T+L, and forming the remaining stocks into component-stock data of dimension (M, P, T+L), where M is the total number of component stocks retained;
(1c) normalizing all features in the component-stock data;
(1d) reducing the dimensionality of all features in the normalized component-stock data with principal component analysis (PCA), obtaining reduced data of dimension (M, 3, T+L);
(2) generating the training sample set for the LSTM network:
(2a) forming an initial training sample set from the reduced in-sample data and a test sample set from the reduced out-of-sample data;
(2b) taking the last 120 days of data from the reduced initial training sample set and performing K-means clustering on each day's data to obtain component-stock data of dimension (Q, 3, T+L), where Q is the number of stocks appearing most frequently over the 120 days;
(2c) sliding a window of length R over the time dimension of the (Q, 3, T)-dimensional training data, obtaining (Q, 3, R)-dimensional data each time, T−R+1 groups in total, to form the (Q, 3, R, T−R+1)-dimensional training sample set $D_{train}$ required for network training, where 2 < R < T;
(3) constructing the LSTM network model:
(3a) building a three-layer LSTM network structured in order as: input layer, hidden layer, output layer;
(3b) setting the batch size of the LSTM network to 1 and the number of input-layer nodes to Y = Q×3×S + S, where × denotes multiplication and S is the number of time steps the LSTM network propagates forward, with 1 < S < R; the output dimension of the LSTM network equals Q;
(3c) setting the activation function of the LSTM network to the hyperbolic tangent;
(3d) setting the loss function of the LSTM network model as:

$$f=\sum_a\left(\ln\frac{\sum_b w_a^b\,p_a^b}{\sum_b w_{a-1}^b\,p_{a-1}^b}-\ln\frac{l_a}{l_{a-1}}\right)^{2}-\sum_a\left(\ln\frac{\sum_b w_a^b\,p_{a+1}^b}{\sum_b w_a^b\,p_a^b}-\ln\frac{l_{a+1}}{l_a}\right)$$

where f is the loss function, Σ denotes summation, ln is the logarithm to the natural base e, a runs over the days obtained in step (2c) by sliding the window of length R over the time dimension of the (Q, 3, T)-dimensional data, b runs over the stocks in the component-stock data obtained in step (2b), $p_a^b$ and $w_a^b$ are the price and weight on day a of the b-th stock of the component-stock data obtained in step (2b), and $l_a$ is the index point on day a (with $p_{a-1}^b$, $w_{a-1}^b$, $l_{a-1}$ and $p_{a+1}^b$, $l_{a+1}$ defined analogously for days a−1 and a+1);
(3e) setting the optimization algorithm of the LSTM network to Adam, the optimizer based on adaptive moment estimation;
(4) training the LSTM network model:
inputting the training sample set $D_{train}$ into the LSTM network, forward-propagating the network with the parameters of step (3b) and the activation function of step (3c), and back-propagating the network error with the loss function of step (3d) and the optimization algorithm of step (3e) until the loss function converges, obtaining the trained LSTM network model;
(5) computing the weights of the stocks in the test set:
selecting, from the test sample set obtained in step (2a), the data of the same stocks as the component-stock data obtained in step (2b) to form the test data $D_{test}$; inputting $D_{test}$ into the LSTM network, forward-propagating the network with the parameters of step (3b), the activation function of step (3c), and the trained LSTM network model parameters obtained in step (4), and outputting the weight of each stock.
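Of the steps claimed above, the window construction of step (2c) is mechanical enough to state in code. The NumPy sketch below slides a length-R window over the time axis of the (Q, 3, T) training cube; Q = 10, T = 200, and R = 50 are illustrative values, not fixed by the claim.

```python
import numpy as np

def build_train_set(data, R):
    """data: (Q, 3, T) reduced component-stock cube.  Returns the
    (Q, 3, R, T-R+1) training set of step (2c): every length-R slice of
    the time axis, T-R+1 windows in total."""
    Q, F, T = data.shape
    windows = [data[:, :, i:i + R] for i in range(T - R + 1)]
    return np.stack(windows, axis=-1)

data = np.random.rand(10, 3, 200)     # illustrative Q = 10, T = 200
d_train = build_train_set(data, R=50)
```

The first window (index 0 on the last axis) is exactly the first R days of the cube, which is the sanity check one would run after this step.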
2. The method of claim 1, wherein the normalization of all features in the component-stock data in step (1c) is performed according to the following formula:

$$\hat{x}_i=\frac{x_i-x_i^{\min}}{x_i^{\max}-x_i^{\min}}$$

where $\hat{x}_i$ is the normalized value of the i-th feature over all component stocks in the component-stock data, $x_i$ is the value of the i-th feature of all component stocks before normalization, and $x_i^{\min}$ and $x_i^{\max}$ are the minimum and maximum of the i-th feature over all component stocks before normalization.
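Claim 2's per-feature min-max normalization can be sketched in NumPy as below; taking each feature's minimum and maximum over all stocks and all days is our reading of "all component stocks", and the data shape is illustrative.

```python
import numpy as np

def minmax_normalize(data):
    """data: (M, P, T) component-stock cube (stocks, features, days).
    Each feature i is rescaled to [0, 1] by its minimum and maximum over
    all stocks and days: (x_i - x_i_min) / (x_i_max - x_i_min)."""
    mn = data.min(axis=(0, 2), keepdims=True)   # per-feature minimum
    mx = data.max(axis=(0, 2), keepdims=True)   # per-feature maximum
    return (data - mn) / (mx - mn)

norm = minmax_normalize(np.random.rand(30, 5, 100) * 50.0)
```

Features on very different scales (prices versus volumes, say) end up in the common range [0, 1], which is the point of this step before PCA.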
3. The enhanced index tracking method based on clustering and an LSTM network of claim 1, wherein the K-means clustering in step (2b) performs, for each of the selected 120 days, K-means clustering of that day's data, selects the stock nearest each of the K cluster centres, P stocks per day, and then selects, from the stocks picked each day, the Q stocks appearing most frequently over the 120 days, obtaining the (Q, 3, T+L)-dimensional component-stock data, where P and Q both equal K.
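The selection rule of claim 3 — cluster each day's stocks, keep the stock nearest each centre, then keep the most frequent picks over the window — can be sketched as follows. The tiny K-means here is a stand-in (the claim does not fix an implementation), and the data dimensions are illustrative.

```python
import numpy as np
from collections import Counter

def kmeans(X, k, iters=50, seed=0):
    """Minimal K-means: X is (n_points, n_features); returns k centres."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(0)   # recompute non-empty centres
    return C

def select_stocks(daily_feats, k):
    """daily_feats: (days, M, 3) features of the M candidate stocks per day.
    Per day: cluster into k groups and keep the stock nearest each centre;
    overall: keep the (at most) k stocks that appear most often."""
    votes = Counter()
    for X in daily_feats:
        C = kmeans(X, k)
        nearest = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(0)
        votes.update(nearest.tolist())
    return [stock for stock, _ in votes.most_common(k)]

rng = np.random.default_rng(1)
selected = select_stocks(rng.random((10, 30, 3)), k=5)   # 10 days, 30 stocks
```

In the patent's setting the per-day window is 120 days and k = Q = 10; the frequency vote makes the final constituent list stable against day-to-day clustering noise.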
CN201911169339.3A 2019-11-26 2019-11-26 Enhanced index tracking method based on clustering and LSTM network Pending CN111028086A (en)


Publications (1)

Publication Number Publication Date
CN111028086A true CN111028086A (en) 2020-04-17


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950704A (en) * 2020-08-07 2020-11-17 Harbin Institute of Technology Atmospheric temperature data generation method based on merging long-time and short-time memory networks
CN111950704B (en) * 2020-08-07 2022-11-29 Harbin Institute of Technology Atmospheric temperature data generation method based on merging long-time and short-time memory networks
CN112884576A (en) * 2021-02-02 2021-06-01 Shanghai Kafang Information Technology Co., Ltd. Stock trading method based on reinforcement learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200417