CN110059852A

CN110059852A - A stock return prediction method based on an improved random forest algorithm

Info

Publication number: CN110059852A
Application number: CN201910180723.7A
Authority: CN
Inventors: 方昕; 陈玲玲
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2019-03-11
Filing date: 2019-03-11
Publication date: 2019-07-26

Abstract

The invention discloses a stock return prediction method based on an improved random forest algorithm. It addresses two weaknesses of random forest (RF) when used to classify and predict stock returns: its parameters are difficult to select, and the RF algorithm cannot by itself identify and select the most effective features. The invention uses particle swarm optimization (PSO) to drive feature selection, so that the optimal features are screened out even when the trend change is not yet obvious, and feeds them to the RF algorithm as attributes, giving a hybrid PSO-GRID-RF method for stock trend prediction. The invention shrinks the feature subset, eliminates irrelevant or redundant feature attributes, reduces the input dimension, and shortens the time needed for stock trend prediction. For a multi-attribute feature environment it provides an efficient feature selection method and introduces a grid search algorithm to optimize the random forest parameters, thereby improving the classification performance of the random forest and the accuracy of stock trend prediction.

Description

A Stock Return Prediction Method Based on Improved Random Forest Algorithm

Technical Field

The invention belongs to the technical field of financial data mining. To address the difficulty of parameter selection and the limited classification performance of random forests in the classification and prediction of stock returns, it proposes a new algorithm that combines feature selection based on particle swarm optimization with random forest parameter optimization based on grid search. Particle swarm optimization performs feature selection on the training set and removes redundant indicators from the indicator system to reduce the input dimension; grid search then optimizes the random forest parameters, improving the classification and prediction performance of the random forest.

Background

In the stock market, predicting stock price movements has always been a topic of great interest to investors. Accurately judging the trend of the overall market reduces blind investment, helps investors act more rationally, and can inform the formulation of national economic policy.

Scholars in China and abroad have studied stock price prediction in depth and proposed a variety of methods. Two approaches dominate: fundamental analysis and technical analysis. Fundamental analysis considers basic factors such as a company's growth and profitability. Technical analysis is a mathematical analysis of past stock data; in its simplest form, predictions are made by inspecting charts of stock movements, while more sophisticated forms use advanced statistical methods and machine learning algorithms.

Time series analysis was the first method applied to stock price prediction, for example by fitting an ARMA model to opening prices for short-term forecasting. Because stock prices are influenced by many factors and change nonlinearly, time series methods based on linear models capture these dynamics poorly; their prediction accuracy is low and their application is limited. With the rise of artificial intelligence, BP neural networks have been widely used in stock price prediction because of their strong nonlinear mapping ability. One example is SPPM, a stock price prediction model based on BP neural networks that builds multiple neural network models to predict stock prices. Neural networks have achieved good results in nonlinear stock prediction, but they suffer from unstable learning and memory, slow convergence, and a tendency to become trapped in local optima.

The random forest (RF) algorithm has been applied in finance as a classification technique, and RF obtains better results in stock trend prediction than support vector machines (SVM) and artificial neural networks (ANN). Random forest is an ensemble of models and has performed well in many fields. Because it trains quickly and generalizes well, applying it to the prediction of stock rises and falls avoids the shortcomings of the models above. Random forest prediction typically first screens an initial indicator system, then feeds the screened indicator data into the random forest as explanatory variables and outputs the rise or fall as the response variable. However, existing methods do little to optimize the random forest model itself and therefore cannot further improve prediction accuracy.

Summary of the Invention

To address the shortcomings of existing techniques for the classification and prediction of stock returns, the invention proposes a stock return prediction method based on an improved random forest algorithm.

A stock return prediction method based on an improved random forest algorithm specifically comprises the following steps:

Step 1: Data acquisition. Obtain daily stock data from a website.

Divide the data into a training set, a validation set, and a test set.

Step 2: Apply exponential smoothing to the acquired data:

S₀ = Y₀, t = 0 (1)

S_t = α*Y_t + (1-α)*S_{t-1}, t > 0 (2)

where S_t is the smoothed value at time t and Y_t is the actual value at time t;

S₀ is the smoothed value at t = 0, Y₀ is the actual value at t = 0, and t is the day index within the acquired daily data;

S_{t-1} is the smoothed value at time t-1, and α is the exponential smoothing factor, 0 < α < 1.

Exponential smoothing removes random variation, or noise, from the historical data, making it easier for the model to identify long-term price trends.
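Equations (1) and (2) translate directly into a short loop. The following is a minimal Python sketch; the function name `exponential_smoothing` and the sample prices are illustrative assumptions, not part of the patent.

```python
def exponential_smoothing(values, alpha):
    """Return the smoothed series S for the actual series Y (equations (1) and (2))."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    if not values:
        return []
    smoothed = [values[0]]  # S_0 = Y_0
    for y in values[1:]:
        # S_t = alpha * Y_t + (1 - alpha) * S_{t-1}
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed


closing_prices = [10.0, 20.0, 30.0]  # made-up prices for illustration
print(exponential_smoothing(closing_prices, alpha=0.5))  # prints [10.0, 15.0, 22.5]
```

A larger α keeps the smoothed series closer to the raw prices, while a smaller α gives more weight to history and therefore smooths more strongly.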

Step 3: Feature extraction. Compute technical indicators from the exponentially smoothed time series to form the feature matrix; the technical indicators that investors use to judge whether a stock will rise or fall serve as the features.

Step 4: Feature selection with the PSO algorithm:

The necessary influencing indicators must be determined as the model inputs and the necessary response variable as the model output. Because constructing the stock indicator system is the basis for the subsequent evaluation and comprehensive analysis, features are extracted from the technical indicators. The technical indicators are taken as the particles in the particle swarm algorithm. In PSO, the initial velocity and position of each particle are assigned randomly; the local optimum P_idbest is the best position of the particle at the current iteration, and the global optimum P_gdbest is the best position of the entire swarm. Assuming the search space has dimension D and the swarm has m particles, particle i has position x_i = [x_i1, x_i2, …, x_iD] and velocity v_i = [v_i1, v_i2, …, v_iD], i = 1…m. The update formulas are as follows:

Adjust the spatial position

where V^k is the velocity associated with the particle's local extremum in the k-th dimension and X^k is the particle's locally optimal position in the k-th dimension; P_idbest and P_gdbest denote the local and global optimal positions of the swarm at the k-th iteration; S(·) is the sigmoid function, which takes the velocity as its argument; adjusting the spatial position means mapping the particle velocity into [0, 1] and comparing it with a random number to update the particle's position state; c₁ and c₂ are the learning factors, both positive; w is the inertia weight; and rand₁, rand₂ ∈ [0, 1] are uniformly distributed random numbers;
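The update formulas themselves do not survive in this text. For orientation, a common formulation of binary PSO that is consistent with the symbols defined above is sketched below. It is an assumption based on the standard algorithm, not a reproduction of the patent's equations; the dimension index d and the uniform random number r are introduced here for illustration only.

```latex
\begin{aligned}
v_{id}^{k+1} &= w\,v_{id}^{k}
  + c_1\,\mathrm{rand}_1\,\bigl(P_{idbest}^{k} - x_{id}^{k}\bigr)
  + c_2\,\mathrm{rand}_2\,\bigl(P_{gdbest}^{k} - x_{id}^{k}\bigr) \\
S\bigl(v_{id}^{k+1}\bigr) &= \frac{1}{1 + e^{-v_{id}^{k+1}}} \\
x_{id}^{k+1} &=
  \begin{cases}
    1, & r_{id}^{k+1} < S\bigl(v_{id}^{k+1}\bigr) \\
    0, & \text{otherwise}
  \end{cases}
  \qquad r_{id}^{k+1} \sim U[0, 1]
\end{aligned}
```

Under this reading, each position component is a single bit, so a particle's position is directly a candidate feature mask.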

Step 5: Set the termination condition:

If the number of iterations exceeds the maximum number of iterations, or the fitness falls below the set value, exit the loop.

Step 6: Feature selection:

The binary code produced by the particle swarm feature selection in step 4 determines the input features used for trend prediction, where 1 means the feature is selected and 0 means it is not.

Step 7: Output the optimal features:

If the condition set in step 5 is satisfied, output the optimal features; otherwise return to step 4.

Step 8: Build the data matrix:

Build the data matrix from the optimal features selected in step 7.
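Steps 6 to 8 amount to applying a 0/1 mask to the columns of the feature matrix. A minimal sketch, assuming the matrix is stored as a list of rows; the name `select_features` and the example values are hypothetical.

```python
def select_features(rows, mask):
    """Keep only the columns whose mask bit is 1."""
    if any(len(row) != len(mask) for row in rows):
        raise ValueError("every row must have one value per mask bit")
    return [[value for value, bit in zip(row, mask) if bit == 1] for row in rows]


feature_matrix = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
]
best_mask = [1, 0, 0, 1]  # hypothetical PSO result: keep the first and last indicator
print(select_features(feature_matrix, best_mask))  # prints [[0.1, 0.4], [0.5, 0.8]]
```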

Step 9: Cross-validation on the training and validation sets:

Parameters are tuned by cross-validation over the training and validation sets, with 90% of the data used to train the model and 10% used to validate it. A grid search algorithm optimizes the random forest parameters, including the tree depth, the random state, the number of variables at each tree node, the number of trees, the OOB misclassification rate, and the variable importance estimate, to improve prediction accuracy. The result is a prediction model that fits the data well.
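One way to obtain a 90%/10% split in every round is 10-fold cross-validation. The sketch below generates the index sets for such folds. The text does not say whether samples are shuffled; this sketch assumes contiguous blocks that preserve time order, and the name `contiguous_folds` is hypothetical.

```python
def contiguous_folds(n_samples, n_folds=10):
    """Yield (train_indices, validation_indices) for each fold, keeping time order."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


for train_idx, val_idx in contiguous_folds(20, n_folds=10):
    print(len(train_idx), val_idx)
```

With 20 samples and 10 folds, every round trains on 18 samples and validates on 2, which matches the 90%/10% proportion described above.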

Step 10: Build the stock trading signals, i.e. the data labels:

The data matrix built in step 8 is used as training data and input to the random forest algorithm for training. The trading signals are Yj = {y1, y2, …, yj}, where j = 1, 2, …, n is the sample index. The trading signals are built as follows:

1) Compute the daily average price p_j

where C_j is the closing price, H_j is the highest price, and L_j is the lowest price.

2) Compute the arithmetic return V_j over the next k days, k = 1, 2, …, 10;

3) Build the trading signal y_j
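The formulas for p_j, V_j and y_j do not survive in this text, so the following Python sketch fills them with plainly labeled assumptions: the daily average price is taken as the mean of close, high and low; the arithmetic return is (p_{j+k} - p_j) / p_j for a single horizon k; and the signal is +1, 0 or -1 depending on a hypothetical `threshold`. None of these choices is confirmed by the source.

```python
def average_price(close, high, low):
    """Assumed daily average price: the mean of close, high and low."""
    return (close + high + low) / 3.0


def trading_signals(avg_prices, k, threshold):
    """Label each day +1, 0 or -1 from the k-day-ahead arithmetic return.

    Days without k future observations get no label.
    """
    signals = []
    for j in range(len(avg_prices) - k):
        future_return = (avg_prices[j + k] - avg_prices[j]) / avg_prices[j]
        if future_return > threshold:
            signals.append(1)    # price rises by more than the threshold
        elif future_return < -threshold:
            signals.append(-1)   # price falls by more than the threshold
        else:
            signals.append(0)    # move too small to count as a trend
    return signals


# Made-up (close, high, low) bars for illustration.
bars = [(10, 11, 9), (10, 12, 11), (12, 13, 11), (9, 10, 8), (9, 9, 9)]
prices = [average_price(c, h, l) for c, h, l in bars]
print(trading_signals(prices, k=1, threshold=0.05))  # prints [1, 1, -1, 0]
```

The three label values line up with the +1, 0 and -1 classes used in the confusion matrix of step 12.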

Step 11: In-sample training:

The data matrix built from the optimal features is input to the random forest model for training, and grid search is used to optimize the parameters on the new optimal-feature data set. The predictions are compared with the actual stock trend to obtain the predicted trend and the prediction accuracy.

Step 12: Model evaluation:

The classification results of the random forest algorithm can be represented by a confusion matrix, as shown in Table 1:

Table 1 Confusion matrix

              Predicted +1    Predicted 0    Predicted -1
Actual +1     TP              FZ1            FN1
Actual 0      FP1             TZ             FN2
Actual -1     FP2             FZ2            TN

where TP is the number of correctly classified +1 samples, TZ is the number of correctly classified 0 samples, and TN is the number of correctly classified -1 samples; FP1 is class 0 misclassified as +1 and FP2 is class -1 misclassified as +1; FZ1 is class +1 misclassified as 0 and FZ2 is class -1 misclassified as 0; FN1 is class +1 misclassified as -1 and FN2 is class 0 misclassified as -1. FP = FP1 + FP2, FN = FN1 + FN2, FZ = FZ1 + FZ2, and N = N_TP + N_TZ + N_TN + N_FP + N_FN + N_FZ is the total number of samples.

Accuracy is the proportion of test-set samples that are predicted correctly, Recall is the proportion of actual positive samples that are predicted correctly, and Precision is the proportion of samples predicted as positive that are actually positive. The formulas are as follows:

Recall = TP/(TP+FN) (10)

Precision = TP/(TP+FP) (11)

F is a comprehensive performance index formed as a weighted average of sensitivity (recall) and precision; the closer F is to 1, the better the classification result. The formula is as follows:
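The metrics can be computed from the nine cells of Table 1. The Python sketch below follows equations (10) and (11) and the aggregate definitions FP = FP1 + FP2 and FN = FN1 + FN2 as stated. The formula for F does not survive in this text, so the sketch assumes the common F1 form, the harmonic mean of precision and recall; the function name `evaluate` and the counts are made up.

```python
def evaluate(cm):
    """Compute metrics from a 3x3 confusion matrix.

    cm[i][j] counts samples of actual class i predicted as class j,
    with classes ordered +1, 0, -1 as in Table 1.
    """
    (tp, fz1, fn1), (fp1, tz, fn2), (fp2, fz2, tn) = cm
    fp = fp1 + fp2  # FP = FP1 + FP2
    fn = fn1 + fn2  # FN = FN1 + FN2
    total = sum(sum(row) for row in cm)
    accuracy = (tp + tz + tn) / total   # share of correctly predicted samples
    recall = tp / (tp + fn)             # equation (10)
    precision = tp / (tp + fp)          # equation (11)
    # Assumed F: harmonic mean of precision and recall (F1).
    f_score = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f": f_score}


# Made-up counts for illustration.
confusion = [
    [40, 5, 10],
    [6, 30, 4],
    [4, 5, 46],
]
print(evaluate(confusion))
```

The sketch does not guard against a matrix with TP = 0, where recall, precision and F are undefined.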

The metrics above are obtained from the confusion matrix. In addition, when a random forest is built, each training set is generated by the bootstrap method. Because this is sampling with replacement, only about 63% of the original samples appear in a given bootstrap sample; the remaining samples are never drawn and form the out-of-bag (OOB) data. Using the OOB data to estimate the generalization ability of the random forest is called OOB estimation. For a single tree, the accuracy measured on its OOB data is OOB_score and the corresponding error is the out-of-bag error OOB_error; averaging OOB_error over all trees gives the random forest's OOB'_error. The smaller OOB'_error is, the stronger the generalization ability of the RF. The fitness value Fitness combines F and OOB'_error, and smaller is better. The formulas are as follows:

OOB_error = 1 - OOB_score (13)

Fitness = OOB'_error + (1-F) (14)
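Equations (13) and (14), together with the bootstrap share of roughly 63%, can be checked with a few lines of Python. The function names and the per-tree scores below are illustrative assumptions.

```python
def forest_oob_error(oob_scores):
    """Average the per-tree OOB errors, each computed as 1 - OOB_score (equation (13))."""
    errors = [1.0 - score for score in oob_scores]
    return sum(errors) / len(errors)


def fitness(oob_scores, f_score):
    """Fitness = OOB'_error + (1 - F), equation (14); smaller is better."""
    return forest_oob_error(oob_scores) + (1.0 - f_score)


def in_bag_fraction(n_samples):
    """Expected share of distinct samples in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n_samples) ** n_samples


tree_oob_scores = [0.80, 0.70, 0.90]  # made-up per-tree OOB accuracies
print(round(fitness(tree_oob_scores, f_score=0.75), 4))
print(round(in_bag_fraction(1000), 3))
```

The in-bag share 1 - (1 - 1/n)^n tends to 1 - 1/e ≈ 0.632 as n grows, which is where the figure of about 63% comes from.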

Step 13: Out-of-sample testing:

After the optimal parameters have been determined, the trained random forest model is tested on the test data. All preprocessed sample features of the test set are input to the model to obtain the T+k prediction for each sample, which gives the classification result; comparing this with the actual stock trend yields the predicted trend and the prediction accuracy.

The beneficial effects of the invention are as follows:

(1) The invention provides an efficient feature selection method. Through the global search of the PSO algorithm, the best features are selected as input variables for the RF algorithm, which shrinks the feature subset, eliminates irrelevant or redundant feature attributes, shortens stock prediction time, and improves the accuracy of stock trend prediction.

(2) The invention applies cross-validation on the training set, takes the temporal correlation of stock prices into account, and improves the accuracy of the random forest classification model.

(3) The stock return is a stationary series. Using the return as the label reflects price trends better than using the closing price as the label, which improves the accuracy of stock trend prediction.

(4) The invention uses grid search to optimize the random forest parameters during training, which avoids the difficulty of selecting parameters for random forest prediction, selects the best parameters, and improves the accuracy of trend prediction.

Description of the Drawings

Figure 1 is a framework diagram of the stock return study using the improved random forest;

Figure 2 is an undirected graph of random forest classification;

Figure 3 is a schematic diagram of the binary encoding;

Figure 4 is a flowchart of the PSO algorithm;

Figure 5 is a flowchart of the random forest voting classification method in stock trend prediction.

Detailed Description

The invention is further described below with reference to the drawings and embodiments.

As shown in Figures 1 to 5, the invention is a stock trend prediction method for studying stock returns, based on the particle swarm algorithm, the grid search algorithm, and the random forest algorithm.

The invention proposes a stock return study method built on efficient feature selection in a multi-attribute feature environment. Because the attribute features are numerous and all are continuous variables, the CART algorithm is used to generate the decision trees. The specific formula is as follows:
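The formula itself does not survive in this text. As a point of reference, CART classification trees are commonly grown by choosing the split that minimizes the Gini index; the expressions below state that standard criterion as an assumption, not as the patent's own formula. Here D is the set of samples at a node, p_c is the share of class c in D, and D_1 and D_2 are the two subsets produced by splitting on a threshold of feature A.

```latex
\mathrm{Gini}(D) = 1 - \sum_{c} p_c^{2},
\qquad
\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)
```

For continuous features such as technical indicators, candidate thresholds are typically taken between adjacent sorted values of A, and the threshold with the smallest Gini(D, A) is kept.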

The flowchart of the invention is shown in Figure 1, and the specific steps are as follows:

(1): Data acquisition:

Daily stock data are obtained from websites. The stock data used in the invention come from websites such as Yahoo and include the opening price, closing price, trading volume, highest price, and lowest price. The data are downloaded as CSV files and divided into a training set, a validation set, and a test set.
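A minimal Python sketch of this step follows. The column names mimic a Yahoo-style CSV export and are assumptions, as are the 60/20/20 split proportions and the function names; the patent does not state the split ratio.

```python
import csv
import io

# Made-up sample rows; the column names are assumed, not taken from the patent.
SAMPLE_CSV = """Date,Open,High,Low,Close,Volume
2019-01-02,10.0,10.5,9.8,10.2,1000
2019-01-03,10.2,10.9,10.1,10.8,1200
2019-01-04,10.8,11.0,10.4,10.5,900
2019-01-07,10.5,10.7,10.0,10.1,1100
2019-01-08,10.1,10.6,10.0,10.4,950
"""


def load_daily_bars(csv_file):
    """Parse daily bars into dicts with numeric price and volume fields."""
    bars = []
    for row in csv.DictReader(csv_file):
        bars.append({
            "date": row["Date"],
            "open": float(row["Open"]),
            "high": float(row["High"]),
            "low": float(row["Low"]),
            "close": float(row["Close"]),
            "volume": int(row["Volume"]),
        })
    return bars


def chronological_split(bars, train_pct=60, val_pct=20):
    """Split bars, oldest first, into training, validation and test sets."""
    ordered = sorted(bars, key=lambda bar: bar["date"])  # ISO dates sort correctly
    n_train = len(ordered) * train_pct // 100
    n_val = len(ordered) * val_pct // 100
    return (ordered[:n_train],
            ordered[n_train:n_train + n_val],
            ordered[n_train + n_val:])


train, val, test = chronological_split(load_daily_bars(io.StringIO(SAMPLE_CSV)))
print(len(train), len(val), len(test))  # prints 3 1 1
```

Splitting by date rather than at random keeps the test set strictly later than the training data, which avoids evaluating the model on days that precede its training period.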

(2): Apply exponential smoothing to the acquired data:

S₀ = Y₀, t = 0 (3)

S_t = α*Y_t + (1-α)*S_{t-1}, t > 0 (4)

where S_t is the smoothed value at time t and Y_t is the actual value at time t;

(3): Feature extraction:

Technical indicators are computed from the exponentially smoothed results. The indicators should cover all aspects of market behavior; their values and the relationships among them directly reflect the state of the stock market and guide trading decisions. Most of what the indicators reveal cannot be seen directly in the market quotation tables. The technical indicators form the feature matrix computed from the smoothed time series, and the indicators that investors accept for judging whether a stock will rise or fall serve as the features.
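The text does not name the technical indicators it uses. Purely as an illustration, the Python sketch below builds a small feature matrix from two widely used indicators, the n-day rate of change and the price relative to its n-day simple moving average. The indicator choice, the window length and all function names are assumptions.

```python
def rate_of_change(series, n):
    """n-day rate of change; None where fewer than n past values exist."""
    return [None if t < n else (series[t] - series[t - n]) / series[t - n]
            for t in range(len(series))]


def moving_average(series, n):
    """Trailing n-day simple moving average; None during the warm-up."""
    return [None if t < n - 1 else sum(series[t - n + 1:t + 1]) / n
            for t in range(len(series))]


def feature_matrix(smoothed, n=2):
    """One row per day with complete indicators: [ROC_n, price / SMA_n - 1]."""
    roc = rate_of_change(smoothed, n)
    sma = moving_average(smoothed, n)
    rows = []
    for t in range(len(smoothed)):
        if roc[t] is None or sma[t] is None:
            continue  # skip warm-up days without a full window
        rows.append([roc[t], smoothed[t] / sma[t] - 1.0])
    return rows


smoothed_close = [10.0, 11.0, 12.0, 12.0]  # made-up smoothed prices
for row in feature_matrix(smoothed_close):
    print(row)
```

Each column of the resulting matrix is one candidate feature, so the PSO mask in the next step has one bit per column.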

(4): Feature selection with the PSO algorithm:

Figure 4 is the flowchart of the PSO algorithm and shows the steps by which the particle swarm performs feature selection. In PSO, each candidate solution of the optimization problem is represented as a point in a d-dimensional space, called a particle. The quality of a particle's current position is evaluated by an objective function, which computes the corresponding fitness from the position. Each particle flies through the search space at a velocity that is adjusted dynamically according to its own flight experience and that of its companions, and this velocity is then used to compute the particle's new position. The search proceeds iteratively over a population of randomly initialized particles until a termination condition is met, for example reaching a specified number of iterations.

In PSO, the initial velocity and position of each particle are assigned randomly; the local optimum P_idbest is the best position of the particle at the current iteration, and the global optimum P_gdbest is the best position of the entire swarm. Assuming the search space has dimension D and the swarm has m particles, particle i has position x_i = [x_i1, x_i2, …, x_iD] and velocity v_i = [v_i1, v_i2, …, v_iD], i = 1…m. The update formulas are as follows:

Adjust the spatial position

where V^k is the velocity associated with the particle's local extremum in the k-th dimension and X^k is the particle's locally optimal position in the k-th dimension; P_idbest and P_gdbest denote the local and global optimal positions of the swarm at the k-th iteration; S(·) is the sigmoid function, which takes the velocity as its argument; adjusting the spatial position means mapping the particle velocity into [0, 1] and comparing it with a random number to update the particle's position state; c₁ and c₂ are the learning factors, both positive; w is the inertia weight; and rand₁, rand₂ ∈ [0, 1] are uniformly distributed random numbers;

(5): Set the termination condition:

The condition check is shown in Figure 4: if the number of iterations exceeds the maximum number of iterations, or the fitness falls below the set value, exit the loop.

(6): Feature selection:

The binary code defines whether each feature is selected as an input feature for trend prediction, as shown in Figure 3, where 1 means selected and 0 means not selected.

(7): Output the optimal features:

If the condition set in (5) is satisfied, output the optimal features; otherwise return to (4).
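Steps (4) to (7) together form a loop, which the following Python sketch illustrates with a standard binary PSO. It is a sketch under stated assumptions rather than the patent's implementation: the velocity and position updates use the textbook binary PSO form, the parameter values are arbitrary, the loop stops at `max_iter` or when the best fitness is at or below `target_fitness`, and `toy_fitness` is a made-up stand-in for the real fitness (random forest OOB error plus 1 - F).

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=10, max_iter=50,
               target_fitness=0.0, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise `fitness` over 0/1 masks of length n_features."""
    rng = random.Random(seed)
    # Random initial positions (bit masks) and velocities.
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[rng.uniform(-1.0, 1.0) for _ in range(n_features)] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(max_iter):
        if gbest_fit <= target_fitness:
            break  # fitness threshold reached before the iteration limit
        for i in range(n_particles):
            for d in range(n_features):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Sigmoid maps the velocity to the probability of setting the bit.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Toy stand-in fitness: the share of bits that differ from a made-up "ideal" mask.
IDEAL = [1, 0, 1, 1, 0, 0, 1, 0]


def toy_fitness(mask):
    return sum(a != b for a, b in zip(mask, IDEAL)) / len(IDEAL)


best_mask, best_fit = binary_pso(toy_fitness, n_features=len(IDEAL))
print(best_mask, best_fit)
```

In the method described here, evaluating the fitness of a mask would mean training a random forest on the selected columns, so each PSO iteration is far more expensive than in this toy example.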

(8): Build the data matrix:

Build the data matrix from the optimal features selected in (7).

(9): Combine the training set and the cross-validation set:

Parameters are tuned by cross-validation on the training set, with 90% of the data used to train the model and 10% used to validate it. A grid search algorithm optimizes the random forest parameters, including the tree depth, the random state, the number of variables at each tree node, the number of trees, the OOB misclassification rate, and the variable importance estimate, to improve prediction accuracy. The result is a prediction model that fits the data well.

(10): Build the stock trading signals:

The data matrix built in (8) is used as training data and input to the random forest algorithm for training. The trading signals are Yj = {y1, y2, …, yj}, where j = 1, 2, …, n is the sample index. The trading signals are built as follows:

1) Compute the daily average price p_j

3) Build the trading signal y_j

(11): Classification prediction with random forest:

In everyday life, our understanding of things rests on judging and classifying them by their features; for example, whether an animal is a mammal can be judged by whether it bears live young. Random forest adopts the same idea; Figure 2 shows the undirected graph of random forest classification. At each node of a tree, the node is split into child nodes of the next layer by some rule based on a feature, and the terminal leaf nodes give the final classification result. The key to random forest learning is choosing the optimal splitting attribute. As the splitting proceeds layer by layer, the samples contained in each branch node become increasingly homogeneous in class; that is, each split should maximize the information gain.

Random forest builds each decision tree in two steps. The first step, "row sampling", samples with replacement from all training samples to obtain a bootstrap data set. The second step, "column sampling", randomly selects m of the M features (m < M) and trains a decision tree on the bootstrap data set restricted to those m features. For classification, the final class is the class (or one of the classes) that receives the most votes from the N decision trees, as shown in the flowchart of the random forest voting classification method in Figure 5. Building the model in this way reduces the risk of overfitting. Although each tree in the forest splits on only m features and is not an outstanding classifier on its own, the combination is more stable. Each decision tree can be thought of as an expert in a narrow field (m features chosen from M for that tree to learn), and the random forest as a panel of experts in different fields: a new problem (a new data set) is examined from different angles and the result is decided by vote.
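The two sampling steps and the final vote can be sketched in plain Python as follows. The sketch shows only the sampling and voting mechanics, not the training of the trees themselves; the function names, the seed and the example votes are assumptions.

```python
import random
from collections import Counter


def bootstrap_rows(n_samples, rng):
    """Row sampling: draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def sample_features(n_features, m, rng):
    """Column sampling: pick m distinct feature indices out of n_features."""
    return sorted(rng.sample(range(n_features), m))


def majority_vote(votes):
    """Return the class with the most votes (ties go to the class seen first)."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(42)
rows = bootstrap_rows(n_samples=8, rng=rng)
cols = sample_features(n_features=6, m=2, rng=rng)
out_of_bag = sorted(set(range(8)) - set(rows))  # rows never drawn for this tree
print(rows, cols, out_of_bag)

# Made-up predictions from five trees for one sample.
print(majority_vote([1, -1, 1, 0, 1]))  # prints 1
```

The rows left out of a tree's bootstrap sample are exactly the out-of-bag data used for the OOB estimate.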

(12): Principle of the grid search algorithm:

Grid search is an exhaustive search over specified parameter values: every combination of values is used to train a random forest, and cross-validation is used to evaluate its performance. After the fitting function has tried all parameter combinations, it returns a classifier set to the best parameter combination.
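The exhaustive search can be sketched in a few lines of Python. The parameter names in `grid` are placeholders for the tree count, tree depth and variables per node, and `toy_score` stands in for a cross-validated accuracy; both are assumptions made so that the example runs without training a model.

```python
from itertools import product


def grid_search(param_grid, score):
    """Try every parameter combination and return (best_params, best_score).

    `score` maps a parameter dict to a number where higher is better,
    for example a mean cross-validated accuracy.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Hypothetical grid of candidate values.
grid = {
    "n_trees": [50, 100, 200],
    "max_depth": [3, 5, 8],
    "vars_per_node": [2, 4],
}


def toy_score(params):
    """Stand-in for cross-validated accuracy; peaks at (100, 5, 4)."""
    return -(abs(params["n_trees"] - 100) / 100
             + abs(params["max_depth"] - 5) / 5
             + abs(params["vars_per_node"] - 4) / 4)


print(grid_search(grid, toy_score))
```

The number of model fits grows as the product of the list lengths (here 3 × 3 × 2 = 18 combinations, each multiplied by the number of cross-validation folds), which is why the grid is usually kept coarse. Libraries such as scikit-learn provide this search ready-made through `GridSearchCV`.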

(13): In-sample training:

(14): Out-of-sample testing:

Claims

1. A stock return prediction method based on an improved random forest algorithm, characterized in that it comprises the following steps:

Step 1: Data acquisition: obtain daily stock data from a website;

Divide the data into a training set, a validation set, and a test set;

Step 2: Apply exponential smoothing to the acquired data:

S₀ = Y₀, t = 0 (1)

S_t = α*Y_t + (1-α)*S_{t-1}, t > 0 (2)

where S_t is the smoothed value at time t and Y_t is the actual value at time t;

S₀ is the smoothed value at t = 0, Y₀ is the actual value at t = 0, and t is the day index within the acquired daily data; S_{t-1} is the smoothed value at time t-1, and α is the exponential smoothing factor, 0 < α < 1;

Step 3: Feature extraction:

Compute technical indicators from the exponentially smoothed time series to form the feature matrix, and use the technical indicators that investors use to judge stock trends as the features;

Step 4: Feature selection with the PSO algorithm:

The technical indicators are taken as the particles in the particle swarm algorithm. In PSO, the initial velocity and position of each particle are assigned randomly; the local optimum P_idbest is the best position of the particle at the current iteration, and the global optimum P_gdbest is the best position of the entire swarm. Assuming the search space has dimension D and the swarm has m particles, particle i has position x_i = [x_i1, x_i2, …, x_iD] and velocity v_i = [v_i1, v_i2, …, v_iD], i = 1…m. The update formulas are as follows:

Adjust the spatial position

where V^k is the velocity associated with the particle's local extremum in the k-th dimension and X^k is the particle's locally optimal position in the k-th dimension; P_idbest and P_gdbest denote the local and global optimal positions of the swarm at the k-th iteration; S(·) is the sigmoid function, which takes the velocity as its argument; adjusting the spatial position means mapping the particle velocity into [0, 1] and comparing it with a random number to update the particle's position state; c₁ and c₂ are the learning factors, both positive; w is the inertia weight; and rand₁, rand₂ ∈ [0, 1] are uniformly distributed random numbers;

Step 5: Set the judgment conditions:

If the number of iterations exceeds the maximum number of iterations and the fitness is lower than the set value, exit the loop;

Step 6: Feature Selection:

The binary code obtained by the particle swarm feature selection in step 4 determines the input features for trend prediction, where 1 means that a feature is selected and 0 means that it is not selected;

Step 7: Output optimal features:

If the condition set in step 5 is satisfied, output the optimal features; otherwise return to step 4;
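Steps 4 to 7 can be sketched as a single loop. The code below is a minimal binary PSO under stated assumptions: the update rule is the common sigmoid-based form, the loop stops when either the iteration budget is used up or the fitness reaches the target (one possible reading of step 5), and the velocity clamp `v_max`, the default parameter values, and `toy_fitness` are hypothetical. In the method described here the fitness would come from the random forest evaluation rather than from a toy function.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def binary_pso(fitness, dim, n_particles=10, max_iter=30, target=0.0,
               w=0.7, c1=1.5, c2=1.5, v_max=6.0, seed=0):
    """Minimise fitness(mask) over 0/1 masks of length dim.

    Returns (best_mask, best_fitness); 1 = feature selected, 0 = not selected.
    """
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(max_iter):          # iteration budget (step 5)
        if gbest_fit <= target:        # fitness target reached (step 5)
            break
        for i in range(n_particles):
            for d in range(dim):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))  # assumed clamp
                # Map velocity to [0, 1] and compare with a random number.
                pos[i][d] = 1 if rng.random() < sigmoid(vel[i][d]) else 0
            f = fitness(pos[i])
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit            # optimal features (step 7)


# Hypothetical stand-in fitness: number of positions that differ from a
# made-up "relevant feature" mask. Smaller is better.
relevant = [1, 0, 1, 0, 0, 1]


def toy_fitness(mask):
    return sum(a != b for a, b in zip(mask, relevant))


mask, fit = binary_pso(toy_fitness, dim=len(relevant))
```

The returned mask is the binary code of step 6: the columns whose entry is 1 are kept when the data matrix is built.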

Step 8: Build the data matrix:

From the optimal features output in step 7, construct the data matrix that is input to the random forest;

Step 9: Cross-validation on training set and validation set:

To improve the prediction accuracy of the random forest, the model is tuned by cross-validation on the training set and the validation set. The depth of the trees, the random state, the number of variables considered at each tree node, and the number of trees are adjusted, with the OOB misclassification rate and the variable importance estimates used as guidance, so as to obtain a prediction model that fits the data better and predicts with higher accuracy;
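The cross-validation in this step can be illustrated with a small fold splitter. The sketch below assumes contiguous, unshuffled folds, which is one possible choice for time-ordered stock data and is not stated in the text; `fit_and_score` is a hypothetical callback that trains a model on the training indices and returns its accuracy on the validation indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def cross_val_score(fit_and_score, n_samples, k):
    """Average the validation score returned by the callback over all folds."""
    scores = [fit_and_score(train, val)
              for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

A parameter setting such as a given tree depth and number of trees would then be judged by its averaged validation score.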

Step 10: Establishment of stock trading signals, namely data labels:

Use the data matrix constructed in step 8 as training data, input it into the random forest algorithm for training, and construct the trading signal Y_j={y_1,y_2,...,y_j}, where j=1,2,...,n is the sample number. The trading signal is constructed as follows:

1) Calculate the daily average price p_j

where C_j is the closing price of the stock, H_j is the highest price, and L_j is the lowest price;

2) Calculate the arithmetic return V_j over the next k days, k=1,2,...,10;

3) Construct the trading signal y_j
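The three sub-steps can be sketched as follows. The sketch assumes that the daily average price is the mean of the closing, highest, and lowest prices, that the arithmetic return is (p_{j+k} - p_j) / p_j, and that the signal takes the values +1, 0, and -1 according to a symmetric threshold. The threshold rule, its value, and the sample bars are hypothetical.

```python
def average_price(close, high, low):
    """Daily average price p_j, assumed to be (C_j + H_j + L_j) / 3."""
    return (close + high + low) / 3.0


def forward_return(prices, j, k):
    """Arithmetic return V_j over the next k days (assumed form)."""
    return (prices[j + k] - prices[j]) / prices[j]


def trading_signal(ret, threshold):
    """Hypothetical labelling rule: +1 for a rise, -1 for a fall, 0 otherwise."""
    if ret > threshold:
        return 1
    if ret < -threshold:
        return -1
    return 0


def build_signals(bars, k, threshold):
    """bars is a list of (close, high, low); the last k days get no label."""
    prices = [average_price(c, h, l) for c, h, l in bars]
    return [trading_signal(forward_return(prices, j, k), threshold)
            for j in range(len(prices) - k)]


# Made-up bars: (close, high, low) for four trading days.
bars = [(10.0, 11.0, 9.0), (11.0, 12.0, 10.0),
        (11.0, 12.0, 10.0), (9.0, 10.0, 8.0)]
signals = build_signals(bars, k=1, threshold=0.01)
```

The last k samples have no future price inside the data window, so this sketch leaves them unlabelled.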

Step 11: In-Sample Training:

Input the data matrix built from the optimal features into the random forest algorithm model for training, use the grid search algorithm to optimize the parameters on the new optimal-feature data set, and compare the predictions with the actual stock trend to obtain the predicted stock trend and the prediction accuracy;
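The grid search in this step can be sketched as an exhaustive scan over all parameter combinations. The parameter names `n_trees` and `max_depth`, the grid values, and `toy_score` are hypothetical; in practice the score would be the validation accuracy of a random forest trained with the given parameters.

```python
from itertools import product


def grid_search(param_grid, score):
    """Evaluate every combination in param_grid and keep the best score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score


# Hypothetical stand-in for "train a random forest and return its accuracy".
def toy_score(params):
    return (1.0
            - abs(params["n_trees"] - 100) / 100
            - abs(params["max_depth"] - 5) / 10)


grid = {"n_trees": [50, 100, 200], "max_depth": [3, 5, 7]}
best_params, best_score = grid_search(grid, toy_score)
```

The cost grows with the product of the grid sizes, which is one reason for reducing the feature set before tuning.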

Step 12: Model Evaluation:

The classification prediction results of the random forest algorithm can be represented by a confusion matrix, as shown in Table 1 below:

Table 1 Confusion Matrix

             Predicted +1   Predicted 0   Predicted -1
True +1      TP             FZ1           FN1
True 0       FP1            TZ            FN2
True -1      FP2            FZ2           TN

Here TP is the number of +1-class samples classified correctly, TZ is the number of 0-class samples classified correctly, and TN is the number of -1-class samples classified correctly. FP1 is the number of 0-class samples misclassified as +1, FP2 is the number of -1-class samples misclassified as +1, FZ1 is the number of +1-class samples misclassified as 0, FZ2 is the number of -1-class samples misclassified as 0, FN1 is the number of +1-class samples misclassified as -1, and FN2 is the number of 0-class samples misclassified as -1. FP is FP1+FP2, FN is FN1+FN2, FZ is FZ1+FZ2, and N=TP+TZ+TN+FP+FN+FZ represents the total number of samples;
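A confusion matrix with the layout of Table 1 can be built from the true and predicted labels as in the following sketch. The label order (+1, 0, -1) follows the table; the sample labels are made up for illustration.

```python
LABELS = (1, 0, -1)  # row/column order of Table 1


def confusion_matrix(y_true, y_pred):
    """Return a 3x3 matrix: rows are true classes, columns are predictions."""
    index = {label: i for i, label in enumerate(LABELS)}
    matrix = [[0] * len(LABELS) for _ in LABELS]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Made-up labels for seven samples.
y_true = [1, 1, 0, 0, -1, -1, 1]
y_pred = [1, 0, 0, 1, -1, 0, -1]
table = confusion_matrix(y_true, y_pred)
```

With these inputs the first row holds TP, FZ1, and FN1, the second row FP1, TZ, and FN2, and the third row FP2, FZ2, and TN.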

Accuracy is the proportion of test-set samples that are predicted correctly, Recall is the proportion of positive-class samples that are predicted correctly, and Precision is the proportion of samples predicted as the positive class that actually belong to it. The calculation formulas are as follows:

Accuracy=(TP+TZ+TN)/N (10)

Recall=TP/(TP+FN) (11)

Precision=TP/(TP+FP) (12)
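The following sketch evaluates these measures on a 3x3 confusion matrix laid out as in Table 1, using FP=FP1+FP2 and FN=FN1+FN2 as defined above. Accuracy is taken here as the share of samples on the diagonal, which is an assumption consistent with its verbal definition; the matrix values are made up.

```python
def classification_metrics(matrix):
    """Return (accuracy, recall, precision) for a Table 1 style matrix."""
    (tp, fz1, fn1), (fp1, tz, fn2), (fp2, fz2, tn) = matrix
    fp = fp1 + fp2
    fn = fn1 + fn2
    n = sum(sum(row) for row in matrix)
    accuracy = (tp + tz + tn) / n              # assumed: diagonal share
    recall = tp / (tp + fn) if tp + fn else 0.0        # formula (11)
    precision = tp / (tp + fp) if tp + fp else 0.0     # formula (12)
    return accuracy, recall, precision


# Made-up confusion matrix with 30 samples.
example = [[8, 1, 1],
           [2, 6, 2],
           [0, 1, 9]]
accuracy, recall, precision = classification_metrics(example)
```

The guards against a zero denominator are an addition of this sketch, so that an empty positive class yields 0.0 instead of an error.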

F is a comprehensive performance index formed as a weighted average of Recall (sensitivity) and Precision. The closer the F value is to 1, the better the classification result. The formula is as follows:
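Assuming that Recall and Precision are weighted equally, which is the common F1 form and an assumption here rather than a confirmed detail of the method, the index can be written as:

```latex
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```

With this form F equals 1 only when both Precision and Recall equal 1, in line with the statement that values closer to 1 indicate better classification.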

The above measures are obtained from the confusion matrix. In addition, during random forest generation each training set is produced by the bootstrap method. Because sampling is repeated with replacement, each bootstrap sample contains only about 63% of the original data, some of it drawn repeatedly, and the remaining data do not appear in it; these remaining data are the out-of-bag (OOB) data. The out-of-bag data are used to estimate the generalization ability of the random forest algorithm, which is called OOB estimation. Taking a single tree as the unit, the accuracy measured on its OOB data is OOB_score, the corresponding error is the out-of-bag error OOB_error, and the average OOB_error over all trees is the OOB'_error of the random forest. The smaller the OOB'_error, the stronger the generalization ability of RF. The fitness value Fitness is composed of F and OOB'_error; the smaller the value, the better. The formulas are as follows:

OOB_error=1-OOB_score (14)

Fitness=OOB'_error+(1-F) (15)
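Formulas (14) and (15) and the roughly 63% in-bag share can be illustrated with the sketch below. For n draws with replacement from n samples, the expected share of distinct samples is 1-(1-1/n)^n, which approaches 1-1/e ≈ 0.632 for large n. The function names and the sample scores are hypothetical.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    oob = [i for i in range(n_samples) if i not in chosen]
    return in_bag, oob


def forest_fitness(oob_scores, f_value):
    """Fitness = mean per-tree OOB error + (1 - F); smaller is better."""
    oob_errors = [1.0 - s for s in oob_scores]        # formula (14)
    mean_error = sum(oob_errors) / len(oob_errors)    # OOB'_error
    return mean_error + (1.0 - f_value)               # formula (15)


rng = random.Random(0)
in_bag, oob = bootstrap_split(10000, rng)
in_bag_share = len(set(in_bag)) / 10000

# Made-up per-tree OOB scores and F value.
fitness = forest_fitness([0.9, 0.8], f_value=0.75)
```

A particle whose feature subset yields both a low OOB'_error and a high F therefore receives a low Fitness and is preferred by the PSO search.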

Step 13: Out-of-sample testing:

After the optimal parameters have been determined, the trained random forest model is tested on the test data. All preprocessed sample features of the test set are used as the input of the model to obtain the T+k predicted value of each sample, which gives the classification results; these are compared with the actual stock trend to obtain the predicted stock trend and the prediction accuracy.