CN110210973A

CN110210973A - Insider trading identification method based on random forest and naive Bayes model

Info

Publication number: CN110210973A
Application number: CN201910472108.3A
Authority: CN
Inventors: 邓尚昆; 王晨光; 曹成航
Original assignee: China Three Gorges University CTGU
Current assignee: China Three Gorges University CTGU
Priority date: 2019-05-31
Filing date: 2019-05-31
Publication date: 2019-09-06

Abstract

The invention discloses an insider trading identification method based on random forest and a naive Bayes model. The method obtains insider trading sample data sets under different time windows, screens and constructs a feature indicator set using a random forest model, builds a Bayesian identification model of insider trading from the selected feature indicator set, and uses that model to identify whether insider trading has occurred. Subsequent supervision verifies whether the identification result is correct, and the Bayesian identification model is retrained and updated according to the result. The invention establishes a stock insider trading identification model that accurately identifies whether a test target has engaged in insider trading. By combining the quasi-Newton method with a genetic algorithm, the parameters of the random forest model are optimized quickly and accurately to the optimal solution, and the solution depends only weakly on the initial value. The invention is easy to implement and stable in performance, and its robustness and accuracy can improve further as sample data increases.

Description

Insider trading identification method based on random forest and naive Bayes model

Technical field

The invention belongs to the field of securities market regulation, and in particular relates to an insider trading identification method based on random forest and a naive Bayes model.

Background art

Insider trading in the securities market seriously disrupts the normal operation of the stock market and particularly harms the interests of small and medium investors. In recent years, how to identify insider trading has become a focus of academic attention.

Compared with other classifiers, the naive Bayes classifier is one of the better classifiers in terms of learning efficiency and classification quality. Its advantages include simple algorithmic logic, ease of implementation, and stable performance. Naive Bayes classification is based on probabilistic inference: given the known probabilities with which each condition occurs, it completes reasoning and decision tasks.

A genetic algorithm is a search-based optimization algorithm that mainly simulates the mechanisms of biological heredity and natural selection. It has strong global search capability and a fast initial search speed, which is an advantage for optimization problems with complex, changing states. However, its search slows in the later stages, and the result often fails to reach the desired precision.

A random forest is a classifier comprising multiple decision trees, whose output class is determined jointly from the outputs of the individual trees. It avoids the problem of selecting a large number of parameters, is highly robust, can assess feature importance, is not prone to overfitting, and can process large amounts of data quickly and accurately.

It is therefore worthwhile to improve the genetic algorithm and to study an insider trading identification method that combines the random forest algorithm with a naive Bayes classifier.

Summary of the invention

The technical problem addressed by the invention is that insider trading in the securities market seriously disrupts the normal operation of the stock market, harms investors' interests, and is difficult to identify. To address this problem, the invention provides an insider trading identification method based on random forest and a naive Bayes model. After the parameters of the random forest model are optimized by combining the quasi-Newton method with a genetic algorithm, the random forest model is used to screen and construct an optimal feature indicator set, and a naive Bayes model is then built to quickly identify whether insider trading has occurred in a stock on the securities market.

The technical solution of the invention is an insider trading identification method based on random forest and a naive Bayes model, with the following specific steps:

Step 1: Obtain insider trading sample data sets under different event time windows;

Step 2: Screen and construct a feature indicator set using a random forest model;

Step 3: Build a Bayesian identification model of insider trading from the selected feature indicator set;

Step 4: Obtain a test target and construct a test target data set;

Step 5: Use the Bayesian identification model to identify insider trading and obtain a result indicating whether insider trading occurred;

Step 6: Verify through subsequent supervision whether the insider trading identification result is correct;

Step 6.1: If subsequent supervision verifies that the identification result is correct, go to step 8;

Step 6.2: If subsequent supervision verifies that the identification result is incorrect, go to step 7;

Step 7: Add the test target data set and the result of whether insider trading occurred to the sample data set, and retrain to update the Bayesian identification model;

Step 8: Determine whether there is a next test target;

Step 8.1: If there is a next test target, go to step 4;

Step 8.2: If there is no next test target, end.
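The control flow of steps 4 to 8 can be sketched in Python as follows. This is only an illustrative sketch: the callables `classify`, `verify`, and `retrain` are hypothetical stand-ins for the Bayesian identification model, the subsequent supervisory verification, and the retraining procedure, and the sketch assumes that the label added in step 7 is the one confirmed by supervision.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Sample = Sequence[float]
Labeled = Tuple[Sample, int]          # (feature vector, 1 = insider trading, 0 = none)
Classifier = Callable[[Sample], int]


def run_identification_loop(
    targets: Iterable[Sample],
    training_set: List[Labeled],
    classify: Classifier,
    verify: Callable[[Sample], int],
    retrain: Callable[[List[Labeled]], Classifier],
) -> List[int]:
    """Steps 4-8: classify each target and retrain when supervision disagrees."""
    predictions: List[int] = []
    for x in targets:                     # step 4, and step 8.1 for later targets
        predicted = classify(x)           # step 5
        predictions.append(predicted)
        actual = verify(x)                # step 6 (hypothetical supervisory check)
        if predicted != actual:           # step 6.2 leads to step 7
            # Assumption: the verified label is the one stored for retraining.
            training_set.append((x, actual))
            classify = retrain(training_set)
    return predictions                    # step 8.2: no further targets
```

In this arrangement the model is only retrained after a misidentification, which matches the branch from step 6.2 to step 7; correctly identified targets leave the model unchanged.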

In step 2, the parameters of the random forest algorithm are determined by an optimization method that combines the quasi-Newton method with a genetic algorithm. This optimization method comprises the following steps:

Step 1: Determine the population size N, the crossover probability p_c, the mutation probability p_s, and the target fitness threshold η of the genetic algorithm;

Step 2: Perform a genetic algorithm iteration to obtain the next-generation population;

Step 3: Compute the average fitness η' of the current population;

Step 4: Determine whether η' < η;

Step 4.1: If η' < η, go to step 5;

Step 4.2: If η' < η does not hold, go to step 2;

Step 5: Record the optimal chromosome value;

Step 6: Use the iteration result obtained by the genetic algorithm as the initial value of the quasi-Newton iteration;

Step 7: Perform the quasi-Newton iteration;

Step 8: Determine whether the preset precision has been reached;

Step 8.1: If the preset precision has not been reached, go to step 7;

Step 8.2: If the preset precision has been reached, end.
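The two-stage procedure above can be sketched in Python as follows. This is a hedged illustration on a continuous toy objective, not the patented implementation: the selection scheme (binary tournament with elitism), the blend crossover, the Gaussian mutation, the `max_generations` safety cap, and the `refine` callable that stands for the quasi-Newton stage are all assumptions. Fitness is treated as "smaller is better", in line with the stopping test η' < η.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def genetic_stage(
    fitness: Callable[[Vector], float],      # smaller is better
    bounds: Sequence[Tuple[float, float]],
    pop_size: int = 30,                      # N
    p_c: float = 0.8,                        # crossover probability
    p_s: float = 0.1,                        # mutation probability
    eta: float = 0.5,                        # target mean-fitness threshold
    max_generations: int = 500,              # safety cap (assumption)
    seed: int = 0,
) -> Vector:
    """Steps 1-5: evolve until the mean fitness drops below eta."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(max_generations):
        nxt = [best[:]]                      # elitism: keep the best chromosome
        while len(nxt) < pop_size:           # step 2: build the next generation
            a = min(rng.sample(pop, 2), key=fitness)
            b = min(rng.sample(pop, 2), key=fitness)
            child = a[:]
            if rng.random() < p_c:           # blend crossover
                w = rng.random()
                child = [w * u + (1 - w) * v for u, v in zip(a, b)]
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < p_s:       # Gaussian mutation, clamped to bounds
                    step = rng.gauss(0.0, 0.1 * (hi - lo))
                    child[i] = min(hi, max(lo, child[i] + step))
            nxt.append(child)
        pop = nxt
        best = min(pop, key=fitness)         # step 5: track the best chromosome
        mean_fit = sum(fitness(c) for c in pop) / pop_size   # step 3
        if mean_fit < eta:                   # step 4
            break
    return best


def hybrid_optimize(fitness, bounds, refine, **ga_kwargs) -> Vector:
    """Steps 6-8: hand the GA result to a local quasi-Newton refinement."""
    x0 = genetic_stage(fitness, bounds, **ga_kwargs)
    return refine(fitness, x0)
```

The genetic stage only needs to land near a good region; the `refine` callable is then expected to supply the final precision, which is why the result depends little on the initial population.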

In step 2, the feature indicator set is screened and constructed by computing the importance of each feature indicator with the Gini index of the random forest model and selecting the optimal combination of feature indicators according to the importance scores.

The importance score of a variable is denoted VIM and the Gini index is denoted GI. The GI of node m is computed as

$$GI_m = 1 - \sum_{k=1}^{K} p_{mk}^2$$

where K is the number of classes and $p_{mk}$ is the proportion of class k in node m.

The importance of feature $X_j$ at node m, $VIM_{jm}^{(Gini)}$, is the change in GI before and after node m is split:

$$VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r$$

where $GI_l$ and $GI_r$ are the GI values of the two new nodes after the split.

If the set of nodes at which feature $X_j$ appears in decision tree i is M, then the importance of $X_j$ in the i-th tree is

$$VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)}$$

If the random forest contains n trees in total, then

$$VIM_j^{(Gini)} = \sum_{i=1}^{n} VIM_{ij}^{(Gini)}$$

The importance scores are normalized:

$$VIM_j = \frac{VIM_j^{(Gini)}}{\sum_{j'=1}^{c} VIM_{j'}^{(Gini)}}$$

where $VIM_j$ is the Gini importance score of feature $X_j$ and c is the number of features.
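The importance calculation can be illustrated with a small Python sketch. It assumes that each split is described by the class counts of the parent node and of its two child nodes, and it uses the plain difference GI_m - GI_l - GI_r suggested by the definitions above; many implementations instead weight each child's GI by its share of samples. The feature names `turnover` and `leverage` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# One split: (feature name, parent class counts, left child counts, right child counts)
Split = Tuple[str, Sequence[int], Sequence[int], Sequence[int]]


def gini(counts: Sequence[int]) -> float:
    """GI of one node: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_importance(parent, left, right) -> float:
    """Change in GI caused by one split (unweighted form)."""
    return gini(parent) - gini(left) - gini(right)


def forest_importance(forest: List[List[Split]]) -> Dict[str, float]:
    """Sum split importances per feature over all trees, then normalize."""
    raw: Dict[str, float] = {}
    for tree in forest:
        for feature, parent, left, right in tree:
            raw[feature] = raw.get(feature, 0.0) + split_importance(parent, left, right)
    total = sum(raw.values())
    return {feature: value / total for feature, value in raw.items()}


# Two toy trees with hypothetical features.
forest = [
    [("turnover", [6, 6], [6, 2], [0, 4]), ("leverage", [6, 2], [6, 0], [0, 2])],
    [("leverage", [6, 6], [6, 0], [0, 6])],
]
scores = forest_importance(forest)
```

With these toy counts, `turnover` contributes 0.125 and `leverage` contributes 0.375 + 0.5 = 0.875, so the normalized scores are 0.125 and 0.875; features would then be ranked by these scores to choose the indicator combination.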

Beneficial effects of the invention:

1) The invention establishes a stock insider trading identification model that accurately identifies whether insider trading exists for a test target;

2) The quasi-Newton method is combined with a genetic algorithm so that the parameters of the random forest model are optimized accurately to the optimal solution, and the solution depends only weakly on the initial value;

3) The optimal combination of feature indicators is selected using the Gini importance scores of the random forest model, which gives the insider trading identification method of the invention good accuracy and high efficiency;

4) A naive Bayes classifier is used, so the insider trading identification method of the invention is easy to implement and stable in performance, and its robustness and accuracy can improve further as sample data continues to grow.

Brief description of the drawings

The invention is further described below with reference to the drawings and embodiments.

Fig. 1 is a flow chart of the insider trading identification method based on random forest and a naive Bayes model.

Fig. 2 is a flow chart of the optimization method combining the quasi-Newton method with a genetic algorithm.

Fig. 3 is a schematic diagram of the random forest model.

Fig. 4 is a flow chart of the naive Bayes classification process.

Fig. 5 is a flow chart of the genetic algorithm.

Detailed description of the embodiments

As shown in Figs. 1 to 3, the insider trading identification method based on random forest and a naive Bayes model comprises the following specific steps:

Step 1: Use a web crawler to obtain, for the stock samples in which insider trading occurred as announced by the China Securities Regulatory Commission (CSRC), the corresponding feature indicators under different event time windows, specifically including securities market microstructure indicators, corporate financial indicators, and corporate governance indicators; obtain insider trading sample data sets under different event time windows;

Step 2: Compute the importance of the feature indicators using the Gini index of the random forest model, and select the optimal combination of feature indicators according to the importance scores;

Step 3: Build a Bayesian identification model of insider trading from the selected feature indicator set, and train the Bayesian identification model with the insider trading sample data set as the training data set;

Step 4: Obtain a test target and construct a test target data set;

Step 8: Determine whether there is a next test target;

Step 8.1: If there is a next test target, go to step 4;

Step 8.2: If there is no next test target, end.

In step 1, the feature indicators include the company financial indicators and corporate governance indicators disclosed for individual stocks in the securities market, as well as the securities market microstructure indicators of individual stocks calculated with the CAPM (Capital Asset Pricing Model) and GARCH (Generalized AutoRegressive Conditional Heteroskedasticity) models.

In step 2, the parameters of the random forest algorithm are determined by an optimization method that combines the quasi-Newton method with a genetic algorithm. This optimization method comprises the following steps:

Step 3: Compute the average fitness η' of the current population;

Step 4: Determine whether η' < η;

Step 4.1: If η' < η, go to step 5;

Step 4.2: If η' < η does not hold, go to step 2;

Step 5: Record the optimal chromosome value;

Step 7: Perform the quasi-Newton iteration;

Step 8: Determine whether the preset precision has been reached;

Step 8.1: If the preset precision has not been reached, go to step 7;

Step 8.2: If the preset precision has been reached, end.

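The iteration of the quasi-Newton method is:

$$x^{k+1} = x^k + \lambda_k d^k$$

where $d^k = -H_k^{-1} g_k$ is the Newton direction, with $H_k$ the quasi-Newton approximation of the Hessian matrix, $g_k = \nabla f(x^k)$ is the gradient of the objective function f, and $\lambda_k$ is the step length of the line search along the Newton direction.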
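One common quasi-Newton method is BFGS, and the iteration can be sketched in Python with it. The text does not say which quasi-Newton update is used, so BFGS, the central-difference gradient, the backtracking (Armijo) line search, and the stopping test on the gradient size are all assumptions. The sketch stores an approximation of the inverse Hessian directly, so the search direction is obtained by a matrix-vector product.

```python
from typing import Callable, List

Vector = List[float]


def num_grad(f: Callable[[Vector], float], x: Vector, h: float = 1e-6) -> Vector:
    """Central-difference estimate of the gradient g_k."""
    grad = []
    for i in range(len(x)):
        xp, xm = x[:], x[:]
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad


def bfgs(f: Callable[[Vector], float], x0: Vector,
         tol: float = 1e-6, max_iter: int = 200) -> Vector:
    n = len(x0)
    x = list(x0)
    # Inverse-Hessian approximation, started at the identity matrix.
    h_inv = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = num_grad(f, x)
    for _ in range(max_iter):
        if max(abs(v) for v in g) < tol:          # preset precision reached
            break
        d = [-sum(h_inv[i][j] * g[j] for j in range(n)) for i in range(n)]
        # Backtracking line search for the step length lambda_k.
        lam, fx = 1.0, f(x)
        slope = sum(gi * di for gi, di in zip(g, d))
        while (f([xi + lam * di for xi, di in zip(x, d)]) > fx + 1e-4 * lam * slope
               and lam > 1e-12):
            lam *= 0.5
        x_new = [xi + lam * di for xi, di in zip(x, d)]   # x^{k+1} = x^k + lambda_k d^k
        g_new = num_grad(f, x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:                            # curvature condition
            hy = [sum(h_inv[i][j] * y[j] for j in range(n)) for i in range(n)]
            yhy = sum(yi * hyi for yi, hyi in zip(y, hy))
            for i in range(n):
                for j in range(n):
                    h_inv[i][j] += ((1 + yhy / sy) * s[i] * s[j]
                                    - hy[i] * s[j] - s[i] * hy[j]) / sy
        x, g = x_new, g_new
    return x
```

A call such as `bfgs(lambda v: (v[0] - 1) ** 2 + 10 * (v[1] + 2) ** 2, [0.0, 0.0])` should converge to a point close to (1, -2). In the hybrid scheme, the starting point would be the best chromosome returned by the genetic algorithm.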

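In step 2, the feature indicator set is screened and constructed by selecting the optimal combination of feature indicators according to the importance scores. The importance score of a variable is denoted VIM and the Gini value is denoted GI. The GI of node m is computed as

$$GI_m = 1 - \sum_{k=1}^{K} p_{mk}^2$$

The importance of feature $X_j$ at node m, $VIM_{jm}^{(Gini)}$, is the change in GI before and after node m is split:

$$VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r$$

If the set of nodes at which feature $X_j$ appears in decision tree i is M, then the importance of $X_j$ in the i-th tree is:

$$VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)}$$

If the random forest contains n trees in total, then

$$VIM_j^{(Gini)} = \sum_{i=1}^{n} VIM_{ij}^{(Gini)}$$

The importance scores are then normalized:

$$VIM_j = \frac{VIM_j^{(Gini)}}{\sum_{j'=1}^{c} VIM_{j'}^{(Gini)}}$$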

As shown in Fig. 3, the goal of the random forest model is to combine multiple weak learners, such as single decision trees, into one strong learner. Suppose there are N samples with M features. The random forest model first resamples the data set with the bootstrap, each time randomly drawing N samples with replacement as the training data set of a single decision tree. At each node, the algorithm first randomly selects a certain number of variables and then finds among them the feature that provides the best split. Each decision tree then yields a classification or prediction result. For a classification problem, the class with the highest probability among the predicted classes is selected as the final prediction. In the random forest model, the invention optimizes the parameters using the quasi-Newton method combined with a genetic algorithm.
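The bootstrap sampling, random feature selection, and voting described above can be sketched in Python. To keep the example short, each "tree" is reduced to a single split (a decision stump), which is a simplification and not the model of the invention; the parameter names `ntree` and `mtry` follow the parameter table of this document, while the function names are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Row = Sequence[float]
Stump = Tuple[int, float, int, int]   # (feature index, threshold, left label, right label)


def gini(labels: Sequence[int]) -> float:
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def fit_stump(X: Sequence[Row], y: Sequence[int], features: Sequence[int]) -> Stump:
    """Pick the split with the lowest weighted Gini among the given features."""
    best = None
    for j in features:
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:                       # no valid split: predict the majority class
        majority = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def fit_forest(X: Sequence[Row], y: Sequence[int],
               ntree: int = 25, mtry: int = 1, seed: int = 0) -> List[Stump]:
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: N draws with replacement
        feats = rng.sample(range(m), mtry)           # random subset of features
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest: List[Stump], row: Row) -> int:
    """Majority vote over the individual trees."""
    votes = Counter(left if row[j] <= t else right for j, t, left, right in forest)
    return votes.most_common(1)[0][0]
```

A full implementation would grow each tree recursively and limit it with parameters such as the minimum node size, but the sampling and voting logic stays the same.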

A genetic algorithm has strong global search capability and a fast initial search speed, but its search slows in the later stages and the result often fails to reach the desired precision. The invention therefore combines the quasi-Newton method with the genetic algorithm to remedy this shortcoming of the conventional genetic algorithm: the iteration result obtained by the genetic algorithm is used as the initial value of the quasi-Newton iteration. The flow chart of the genetic algorithm is shown in Fig. 5.

During optimization, a genetic algorithm uses fitness to evaluate how close each individual in the population is to the optimal solution, and on that basis decides whether the individual enters the next generation of evolution. The choice of fitness function directly affects the convergence speed of the genetic algorithm and whether the optimal solution can be found. The invention sets the fitness function to the reciprocal of the classification accuracy achieved by the corresponding feature combination.
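As a minimal Python illustration of this fitness definition, the function below returns the reciprocal of the classification accuracy, so smaller values indicate better individuals. The function name and the handling of zero accuracy are assumptions made for the sketch.

```python
from typing import Sequence


def fitness_from_predictions(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Reciprocal of classification accuracy; smaller values are better."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and equally long")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    if correct == 0:
        return float("inf")   # assumption: zero accuracy maps to the worst fitness
    return len(y_true) / correct
```

Because accuracy lies between 0 and 1, this fitness is never below 1, and a perfect classifier scores exactly 1. A target threshold η for the average fitness would therefore have to be chosen above 1.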

As shown in Fig. 4, the naive Bayes insider trading identification process is as follows:

Given a data sample X with no label indicating whether insider trading occurred, X is represented by an n-dimensional feature vector $X = \{x_1, x_2, \ldots, x_n\}$ that describes the values of the sample on the n features $\{A_1, A_2, \ldots, A_n\}$. Class $C_1$ denotes insider trading and class $C_2$ denotes no insider trading. Sample X is assigned to class $C_i$ on the condition that

$$P(C_i \mid X) > P(C_j \mid X), \quad 1 \le j \le m,\ j \ne i$$

that is, the probability that the sample belongs to class $C_i$ is greater than the probability that it belongs to any other class, where m is the number of classes (here m = 2).

According to Bayes' theorem:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$$

where P(X) is the probability that an arbitrary data object matches sample X, which is constant across all classes. $P(C_i)$ is the probability that an arbitrary data object belongs to class $C_i$ and can be computed as $P(C_i) = s_i / s$, where $s_i$ is the number of training samples in class $C_i$ and s is the total number of training samples.

Given the insider trading label of a sample, the feature values are assumed to be conditionally independent of one another, so $P(X \mid C_i)$ is computed as

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$

The probabilities $P(x_k \mid C_i)$ are estimated from the training samples. In the invention, each feature $A_k$ is continuous and its values are assumed to follow a Gaussian distribution. Therefore

$$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}^2) = \frac{1}{\sqrt{2\pi\sigma_{C_i}^2}} \exp\left(-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}\right)$$

where $g(x_k, \mu_{C_i}, \sigma_{C_i}^2)$ is the Gaussian density function of feature $A_k$, and $\mu_{C_i}$ and $\sigma_{C_i}^2$ are the mean and variance of feature $A_k$ over the training samples of class $C_i$.

For the test target data set E, $P(X \mid C_i) P(C_i)$ is computed for each class $C_i$. Sample E is assigned to class $C_i$ if and only if $P(C_i \mid X) > P(C_j \mid X)$, $1 \le j \le m$, $j \ne i$.
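The classification rule can be sketched in Python as a Gaussian naive Bayes classifier. This is an illustrative sketch rather than the implementation of the invention: it works with log-probabilities for numerical stability and adds a small constant to each variance to avoid division by zero, both of which are assumptions; labels 1 and 0 stand for insider trading and no insider trading.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Sequence[float]
Model = Dict[int, Tuple[float, List[Tuple[float, float]]]]  # label -> (prior, [(mean, var)])


def fit_gaussian_nb(X: Sequence[Row], y: Sequence[int]) -> Model:
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model: Model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)                    # P(C_i) = s_i / s
        stats = []
        for column in zip(*rows):                     # one column per feature A_k
            mu = sum(column) / len(column)
            var = sum((v - mu) ** 2 for v in column) / len(column)
            stats.append((mu, var + 1e-9))            # small constant: assumption
        model[label] = (prior, stats)
    return model


def log_scores(model: Model, x: Row) -> Dict[int, float]:
    """log P(C_i) + sum_k log P(x_k | C_i) for every class."""
    scores = {}
    for label, (prior, stats) in model.items():
        total = math.log(prior)
        for value, (mu, var) in zip(x, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        scores[label] = total
    return scores


def predict(model: Model, x: Row) -> int:
    scores = log_scores(model, x)
    return max(scores, key=scores.get)                # class with the largest posterior
```

Since P(X) is the same for every class, comparing P(X | C_i) P(C_i), or its logarithm as done here, selects the same class as comparing the posteriors directly.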

In the embodiment, a web crawler was used to obtain the stock samples in which insider trading occurred between 2001 and 2017, as announced by the China Securities Regulatory Commission (CSRC), together with matched white samples from the same industry and of comparable size in which no similar case had occurred, for a total of 335 samples: 171 black samples and 164 white samples.

A total of 18 feature indicators were selected from four aspects: the listed company's equity structure, financial data, governance system, and securities market microstructure, as shown in Table 1.

Table 1: Feature indicators

The parameters of the random forest model were optimized using the quasi-Newton method combined with a genetic algorithm. The parameter optimization results are shown in Table 2.

Table 2: Random forest model parameters

Parameter	Optimal value
Number of variables used for the binary split at each node, mtry	4
Number of trees in the random forest, ntree	341
Minimum number of decision tree nodes, nodesize	9
Maximum number of decision tree nodes, maxnodes	17

In one embodiment, to test how well the insider trading identification method of the invention performs on unseen stock samples, the collected data set is divided into a training set and a prediction set in a ratio of 8:2. The training set is used as learning data to train the naive Bayes identification model. The prediction set is then used as test input to the Bayesian identification model to carry out insider trading identification.
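The 8:2 division can be sketched in Python as follows. The text does not state how the samples are assigned to the two sets, so the random shuffle, the fixed seed, and the rounding rule are assumptions, and the function name is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(samples: Sequence[T], train_ratio: float = 0.8,
                     seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and cut them into a training set and a prediction set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(100))   # 80 training items, 20 prediction items
```

Keeping the prediction set out of training is what allows the reported accuracy to reflect performance on stocks the model has not seen.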

As shown in Table 3, the overall insider trading identification accuracy for an event time window of 30 is 76.19%.

Table 3: Insider trading identification results for an event time window of 30

As shown in Table 4, the overall insider trading identification accuracy for an event time window of 60 is 74.6%.

Table 4: Insider trading identification results for an event time window of 60

As shown in Table 5, the overall insider trading identification accuracy for an event time window of 90 is 79.37%.

Table 5: Insider trading identification results for an event time window of 90

Claims

1. An insider trading identification method based on random forest and a naive Bayes model, characterized in that the specific steps are as follows:

Step 2: Screen and construct a feature indicator set using a random forest model;

Step 4: Obtain a test target and construct a test target data set;

Step 7: Add the test target data set and the result of whether insider trading occurred to the sample data set, and retrain to update the Bayesian identification model;

Step 8: Determine whether there is a next test target;

Step 8.1: If there is a next test target, go to step 4;

Step 8.2: If there is no next test target, end.

2. The insider trading identification method based on random forest and a naive Bayes model according to claim 1, characterized in that, in step 2, the parameters of the random forest algorithm are determined by an optimization method that combines the quasi-Newton method with a genetic algorithm, the optimization method comprising the following steps:

Step 3: Compute the average fitness η' of the current population;

Step 4: Determine whether η' < η;

Step 4.1: If η' < η, go to step 5;

Step 4.2: If η' < η does not hold, go to step 2;

Step 5: Record the optimal chromosome value;

Step 7: Perform the quasi-Newton iteration;

Step 8: Determine whether the preset precision has been reached;

Step 8.1: If the preset precision has not been reached, go to step 7;

Step 8.2: If the preset precision has been reached, end.

3. The insider trading identification method based on random forest and a naive Bayes model according to claim 1 or 2, characterized in that, in step 2, the feature indicator set is screened and constructed by computing the importance of each feature indicator with the Gini index of the random forest model and selecting the optimal combination of feature indicators according to the importance scores.