CN111428790A - Double-accuracy weighted random forest algorithm based on particle swarm optimization - Google Patents

Double-accuracy weighted random forest algorithm based on particle swarm optimization

Info

Publication number
CN111428790A
CN111428790A
Authority
CN
China
Prior art keywords
decision tree
accuracy
algorithm
particle swarm
particle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010223029.1A
Other languages
Chinese (zh)
Inventor
张文波
冯永新
郝颖
付立冬
王晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Ligong University
Original Assignee
Shenyang Ligong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Ligong University filed Critical Shenyang Ligong University
Priority to CN202010223029.1A priority Critical patent/CN111428790A/en
Publication of CN111428790A publication Critical patent/CN111428790A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a double-accuracy weighted random forest algorithm based on particle swarm optimization, comprising the following steps. S1: determining an original data set D, randomly determining the number K of decision trees and the feature number m of each decision tree, wherein m ≤ n, and determining the pretest sample rate X; S2: sampling the training data set S_k by the Bootstrap sampling method to obtain the out-of-bag data O_k and the training subset T_k; S3: generating the k-th decision tree according to the C4.5 algorithm and calculating the final weight of the decision tree; S4: repeating S2 and S3 until the number of decision trees reaches K; S5: testing the test data with the decision tree set for classification; S6: taking the accuracy obtained in S5 as the fitness value of the particle swarm, performing iterative optimization with a particle swarm optimization algorithm, comparing against the historical classification accuracy, and finally selecting the optimal model parameters. The invention improves the accuracy of the random forest model, avoids the reduction in algorithm accuracy caused by selecting parameters by experience, and further improves the performance of the algorithm.

Description

Double-accuracy weighted random forest algorithm based on particle swarm optimization
Technical Field
The invention relates to a double-accuracy weighted random forest algorithm based on particle swarm optimization, and belongs to the field of data processing.
Background
Random forest is a supervised ensemble learning classification technique whose model is composed of a group of decision tree classifiers; when the model classifies data, the final result is determined by an average vote over the classification results of the individual decision trees. This averaging may allow a poorly grown decision tree to affect the final classification result, and it is prone to tied votes. The traditional random forest algorithm ensures the independence of and differences between the decision trees by injecting randomness into both the training sample space and the attribute space; it overcomes the over-fitting problem of decision trees well and is robust to noise and outliers. However, the practical effect is not ideal because of the randomness of the training samples and attributes and the uncertainty of the decision trees. It is therefore of great significance to design an improved weighted random forest algorithm.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a double-accuracy weighted random forest algorithm based on particle swarm optimization; the algorithm has higher accuracy than the traditional random forest algorithm, and the probability of tied votes is greatly reduced.
The invention mainly adopts the technical scheme that:
a double-accuracy weighted random forest algorithm based on particle swarm optimization comprises the following steps:
step 1: determining an original dataset D { (x)1,y1),(x2,y2),...(xN,yN) And (c) the step of (c) in which,
Figure BDA0002426737540000011
for the input example, n is the total number of features, yi∈{Y1,Y2,...,YNAnd (5) randomly determining the number K of decision trees, the characteristic number m (m is less than or equal to N) of the decision numbers and determining predicted samples, wherein the i is 1,2, and the N is the sample capacityThe local rate X is the ratio of the number of pretest data sets to the total number of the data sets;
step 2: dividing an original data set D according to a pretest sample rate X to generate a pretest data set P corresponding to a kth decision treekAnd a training data set SkAnd using Bootstrap sampling method to train data set SkSampling to obtain data O outside the bagkTraining subset Tk
Step 3: randomly selecting m feature attributes from the n features as node classification features, and generating the k-th decision tree from the training data T_k according to the C4.5 algorithm; then testing the O_k and P_k data sets with this decision tree, calculating the weights w_Ok and w_Pk of the decision tree according to formulas (1) and (2), and calculating the final weight w_k of the decision tree according to formula (3):
[Formulas (1)-(3), defining w_Ok, w_Pk, and the final weight w_k, appear only as images in the original publication and could not be extracted.]
Step 4: repeating Step 2 and Step 3 until the number of decision trees reaches K, obtaining the decision tree set and the weight of each decision tree;
Step 5: testing the test data with the decision tree set, classifying according to formula (4), and obtaining the accuracy of the random forest algorithm:
[Formula (4), the weighted classification rule, appears only as an image in the original publication and could not be extracted.]
Step 6: taking the accuracy obtained in Step 5 as the fitness value of the particle swarm, performing iterative optimization of the hyper-parameters, namely the number K of decision trees and the pretest sample rate X, with a particle swarm optimization algorithm, comparing against the historical classification accuracy, and finally selecting the optimal model parameters.
Preferably, the particle swarm optimization algorithm in Step 6 comprises the following specific steps:
S6-1: forming a space vector (K, X) from the number K of decision trees and the pretest sample rate X as a particle in the particle swarm optimization algorithm, wherein the particles are initialized uniformly at random over the whole search space;
S6-2: updating the velocity and position of each particle in each dimension, the whole swarm being updated at each time step, wherein the vector equations for the velocity and position of the particles are shown in equations (5) and (6):
v_id = w × v_id + c1 × rand() × (p_id - x_id) + c2 × rand() × (p_gd - x_id)    (5)
x_id = x_id + v_id    (6)
where w is the inertia weight, usually in the range 0.4 to 1.2; rand() is a random function generating random numbers in the range [0, 1]; and c1 and c2 are acceleration constants representing the particle's perception of the population.
Advantageous effects: the invention provides a double-accuracy weighted random forest algorithm based on particle swarm optimization. When the decision trees are generated, each decision tree is tested on both the out-of-bag data O_k and the pretest data set P_k; the accuracies of the two tests serve as weights of the decision tree, and the two weights are combined into the final weight of the decision tree. This double-accuracy weighting gives each decision tree a weight that is positively correlated with its training accuracy, so that in the final decision stage the model's classification result tends toward the correct result, which improves the accuracy of the random forest model. Meanwhile, when the hyper-parameters are selected, the particle swarm optimization algorithm searches for the optimal solution, which avoids the reduction in algorithm accuracy caused by selecting parameters by experience and further improves the performance and accuracy of the algorithm.
Drawings
FIG. 1 is a comparison of the data classification accuracy of the three algorithms;
FIG. 2 is a comparison of the data classification precision of the three algorithms;
FIG. 3 is a graph of the result of PSO optimization of the number K of decision trees;
FIG. 4 is a graph of the result of PSO optimization of the pretest sample rate X.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
A double-accuracy weighted random forest algorithm based on particle swarm optimization comprises the following steps:
Step 1: determining an original data set D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i ∈ R^n is the input instance, n is the number of features, and y_i ∈ {Y_1, Y_2, ..., Y_N}, i = 1, 2, ..., N; randomly determining the number K of decision trees, selecting the feature number m, and determining the pretest sample rate X, which is the ratio of the size of the pretest data set to the total size of the data set;
Step 2: dividing the original data set D according to the pretest sample rate X to generate the pretest data set P_k and the training data set S_k corresponding to the k-th decision tree, and sampling the training data set S_k by the Bootstrap sampling method to obtain the out-of-bag data O_k and the training subset T_k;
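As a concrete illustration of Step 2, the split-then-bootstrap procedure can be sketched as follows. This is a minimal sketch, not the patent's implementation; the function name and the uniform random split are assumptions, while the notation (P_k, S_k, T_k, O_k) follows the text:

```python
import numpy as np

def split_and_bootstrap(D_X, D_y, X_rate, rng):
    """Split D into pretest set P_k and training set S_k at rate X,
    then Bootstrap-sample S_k into training subset T_k; the rows of
    S_k never drawn become the out-of-bag data O_k."""
    N = len(D_y)
    idx = rng.permutation(N)
    n_pre = int(N * X_rate)                      # pretest sample rate X
    pre_idx, train_idx = idx[:n_pre], idx[n_pre:]
    P = (D_X[pre_idx], D_y[pre_idx])             # pretest data set P_k
    S_X, S_y = D_X[train_idx], D_y[train_idx]    # training data set S_k
    boot = rng.integers(0, len(S_y), len(S_y))   # Bootstrap draw, with replacement
    oob_mask = np.ones(len(S_y), dtype=bool)
    oob_mask[boot] = False                       # rows never drawn are out-of-bag
    T = (S_X[boot], S_y[boot])                   # training subset T_k
    O = (S_X[oob_mask], S_y[oob_mask])           # out-of-bag data O_k
    return P, T, O
```

Each call produces the data for one decision tree; repeating it K times yields the K pretest/training/out-of-bag triples consumed by Steps 3 and 4.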
Step 3: randomly selecting m feature attributes as node classification features, and generating the k-th decision tree from the training data T_k according to the C4.5 algorithm; then testing the O_k and P_k data sets with this decision tree, calculating the weights w_Ok and w_Pk of the decision tree according to formulas (1) and (2), and calculating the final weight w_k of the decision tree according to formula (3):
[Formulas (1)-(3), defining w_Ok, w_Pk, and the final weight w_k, appear only as images in the original publication and could not be extracted.]
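Formulas (1)-(3) survive only as images, so the exact weight expressions are not recoverable here. One plausible reading, consistent with the surrounding description (the tree's accuracy on O_k gives w_Ok, its accuracy on P_k gives w_Pk, and the two are combined into w_k), is sketched below; the simple average standing in for formula (3) is purely an assumption:

```python
def tree_weight(acc_oob, acc_pretest):
    """Hypothetical double-accuracy weight for one decision tree.
    acc_oob     ~ w_Ok, the tree's accuracy on the out-of-bag data O_k (formula (1))
    acc_pretest ~ w_Pk, the tree's accuracy on the pretest data P_k (formula (2))
    The averaging below is an assumed stand-in for the unextracted formula (3)."""
    return (acc_oob + acc_pretest) / 2.0
```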
Step 4: repeating Step 2 and Step 3 until the number of decision trees reaches K, obtaining the decision tree set and the weight of each decision tree;
Step 5: testing the test data with the decision tree set and classifying according to formula (4) to obtain the accuracy of the random forest algorithm:
[Formula (4), the weighted classification rule, appears only as an image in the original publication and could not be extracted.]
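Formula (4) also survives only as an image. A common weighted-majority-vote rule that matches the description (each tree's vote scaled by its final weight w_k) would look like the following sketch; the argmax-over-summed-weights form is an assumption:

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Weighted majority vote over the K decision trees.
    predictions[k] is tree k's predicted class for one sample and
    weights[k] its double-accuracy weight w_k; the class with the
    largest total weight wins."""
    score = defaultdict(float)
    for pred, w_k in zip(predictions, weights):
        score[pred] += w_k
    return max(score, key=score.get)
```

With all weights equal this reduces to the traditional random forest's average vote; unequal weights let accurate trees dominate and make tied votes far less likely.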
Step 6: taking the accuracy obtained in Step 5 as the fitness value of the particle swarm, performing iterative optimization of the hyper-parameters, namely the number K of decision trees and the pretest sample rate X, with a particle swarm optimization algorithm, comparing against the historical classification accuracy, and finally selecting the optimal model parameters.
In the invention, the iterative optimization in Step 6 proceeds as follows: the accuracy obtained after the first classification is used as the fitness value of the particle swarm, and the particle swarm algorithm performs parameter optimization; each time the parameters are updated, Step 2 to Step 5 are repeated to obtain a new result, and so on, so that historical data of all previous classification results (accuracies) are accumulated.
Preferably, the particle swarm optimization algorithm in Step6 comprises the following specific steps:
S6-1: forming a space vector (K, X) from the number K of decision trees and the pretest sample rate X as a particle in the particle swarm optimization algorithm, wherein the particles are initialized uniformly at random over the whole search space;
S6-2: updating the velocity and position of each particle in each dimension, the whole swarm being updated at each time step, wherein the vector equations for the velocity and position of the particles are shown in equations (5) and (6):
v_id = w × v_id + c1 × rand() × (p_id - x_id) + c2 × rand() × (p_gd - x_id)    (5)
x_id = x_id + v_id    (6)
where w is the inertia weight, usually in the range 0.4 to 1.2; rand() is a random function generating random numbers in the range [0, 1]; and c1 and c2 are acceleration constants representing the particle's perception of the population.
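The whole of Step 6, with equations (5) and (6) driving a search over the hyper-parameters (K, X) and classification accuracy as the fitness value, can be sketched generically. The function below is a plain textbook PSO, matching the text's statement that the particle swarm algorithm itself is unmodified; in the patent's setting, `fitness` would run Step 2 through Step 5 with the candidate parameters and return the accuracy, but any callable works, and all names and default constants here are assumptions:

```python
import random

def pso_optimize(fitness, bounds, n_particles=10, iters=20,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize `fitness` over a box `bounds` with standard PSO.
    Velocity update follows eq. (5); position update follows eq. (6),
    clamped to the search bounds."""
    rnd = random.Random(seed)
    dim = len(bounds)
    pos = [[rnd.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # personal best positions p_id
    pbest_f = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]       # global best position p_gd
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rnd.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rnd.random() * (gbest[d] - pos[i][d]))  # eq. (5)
                pos[i][d] = min(max(pos[i][d] + vel[i][d],
                                    bounds[d][0]), bounds[d][1])            # eq. (6)
            f = fitness(pos[i])
            if f > pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f > gbest_f:
                    gbest, gbest_f = pos[i][:], f
    return gbest, gbest_f
```

For the patent's two-dimensional case, `bounds` would be the admissible ranges of K and X, with K rounded to an integer inside `fitness`.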
In the invention, the particle swarm algorithm is a conventional algorithm, and formulas (5) and (6) are the general particle swarm formulas; the particle swarm algorithm itself is not improved. Only the random forest algorithm is improved, and the conventional particle swarm algorithm is used to optimize its parameters.
The simulation experiment adopts the common evaluation indices accuracy and precision to measure the performance of the detection mechanism. Accuracy and precision are defined as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)
Here, TP (true positive) means the sample is a fault-type sample and the fault detection model also classifies it as the fault type; TN (true negative) means the sample is a normal-type sample and the model also classifies it as the normal type; FP (false positive) means the sample is a normal-type sample but the model wrongly classifies it as the fault type; and FN (false negative) means the sample is a fault-type sample but the model wrongly classifies it as the normal type. The accuracy is the proportion of correctly classified samples among all samples; the precision is the proportion of samples correctly classified as the fault type among all samples classified as the fault type.
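The two evaluation indices reduce to simple counts; a direct transcription of the definitions above:

```python
def accuracy_precision(tp, tn, fp, fn):
    """Accuracy: correctly classified samples over all samples.
    Precision: samples correctly flagged as the fault type over all
    samples flagged as the fault type."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    return accuracy, precision
```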
The classification accuracy and precision of the three algorithms are shown in FIGS. 1 and 2, and FIGS. 3 and 4 plot the accuracy obtained after each PSO iteration when optimizing the DAWRF algorithm. The simulation results show that the accuracy of the DAWRF algorithm drops when it is not optimized by the PSO, because the randomly selected hyper-parameters are unreasonable; after iterative optimization by the PSO algorithm, the accuracy of the algorithm reaches a peak and the optimal hyper-parameters can be found. Therefore, the PSO-optimized DAWRF algorithm performs better in both accuracy and precision and has obvious advantages over the traditional random forest algorithm.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (2)

1. A double-accuracy weighted random forest algorithm based on particle swarm optimization, characterized by comprising the following steps:
Step 1: determining an original data set D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i ∈ R^n is the input instance, n is the total number of features, y_i ∈ {Y_1, Y_2, ..., Y_N}, i = 1, 2, ..., N, and N is the sample size; randomly determining the number K of decision trees and the feature number m of the decision trees, wherein m ≤ n; and determining the pretest sample rate X, which is the ratio of the size of the pretest data set to the total size of the data set;
Step 2: dividing the original data set D according to the pretest sample rate X to generate the pretest data set P_k and the training data set S_k corresponding to the k-th decision tree, and sampling the training data set S_k by the Bootstrap sampling method to obtain the out-of-bag data O_k and the training subset T_k;
Step 3: randomly selecting m feature attributes from the n features as node classification features, and generating the k-th decision tree from the training data T_k according to the C4.5 algorithm; then testing the O_k and P_k data sets with this decision tree, calculating the weights w_Ok and w_Pk of the decision tree according to formulas (1) and (2), and calculating the final weight w_k of the decision tree according to formula (3):
[Formulas (1)-(3), defining w_Ok, w_Pk, and the final weight w_k, appear only as images in the original publication and could not be extracted.]
Step 4: repeating Step 2 and Step 3 until the number of decision trees reaches K, obtaining the decision tree set and the weight of each decision tree;
Step 5: testing the test data with the decision tree set, classifying according to formula (4), and obtaining the accuracy of the random forest algorithm:
[Formula (4), the weighted classification rule, appears only as an image in the original publication and could not be extracted.]
Step 6: taking the accuracy obtained in Step 5 as the fitness value of the particle swarm, performing iterative optimization of the hyper-parameters, namely the number K of decision trees and the pretest sample rate X, with a particle swarm optimization algorithm, comparing against the historical classification accuracy, and finally selecting the optimal model parameters.
2. The particle swarm optimization-based double-accuracy weighted random forest algorithm according to claim 1, wherein the particle swarm optimization algorithm in Step6 comprises the following specific steps:
S6-1: forming a space vector (K, X) from the number K of decision trees and the pretest sample rate X as a particle in the particle swarm optimization algorithm, wherein the particles are initialized uniformly at random over the whole search space;
S6-2: updating the velocity and position of each particle in each dimension, the whole swarm being updated at each time step, wherein the vector equations for the velocity and position of the particles are shown in equations (5) and (6):
v_id = w × v_id + c1 × rand() × (p_id - x_id) + c2 × rand() × (p_gd - x_id)    (5)
x_id = x_id + v_id    (6)
where w is the inertia weight, usually in the range 0.4 to 1.2; rand() is a random function generating random numbers in the range [0, 1]; and c1 and c2 are acceleration constants representing the particle's perception of the population.
CN202010223029.1A 2020-03-26 2020-03-26 Double-accuracy weighted random forest algorithm based on particle swarm optimization Pending CN111428790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010223029.1A CN111428790A (en) 2020-03-26 2020-03-26 Double-accuracy weighted random forest algorithm based on particle swarm optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010223029.1A CN111428790A (en) 2020-03-26 2020-03-26 Double-accuracy weighted random forest algorithm based on particle swarm optimization

Publications (1)

Publication Number Publication Date
CN111428790A true CN111428790A (en) 2020-07-17

Family

ID=71548815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010223029.1A Pending CN111428790A (en) 2020-03-26 2020-03-26 Double-accuracy weighted random forest algorithm based on particle swarm optimization

Country Status (1)

Country Link
CN (1) CN111428790A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182221A (en) * 2020-10-12 2021-01-05 哈尔滨工程大学 Knowledge retrieval optimization method based on improved random forest
CN112182221B (en) * 2020-10-12 2022-04-05 哈尔滨工程大学 Knowledge retrieval optimization method based on improved random forest
CN112330064A (en) * 2020-11-26 2021-02-05 中国石油大学(华东) New drilling workload prediction method based on ensemble learning
CN116561554A (en) * 2023-04-18 2023-08-08 南方电网电力科技股份有限公司 Feature extraction method, system, equipment and medium of boiler soot blower

Similar Documents

Publication Publication Date Title
CN111860638B (en) Parallel intrusion detection method and system based on unbalanced data deep belief network
CN111428790A (en) Double-accuracy weighted random forest algorithm based on particle swarm optimization
CN108319987B (en) Filtering-packaging type combined flow characteristic selection method based on support vector machine
CN112288191B (en) Ocean buoy service life prediction method based on multi-class machine learning method
CN107292350A (en) The method for detecting abnormality of large-scale data
CN109934269B (en) Open set identification method and device for electromagnetic signals
CN110135167B (en) Edge computing terminal security level evaluation method for random forest
CN109816044A (en) A kind of uneven learning method based on WGAN-GP and over-sampling
CN112087447B (en) Rare attack-oriented network intrusion detection method
CN110363230B (en) Stacking integrated sewage treatment fault diagnosis method based on weighted base classifier
Al Iqbal et al. Knowledge based decision tree construction with feature importance domain knowledge
CN112784031B (en) Method and system for classifying customer service conversation texts based on small sample learning
CN115580445A (en) Unknown attack intrusion detection method, device and computer readable storage medium
CN113674862A (en) Acute renal function injury onset prediction method based on machine learning
CN110177112B (en) Network intrusion detection method based on double subspace sampling and confidence offset
CN113343123A (en) Training method and detection method for generating confrontation multiple relation graph network
CN115659323A (en) Intrusion detection method based on information entropy theory and convolution neural network
CN114548306A (en) Intelligent monitoring method for early drilling overflow based on misclassification cost
Sudharson et al. Performance Evaluation of Improved Adaboost Framework in Randomized Phases Through Stumps
CN114120049A (en) Long tail distribution visual identification method based on prototype classifier learning
Li et al. Study on the Prediction of Imbalanced Bank Customer Churn Based on Generative Adversarial Network
Zhang et al. Unbalanced data classification based on oversampling and integrated learning
CN113610148B (en) Fault diagnosis method based on bias weighted AdaBoost
Wu et al. Unbalanced data classification algorithm based on hybrid sampling and ensemble learning
Budiman et al. Optimization Of Classification Results By Minimizing Class Imbalance On Decision Tree Algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200717