CN111428790A - Double-accuracy weighted random forest algorithm based on particle swarm optimization - Google Patents

Double-accuracy weighted random forest algorithm based on particle swarm optimization

Info

Publication number
CN111428790A
CN111428790A
Authority
CN
China
Prior art keywords
decision tree
accuracy
algorithm
particle swarm
particle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010223029.1A
Other languages
Chinese (zh)
Inventor
张文波
冯永新
郝颖
付立冬
王晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Ligong University
Original Assignee
Shenyang Ligong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Ligong University filed Critical Shenyang Ligong University
Priority to CN202010223029.1A priority Critical patent/CN111428790A/en
Publication of CN111428790A publication Critical patent/CN111428790A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a double-accuracy weighted random forest algorithm based on particle swarm optimization, comprising the following steps. S1: determining an original data set D, randomly determining the number K of decision trees and the feature number m of each decision tree, wherein m ≤ n, and determining the pretest sample rate X; S2: sampling the training data set S_k by the Bootstrap sampling method to obtain the out-of-bag data O_k and the training subset T_k; S3: generating the k-th decision tree according to the C4.5 algorithm and calculating the final weight of the decision tree; S4: repeating S2 and S3 until the number of decision trees reaches K; S5: testing the test data with the decision tree set for classification; S6: taking the accuracy obtained in S5 as the fitness value of the particle swarm, performing iterative optimization with a particle swarm optimization algorithm, comparing against the historical classification accuracy, and finally selecting the optimal model parameters. The invention improves the accuracy of the random forest model, avoids the reduction in algorithm accuracy caused by selecting parameters by experience, and further improves the performance of the algorithm.

Description

Double-accuracy weighted random forest algorithm based on particle swarm optimization
Technical Field
The invention relates to a double-accuracy weighted random forest algorithm based on particle swarm optimization, and belongs to the field of data processing.
Background
Random forest is a supervised ensemble learning classification technique whose model is composed of a group of decision tree classifiers; when the model classifies data, the final result is determined by an average vote over the classification results of the individual decision trees. This averaging may allow a poorly grown decision tree to affect the final classification result, and it is prone to tied votes. The traditional random forest algorithm ensures the independence of and differences between the decision trees by injecting randomness into both the training sample space and the attribute space; it overcomes the over-fitting problem of decision trees well and is robust to noise and outliers. However, the practical effect is not ideal because of the randomness of the training samples and attributes and the uncertainty of the decision trees. It is therefore of great significance to design an improved weighted random forest algorithm.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a double-accuracy weighted random forest algorithm based on particle swarm optimization; the algorithm has higher accuracy than the traditional random forest algorithm, and the probability of tied votes is greatly reduced.
The invention mainly adopts the technical scheme that:
a double-accuracy weighted random forest algorithm based on particle swarm optimization comprises the following steps:
step 1: determining an original dataset D { (x)1,y1),(x2,y2),...(xN,yN) And (c) the step of (c) in which,
Figure BDA0002426737540000011
for the input example, n is the total number of features, yi∈{Y1,Y2,...,YNAnd (5) randomly determining the number K of decision trees, the characteristic number m (m is less than or equal to N) of the decision numbers and determining predicted samples, wherein the i is 1,2, and the N is the sample capacityThe local rate X is the ratio of the number of pretest data sets to the total number of the data sets;
step 2: dividing an original data set D according to a pretest sample rate X to generate a pretest data set P corresponding to a kth decision treekAnd a training data set SkAnd using Bootstrap sampling method to train data set SkSampling to obtain data O outside the bagkTraining subset Tk
Step 3: randomly selecting m feature attributes from the n features as node classification features, and generating the k-th decision tree from the training data T_k according to the C4.5 algorithm; then testing the O_k and P_k data sets with this decision tree, calculating the weights w_Ok and w_Pk of the decision tree according to formulas (1) and (2), and calculating the final weight w_k of the decision tree according to formula (3):
[Formulas (1)-(3), defining w_Ok, w_Pk, and the final weight w_k, appear only as images in the original publication and could not be extracted.]
Step 4: repeating Step 2 and Step 3 until the number of decision trees reaches K, obtaining the decision tree set and the weight of each decision tree;
Step 5: testing the test data with the decision tree set, classifying according to formula (4), and obtaining the accuracy of the random forest algorithm:
[Formula (4), the weighted classification rule, appears only as an image in the original publication and could not be extracted.]
Step 6: taking the accuracy obtained in Step 5 as the fitness value of the particle swarm, performing iterative optimization of the hyper-parameters, namely the number K of decision trees and the pretest sample rate X, with a particle swarm optimization algorithm, comparing against the historical classification accuracy, and finally selecting the optimal model parameters.
Preferably, the particle swarm optimization algorithm in Step 6 comprises the following specific steps:
S6-1: forming a space vector (K, X) from the number K of decision trees and the pretest sample rate X as a particle in the particle swarm optimization algorithm, wherein the particles are initialized uniformly at random over the whole search space;
S6-2: updating the velocity and position of each particle in each dimension, the whole swarm being updated at each time step, wherein the vector equations for the velocity and position of the particles are shown in equations (5) and (6):
v_id = w × v_id + c1 × rand() × (p_id - x_id) + c2 × rand() × (p_gd - x_id)    (5)
x_id = x_id + v_id    (6)
where w is the inertia weight, usually in the range 0.4 to 1.2; rand() is a random function generating random numbers in the range [0, 1]; and c1 and c2 are acceleration constants representing the particle's perception of the population.
Advantageous effects: the invention provides a double-accuracy weighted random forest algorithm based on particle swarm optimization. When the decision trees are generated, each decision tree is tested on both the out-of-bag data O_k and the pretest data set P_k; the accuracies of the two tests serve as weights of the decision tree, and the two weights are combined into the final weight of the decision tree. This double-accuracy weighting gives each decision tree a weight that is positively correlated with its training accuracy, so that in the final decision stage the model's classification result tends toward the correct result, which improves the accuracy of the random forest model. Meanwhile, when the hyper-parameters are selected, the particle swarm optimization algorithm searches for the optimal solution, which avoids the reduction in algorithm accuracy caused by selecting parameters by experience and further improves the performance and accuracy of the algorithm.
Drawings
FIG. 1 is a comparison of the data classification accuracy of the three algorithms;
FIG. 2 is a comparison of the data classification precision of the three algorithms;
FIG. 3 is a graph of the result of PSO optimization of the number K of decision trees;
FIG. 4 is a graph of the result of PSO optimization of the pretest sample rate X.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
A double-accuracy weighted random forest algorithm based on particle swarm optimization comprises the following steps:
Step 1: determining an original data set D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i ∈ R^n is the input instance, n is the number of features, and y_i ∈ {Y_1, Y_2, ..., Y_N}, i = 1, 2, ..., N; randomly determining the number K of decision trees, selecting the feature number m, and determining the pretest sample rate X, which is the ratio of the size of the pretest data set to the total size of the data set;
Step 2: dividing the original data set D according to the pretest sample rate X to generate the pretest data set P_k and the training data set S_k corresponding to the k-th decision tree, and sampling the training data set S_k by the Bootstrap sampling method to obtain the out-of-bag data O_k and the training subset T_k;
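As a concrete illustration of Step 2, the split-then-bootstrap procedure can be sketched as follows. This is a minimal sketch, not the patent's implementation; the function name and the uniform random split are assumptions, while the notation (P_k, S_k, T_k, O_k) follows the text:

```python
import numpy as np

def split_and_bootstrap(D_X, D_y, X_rate, rng):
    """Split D into pretest set P_k and training set S_k at rate X,
    then Bootstrap-sample S_k into training subset T_k; the rows of
    S_k never drawn become the out-of-bag data O_k."""
    N = len(D_y)
    idx = rng.permutation(N)
    n_pre = int(N * X_rate)                      # pretest sample rate X
    pre_idx, train_idx = idx[:n_pre], idx[n_pre:]
    P = (D_X[pre_idx], D_y[pre_idx])             # pretest data set P_k
    S_X, S_y = D_X[train_idx], D_y[train_idx]    # training data set S_k
    boot = rng.integers(0, len(S_y), len(S_y))   # Bootstrap draw, with replacement
    oob_mask = np.ones(len(S_y), dtype=bool)
    oob_mask[boot] = False                       # rows never drawn are out-of-bag
    T = (S_X[boot], S_y[boot])                   # training subset T_k
    O = (S_X[oob_mask], S_y[oob_mask])           # out-of-bag data O_k
    return P, T, O
```

Each call produces the data for one decision tree; repeating it K times yields the K pretest/training/out-of-bag triples consumed by Steps 3 and 4.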
Step 3: randomly selecting m feature attributes as node classification features, and generating the k-th decision tree from the training data T_k according to the C4.5 algorithm; then testing the O_k and P_k data sets with this decision tree, calculating the weights w_Ok and w_Pk of the decision tree according to formulas (1) and (2), and calculating the final weight w_k of the decision tree according to formula (3):
[Formulas (1)-(3), defining w_Ok, w_Pk, and the final weight w_k, appear only as images in the original publication and could not be extracted.]
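Formulas (1)-(3) survive only as images, so the exact weight expressions are not recoverable here. One plausible reading, consistent with the surrounding description (the tree's accuracy on O_k gives w_Ok, its accuracy on P_k gives w_Pk, and the two are combined into w_k), is sketched below; the simple average standing in for formula (3) is purely an assumption:

```python
def tree_weight(acc_oob, acc_pretest):
    """Hypothetical double-accuracy weight for one decision tree.
    acc_oob     ~ w_Ok, the tree's accuracy on the out-of-bag data O_k (formula (1))
    acc_pretest ~ w_Pk, the tree's accuracy on the pretest data P_k (formula (2))
    The averaging below is an assumed stand-in for the unextracted formula (3)."""
    return (acc_oob + acc_pretest) / 2.0
```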
Step 4: repeating Step 2 and Step 3 until the number of decision trees reaches K, obtaining the decision tree set and the weight of each decision tree;
Step 5: testing the test data with the decision tree set and classifying according to formula (4) to obtain the accuracy of the random forest algorithm:
[Formula (4), the weighted classification rule, appears only as an image in the original publication and could not be extracted.]
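Formula (4) also survives only as an image. A common weighted-majority-vote rule that matches the description (each tree's vote scaled by its final weight w_k) would look like the following sketch; the argmax-over-summed-weights form is an assumption:

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Weighted majority vote over the K decision trees.
    predictions[k] is tree k's predicted class for one sample and
    weights[k] its double-accuracy weight w_k; the class with the
    largest total weight wins."""
    score = defaultdict(float)
    for pred, w_k in zip(predictions, weights):
        score[pred] += w_k
    return max(score, key=score.get)
```

With all weights equal this reduces to the traditional random forest's average vote; unequal weights let accurate trees dominate and make tied votes far less likely.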
Step 6: taking the accuracy obtained in Step 5 as the fitness value of the particle swarm, performing iterative optimization of the hyper-parameters, namely the number K of decision trees and the pretest sample rate X, with a particle swarm optimization algorithm, comparing against the historical classification accuracy, and finally selecting the optimal model parameters.
In the invention, the iterative optimization in Step 6 proceeds as follows: the accuracy obtained after the first classification is used as the fitness value of the particle swarm, and the particle swarm algorithm performs parameter optimization; each time the parameters are updated, Step 2 to Step 5 are repeated to obtain a new result, and so on, so that historical data of all previous classification results (accuracies) are accumulated.
Preferably, the particle swarm optimization algorithm in Step6 comprises the following specific steps:
S6-1: forming a space vector (K, X) from the number K of decision trees and the pretest sample rate X as a particle in the particle swarm optimization algorithm, wherein the particles are initialized uniformly at random over the whole search space;
S6-2: updating the velocity and position of each particle in each dimension, the whole swarm being updated at each time step, wherein the vector equations for the velocity and position of the particles are shown in equations (5) and (6):
v_id = w × v_id + c1 × rand() × (p_id - x_id) + c2 × rand() × (p_gd - x_id)    (5)
x_id = x_id + v_id    (6)
where w is the inertia weight, usually in the range 0.4 to 1.2; rand() is a random function generating random numbers in the range [0, 1]; and c1 and c2 are acceleration constants representing the particle's perception of the population.
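The whole of Step 6, with equations (5) and (6) driving a search over the hyper-parameters (K, X) and classification accuracy as the fitness value, can be sketched generically. The function below is a plain textbook PSO, matching the text's statement that the particle swarm algorithm itself is unmodified; in the patent's setting, `fitness` would run Step 2 through Step 5 with the candidate parameters and return the accuracy, but any callable works, and all names and default constants here are assumptions:

```python
import random

def pso_optimize(fitness, bounds, n_particles=10, iters=20,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize `fitness` over a box `bounds` with standard PSO.
    Velocity update follows eq. (5); position update follows eq. (6),
    clamped to the search bounds."""
    rnd = random.Random(seed)
    dim = len(bounds)
    pos = [[rnd.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # personal best positions p_id
    pbest_f = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]       # global best position p_gd
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rnd.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rnd.random() * (gbest[d] - pos[i][d]))  # eq. (5)
                pos[i][d] = min(max(pos[i][d] + vel[i][d],
                                    bounds[d][0]), bounds[d][1])            # eq. (6)
            f = fitness(pos[i])
            if f > pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f > gbest_f:
                    gbest, gbest_f = pos[i][:], f
    return gbest, gbest_f
```

For the patent's two-dimensional case, `bounds` would be the admissible ranges of K and X, with K rounded to an integer inside `fitness`.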
In the invention, the particle swarm algorithm is a conventional algorithm, and formulas (5) and (6) are the general particle swarm formulas; the particle swarm algorithm itself is not improved. Only the random forest algorithm is improved, and the conventional particle swarm algorithm is used to optimize its parameters.
The simulation experiment adopts the common evaluation indices accuracy and precision to measure the performance of the detection mechanism. Accuracy and precision are defined as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)
Here, TP (true positive) means the sample is a fault-type sample and the fault detection model also classifies it as the fault type; TN (true negative) means the sample is a normal-type sample and the model also classifies it as the normal type; FP (false positive) means the sample is a normal-type sample but the model wrongly classifies it as the fault type; and FN (false negative) means the sample is a fault-type sample but the model wrongly classifies it as the normal type. The accuracy is the proportion of correctly classified samples among all samples; the precision is the proportion of samples correctly classified as the fault type among all samples classified as the fault type.
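The two evaluation indices reduce to simple counts; a direct transcription of the definitions above:

```python
def accuracy_precision(tp, tn, fp, fn):
    """Accuracy: correctly classified samples over all samples.
    Precision: samples correctly flagged as the fault type over all
    samples flagged as the fault type."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    return accuracy, precision
```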
The classification accuracy and precision of the three algorithms are shown in FIGS. 1 and 2, and FIGS. 3 and 4 plot the accuracy obtained after each PSO iteration when optimizing the DAWRF algorithm. The simulation results show that the accuracy of the DAWRF algorithm drops when it is not optimized by the PSO, because the randomly selected hyper-parameters are unreasonable; after iterative optimization by the PSO algorithm, the accuracy of the algorithm reaches a peak and the optimal hyper-parameters can be found. Therefore, the PSO-optimized DAWRF algorithm performs better in both accuracy and precision and has obvious advantages over the traditional random forest algorithm.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (2)

1. A double-accuracy weighted random forest algorithm based on particle swarm optimization, characterized by comprising the following steps:
Step 1: determining an original data set D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i ∈ R^n is the input instance, n is the total number of features, y_i ∈ {Y_1, Y_2, ..., Y_N}, i = 1, 2, ..., N, and N is the sample size; randomly determining the number K of decision trees and the feature number m of the decision trees, wherein m ≤ n; and determining the pretest sample rate X, which is the ratio of the size of the pretest data set to the total size of the data set;
Step 2: dividing the original data set D according to the pretest sample rate X to generate the pretest data set P_k and the training data set S_k corresponding to the k-th decision tree, and sampling the training data set S_k by the Bootstrap sampling method to obtain the out-of-bag data O_k and the training subset T_k;
Step 3: randomly selecting m feature attributes from the n features as node classification features, and generating the k-th decision tree from the training data T_k according to the C4.5 algorithm; then testing the O_k and P_k data sets with this decision tree, calculating the weights w_Ok and w_Pk of the decision tree according to formulas (1) and (2), and calculating the final weight w_k of the decision tree according to formula (3):
[Formulas (1)-(3), defining w_Ok, w_Pk, and the final weight w_k, appear only as images in the original publication and could not be extracted.]
Step 4: repeating Step 2 and Step 3 until the number of decision trees reaches K, obtaining the decision tree set and the weight of each decision tree;
Step 5: testing the test data with the decision tree set, classifying according to formula (4), and obtaining the accuracy of the random forest algorithm:
[Formula (4), the weighted classification rule, appears only as an image in the original publication and could not be extracted.]
Step 6: taking the accuracy obtained in Step 5 as the fitness value of the particle swarm, performing iterative optimization of the hyper-parameters, namely the number K of decision trees and the pretest sample rate X, with a particle swarm optimization algorithm, comparing against the historical classification accuracy, and finally selecting the optimal model parameters.
2. The particle swarm optimization-based double-accuracy weighted random forest algorithm according to claim 1, wherein the particle swarm optimization algorithm in Step6 comprises the following specific steps:
S6-1: forming a space vector (K, X) from the number K of decision trees and the pretest sample rate X as a particle in the particle swarm optimization algorithm, wherein the particles are initialized uniformly at random over the whole search space;
S6-2: updating the velocity and position of each particle in each dimension, the whole swarm being updated at each time step, wherein the vector equations for the velocity and position of the particles are shown in equations (5) and (6):
v_id = w × v_id + c1 × rand() × (p_id - x_id) + c2 × rand() × (p_gd - x_id)    (5)
x_id = x_id + v_id    (6)
where w is the inertia weight, usually in the range 0.4 to 1.2; rand() is a random function generating random numbers in the range [0, 1]; and c1 and c2 are acceleration constants representing the particle's perception of the population.
CN202010223029.1A 2020-03-26 2020-03-26 Double-accuracy weighted random forest algorithm based on particle swarm optimization Pending CN111428790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010223029.1A CN111428790A (en) 2020-03-26 2020-03-26 Double-accuracy weighted random forest algorithm based on particle swarm optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010223029.1A CN111428790A (en) 2020-03-26 2020-03-26 Double-accuracy weighted random forest algorithm based on particle swarm optimization

Publications (1)

Publication Number Publication Date
CN111428790A true CN111428790A (en) 2020-07-17

Family

ID=71548815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010223029.1A Pending CN111428790A (en) 2020-03-26 2020-03-26 Double-accuracy weighted random forest algorithm based on particle swarm optimization

Country Status (1)

Country Link
CN (1) CN111428790A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182221A (en) * 2020-10-12 2021-01-05 哈尔滨工程大学 Knowledge retrieval optimization method based on improved random forest
CN112182221B (en) * 2020-10-12 2022-04-05 哈尔滨工程大学 Knowledge retrieval optimization method based on improved random forest
CN112330064A (en) * 2020-11-26 2021-02-05 中国石油大学(华东) New drilling workload prediction method based on ensemble learning
CN116561554A (en) * 2023-04-18 2023-08-08 南方电网电力科技股份有限公司 Feature extraction method, system, equipment and medium of boiler soot blower

Similar Documents

Publication Publication Date Title
CN111860638B (en) Parallel intrusion detection method and system based on unbalanced data deep belief network
CN111428790A (en) Double-accuracy weighted random forest algorithm based on particle swarm optimization
CN108319987B (en) Filtering-packaging type combined flow characteristic selection method based on support vector machine
CN112288191B (en) Ocean buoy service life prediction method based on multi-class machine learning method
CN107292350A (en) The method for detecting abnormality of large-scale data
CN109934269B (en) Open set identification method and device for electromagnetic signals
CN110135167B (en) Edge computing terminal security level evaluation method for random forest
CN109816044A (en) A kind of uneven learning method based on WGAN-GP and over-sampling
CN112087447B (en) Rare attack-oriented network intrusion detection method
CN110363230B (en) Stacking integrated sewage treatment fault diagnosis method based on weighted base classifier
Al Iqbal et al. Knowledge based decision tree construction with feature importance domain knowledge
CN112784031B (en) Method and system for classifying customer service conversation texts based on small sample learning
CN115580445A (en) Unknown attack intrusion detection method, device and computer readable storage medium
CN113674862A (en) Acute renal function injury onset prediction method based on machine learning
CN110177112B (en) Network intrusion detection method based on double subspace sampling and confidence offset
CN113343123A (en) Training method and detection method for generating confrontation multiple relation graph network
CN115659323A (en) Intrusion detection method based on information entropy theory and convolution neural network
CN114548306A (en) Intelligent monitoring method for early drilling overflow based on misclassification cost
Sudharson et al. Performance Evaluation of Improved Adaboost Framework in Randomized Phases Through Stumps
CN114120049A (en) Long tail distribution visual identification method based on prototype classifier learning
Li et al. Study on the Prediction of Imbalanced Bank Customer Churn Based on Generative Adversarial Network
Zhang et al. Unbalanced data classification based on oversampling and integrated learning
CN113610148B (en) Fault diagnosis method based on bias weighted AdaBoost
Wu et al. Unbalanced data classification algorithm based on hybrid sampling and ensemble learning
Budiman et al. Optimization Of Classification Results By Minimizing Class Imbalance On Decision Tree Algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200717