CN111626418A - Integrated pruning strategy based on balanced binary tree - Google Patents

Info

Publication number
CN111626418A
CN111626418A (application CN202010458446.4A)
Authority
CN
China
Prior art keywords
binary tree
balanced binary
pool
integration
precision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010458446.4A
Other languages
Chinese (zh)
Other versions
CN111626418B (en)
Inventor
邓晓衡
蔚永
黑聪
刘梦杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202010458446.4A priority Critical patent/CN111626418B/en
Publication of CN111626418A publication Critical patent/CN111626418A/en
Application granted granted Critical
Publication of CN111626418B publication Critical patent/CN111626418B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Abstract

The invention provides an integrated pruning strategy based on a balanced binary tree, comprising the following steps: S1, initializing a base classifier integration pool: segmenting the large data set to form a plurality of sub data sets, and training and testing on each sub data set to form an initial complete classifier pool; S2, constructing a balanced binary tree to form a final sub-integration; and S3, predicting and classifying new data samples by using the retained optimal sub-integration. The invention solves the technical problems that the overfitting phenomenon is easily produced, that base classifiers with too high or too low test precision are difficult to remove, and that generalization performance is not high.

Description

Integrated pruning strategy based on balanced binary tree
Technical Field
The invention relates to the technical field of ensemble learning, in particular to an ensemble pruning strategy based on a balanced binary tree.
Background
Ensemble learning solves many of the problems a single classifier faces when training and learning on massive data. However, because ensemble learning completes a prediction or classification task by pooling many single classifiers into an integration pool, it places high demands on computer hardware resources. The common remedy is an ensemble pruning strategy: reduce the number of single classifiers used as far as possible while ensuring that the final prediction or classification precision of the ensemble does not fall, and may even improve.
Existing ensemble pruning strategies fall into several categories. The clustering-based strategy treats the test precision of each base classifier in the integration pool as a data point, completes a clustering task, and selects as the final integration the subset of base classifiers corresponding to the cluster containing most data points. The optimization-based strategy formulates the test results of all base classifiers in the pool as an optimization problem and searches for the optimal sub-integration. The reinforcement-learning-based strategy searches for the optimal sub-integration through repeated trials of a reinforcement algorithm. The sequence-based strategy obtains the optimal sub-integration by ranking the precision of all base classifiers.
Traditional sequence-based integrated pruning can lead to overfitting. The present strategy improves upon it by exploiting the properties of a balanced binary tree to remove from the integration pool the base classifiers whose test precision is too high or too low, finally retaining the base classifiers with better generalization performance as the final sub-integration.
Disclosure of Invention
The invention provides an integrated pruning strategy based on a balanced binary tree, and aims to solve the problems, identified in the background art, that the overfitting phenomenon is easily produced, that base classifiers with too high or too low test precision are difficult to eliminate, and that generalization performance is not high.
In order to achieve the above object, an integrated pruning strategy based on a balanced binary tree according to an embodiment of the present invention is characterized by comprising the following steps:
s1, initializing a base classifier integration pool: segmenting the large data set to form a plurality of sub data sets, and training and testing each sub data set to form an initial complete classifier pool;
s2, constructing a balanced binary tree to form a final sub-integration: constructing a balanced binary tree according to the precision of the base classifiers in the base classifier pool, wherein each node on the balanced binary tree represents the training precision of each base classifier in the integration pool, eliminating partial leaf nodes of the left lower branch and the right lower branch of the balanced binary tree by setting a boundary pruning function, and reserving partial nodes of the middle trunk to form final sub-integration;
and S3, predicting and classifying the new data sample by using the retained optimal sub-integration.
In S1, an artificial neural network (ANN) is used as the base classifier to complete the training and testing work and to obtain the initial base classifier pool together with the training precision of each base classifier.
In S2, the precision of the base classifier represented by the root node ranks in the middle of the precisions of all base classifiers in the integration pool; the base classifiers represented by the leaf nodes of the lower left branch rank last in the precision ordering of the pool, and those represented by the leaf nodes of the lower right branch rank first.
In S2, the number of left and right branch leaf nodes of the root node is counted, a pruning threshold is set, and node elimination is performed according to the pruning threshold.
The average value of the nodes of the left branch and the right branch of the balanced binary tree is respectively used as a left pruning threshold and a right pruning threshold, and the left branch pruning threshold and the right branch pruning threshold are respectively as follows:
$$T_{\mathrm{left}} = \frac{1}{|N_L|}\sum_{i \in N_L} p_i \qquad\qquad T_{\mathrm{right}} = \frac{1}{|N_R|}\sum_{i \in N_R} p_i$$

where $N_L$ and $N_R$ denote the node sets of the left and right branches of the balanced binary tree and $p_i$ is the precision represented by node $i$.
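As a concrete illustration (not part of the patent text), the two thresholds are plain arithmetic means over the precisions held in each branch; all names below are illustrative:

```python
def pruning_thresholds(left_accs, right_accs):
    """Arithmetic mean of the precisions stored in each branch of the
    balanced binary tree serves as that branch's pruning threshold."""
    t_left = sum(left_accs) / len(left_accs)     # threshold for the low-precision branch
    t_right = sum(right_accs) / len(right_accs)  # threshold for the high-precision branch
    return t_left, t_right

# Toy precisions: the left branch holds the lower half of the pool,
# the right branch the upper half.
t_l, t_r = pruning_thresholds([0.61, 0.65, 0.70], [0.80, 0.84, 0.88])
# Left-branch nodes below t_l and right-branch nodes above t_r are pruned.
```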
In S1, the large data set comprises a data set collected under the normal condition and data sets collected under abnormal conditions.
Wherein the ratio of the number of samples in the normal-condition data set to that in the abnormal-condition data sets ranges from 100:1 to 1000:1.
the integrated pruning strategy based on the balanced binary tree according to claim 6, wherein the splitting in S1 specifically comprises: and segmenting the data set under the normal condition to obtain subdata sets under the normal condition, and merging the subdata sets under each normal operation condition with the data sets under other abnormal conditions to form data sets under various operation conditions.
The scheme of the invention has the following beneficial effects:
the method has the advantages that the balanced binary tree is used for removing the partial base classifiers with too poor generalization capability and too poor precision in the integrated pool, in the removing process, the integral base classification pool is sequentially constructed, the base classifiers with too good training precision can be deleted conveniently, the partial base classifiers are often classifiers with too poor generalization capability, and on the other hand, the classifiers with poor training precision are deleted, so that the integral integrated pool is ensured to have higher precision finally. Therefore, the invention considers the whole integrated pool, and further realizes the elimination work of the bad base classifier.
Constructing the base classifiers of the integration pool into a balanced binary tree by precision gives the tree an ordered structure, so that trimming the lower left and lower right branches removes both the classifiers whose precision is too high and those whose precision is too low. Removing the former avoids overfitting; removing the latter raises the overall integration precision, so that the resulting subset achieves the best generalization performance. In addition, shrinking the integration pool reduces the algorithm model's dependence on computer hardware resources as far as possible.
Drawings
FIG. 1 is a flow chart of the integrated pruning strategy based on a balanced binary tree of the present invention;
Fig. 2 shows the test precision of the final sub-integration obtained under each of four different pruning strategies used to eliminate the poor base classifiers from the base classifier pool.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
Aiming at the existing problems, the invention provides an integrated pruning strategy based on a balanced binary tree, which comprises the following steps:
s1, initializing a base classifier integration pool: segmenting the large data set to form a plurality of sub data sets, and training and testing each sub data set to form an initial complete classifier pool; the big data comprises a data set under a normal condition and a data set under an abnormal condition. The number ratio of the data sets in the normal case to the data sets in the abnormal case ranges from 100: 1-1000: 1.
The segmentation work specifically comprises: segmenting the normal-condition data set to obtain normal-condition sub data sets, and merging each normal-condition sub data set with the data sets of the other abnormal conditions to form data sets covering all operating conditions.
The training and testing work specifically uses an artificial neural network (ANN) as the base classifier to complete training and testing and to obtain the initial base classifier pool together with the training precision of each base classifier.
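To illustrate S1, the sketch below initializes a pool by training one classifier per sub data set and recording its test precision. A trivial one-feature threshold rule stands in for the patent's ANN purely to keep the example self-contained; all names are illustrative:

```python
class ThresholdClassifier:
    """Illustrative stand-in for the ANN base classifier of the patent:
    a one-feature rule whose cut point is fitted midway between the class means."""
    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        self.cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, x):
        return 1 if x > self.cut else 0

    def accuracy(self, xs, ys):
        return sum(self.predict(x) == y for x, y in zip(xs, ys)) / len(ys)


def init_pool(sub_datasets, test_xs, test_ys):
    """S1: train one base classifier per sub data set and record its
    test precision, yielding the initial (classifier, precision) pool."""
    pool = []
    for xs, ys in sub_datasets:
        clf = ThresholdClassifier().fit(xs, ys)
        pool.append((clf, clf.accuracy(test_xs, test_ys)))
    return pool
```

The recorded precisions are exactly the node values used to build the balanced binary tree in S2.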
S2, constructing a balanced binary tree to form the final sub-integration: a balanced binary tree is constructed according to the precision of the base classifiers in the base classifier pool, each node representing the training precision of one base classifier in the integration pool. The precision of the base classifier represented by the root node ranks in the middle of the precisions of all base classifiers in the pool; the leaf nodes of the lower left branch represent the base classifiers ranking last in precision, and the leaf nodes of the lower right branch represent those ranking first. The numbers of leaf nodes on the left and right branches of the root are counted, pruning thresholds are set, and nodes are eliminated accordingly: by means of a boundary pruning function, part of the leaf nodes of the lower left and lower right branches are removed, and the nodes of the middle trunk are retained to form the final sub-integration. The average node values of the left and right branches serve as the left and right pruning thresholds, respectively:
$$T_{\mathrm{left}} = \frac{1}{|N_L|}\sum_{i \in N_L} p_i \qquad\qquad T_{\mathrm{right}} = \frac{1}{|N_R|}\sum_{i \in N_R} p_i$$

where $N_L$ and $N_R$ denote the node sets of the left and right branches of the balanced binary tree and $p_i$ is the precision represented by node $i$.
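A minimal sketch of S2 under the description above: build a balanced binary search tree from the sorted precisions (the median becomes the root, so the lower left branch holds the lowest precisions and the lower right branch the highest), then apply boundary pruning with the branch-mean thresholds. The nested-tuple representation and function names are illustrative, not from the patent:

```python
def build_balanced(precisions):
    """Build a balanced binary search tree (nested (value, left, right) tuples)
    from the base-classifier precisions; the median precision becomes the root."""
    precisions = sorted(precisions)
    if not precisions:
        return None
    mid = len(precisions) // 2
    return (precisions[mid],
            build_balanced(precisions[:mid]),
            build_balanced(precisions[mid + 1:]))

def branch_values(node, out=None):
    """Collect every precision stored in a subtree."""
    if out is None:
        out = []
    if node is not None:
        value, left, right = node
        out.append(value)
        branch_values(left, out)
        branch_values(right, out)
    return out

def boundary_prune(precisions):
    """S2: prune left-branch nodes below the left-branch mean and right-branch
    nodes above the right-branch mean; the middle trunk forms the sub-integration.
    Assumes the pool is large enough that both branches are non-empty."""
    root, left, right = build_balanced(precisions)
    lvals, rvals = branch_values(left), branch_values(right)
    t_left = sum(lvals) / len(lvals)
    t_right = sum(rvals) / len(rvals)
    kept = ([v for v in lvals if v >= t_left]
            + [root]
            + [v for v in rvals if v <= t_right])
    return sorted(kept)
```

For example, with precisions [0.5, 0.6, 0.7, 0.8, 0.9] the root is 0.7, the left threshold is 0.55 and the right threshold is 0.85, so 0.5 and 0.9 are pruned and [0.6, 0.7, 0.8] is retained.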
and S3, predicting and classifying the new data sample by using the retained optimal sub-integration.
The scheme is applied to fault diagnosis of a natural gas compressor and achieves good results. The selected compressors are equipped with a plurality of sensors, each of which acquires data every 3 seconds; a data set with a sample size in the tens of millions over a given time period is selected, since a large data set suits the proposed model well.
The first step: collect data under the normal operating condition of the compressor and under the 4 common fault conditions, and count the data volume under each operating condition. Each compressor yields 5 data sets in total: Data01 under the normal condition, Data02 under the low-inlet-pressure fault, Data03 under the exhaust-valve-leakage fault, Data04 under the high-recovery-tank-pressure fault, and Data05 under the shaft-swing fault, with sample counts in the ratio Data01 : Data02 : Data03 : Data04 : Data05 = 603 : 3 : 7 : 5 : 2.
The second step: divide the normal-condition data set evenly into 50 parts to obtain 50 normal-condition sub data sets, then merge each of them with the data sets of the other 4 fault conditions to form 50 data sets covering all 5 operating conditions.
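The second step can be sketched as follows; the data are placeholders and the function name is illustrative:

```python
def split_and_merge(normal_data, fault_datasets, n_parts=50):
    """Second step: split the normal-condition data into n_parts equal shares
    and merge each share with every fault-condition data set, producing
    n_parts sub data sets that each cover all operating conditions."""
    share = len(normal_data) // n_parts
    sub_datasets = []
    for i in range(n_parts):
        part = list(normal_data[i * share:(i + 1) * share])
        for fault in fault_datasets:
            part.extend(fault)  # every sub data set sees all fault samples
        sub_datasets.append(part)
    return sub_datasets
```

Because the fault data are far scarcer than the normal data (ratio on the order of 100:1 to 1000:1), repeating the full fault sets in every sub data set rebalances each training set.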
The third step: train an ANN on each of the 50 sub data sets obtained in the second step and obtain the prediction precision of the ANN model on each sub data set.
The fourth step: construct a balanced binary tree containing 50 nodes from the precision data of the ANN models obtained in the third step, each node in the tree representing one ANN model.
The fifth step: set the pruning thresholds, taking the average node values of the left and right branches of the balanced binary tree as the left and right pruning thresholds, respectively. Then traverse the nodes of the left branch and compare each with the left pruning threshold: if a node's value is below the threshold, delete that node from the balanced binary tree. Symmetrically, traverse the nodes of the right branch and delete any node whose value exceeds the right pruning threshold.
The left branch pruning threshold and the right branch pruning threshold are respectively as follows:
$$T_{\mathrm{left}} = \frac{1}{|N_L|}\sum_{i \in N_L} p_i \qquad\qquad T_{\mathrm{right}} = \frac{1}{|N_R|}\sum_{i \in N_R} p_i$$

where $N_L$ and $N_R$ denote the node sets of the left and right branches of the balanced binary tree and $p_i$ is the precision represented by node $i$.
The sixth step: pruning yields a balanced binary tree of smaller scale, that is, a smaller-scale ensemble learning model.
The seventh step: collect a new data set from the compressor, input it into the integrated model pruned via the balanced binary tree, and perform fault classification and diagnosis.
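The seventh step applies the retained sub-integration to new samples. The patent does not specify how the retained classifiers' outputs are combined; the sketch below assumes simple majority voting, with a stub classifier added only to keep the example self-contained:

```python
from collections import Counter

class StubClassifier:
    """Minimal stand-in for one retained base classifier."""
    def __init__(self, label):
        self.label = label

    def predict(self, sample):
        return self.label  # always votes its fixed label

def ensemble_predict(sub_integration, sample):
    """S3 / seventh step: each retained classifier votes on the new sample
    and the majority label is returned (assumed fusion rule)."""
    votes = [clf.predict(sample) for clf in sub_integration]
    return Counter(votes).most_common(1)[0][0]
```

For instance, a sub-integration voting ["normal", "normal", "valve_leak"] classifies the sample as "normal".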
Through unbalanced-data processing and integrated pruning of the data set, the model classifies more accurately while saving a large amount of computer hardware resources, which gives it strong application value in industrial production.
As shown in Fig. 2, four different pruning strategies, based respectively on fuzzy clustering, an optimization problem, a sequence, and the balanced binary tree, are used to eliminate the poor base classifiers from the base classifier pool, and the test precision of the final sub-integration under each strategy is obtained. Compared with the other pruning strategies, the integrated pruning strategy based on the balanced binary tree achieves better precision and has better practical value.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. An integrated pruning strategy based on a balanced binary tree is characterized by comprising the following steps:
s1, initializing a base classifier integration pool: segmenting the large data set to form a plurality of sub data sets, and training and testing each sub data set to form an initial complete classifier pool;
s2, constructing a balanced binary tree to form a final sub-integration: constructing a balanced binary tree according to the precision of the base classifiers in the base classifier pool, wherein each node on the balanced binary tree represents the training precision of each base classifier in the integration pool, eliminating partial leaf nodes of the left lower branch and the right lower branch of the balanced binary tree by setting a boundary pruning function, and reserving partial nodes of the middle trunk to form final sub-integration;
and S3, predicting and classifying the new data sample by using the retained optimal sub-integration.
2. The integrated pruning strategy based on balanced binary tree according to claim 1, wherein in the step S1, an artificial neural network ANN is used as a base classifier to complete training and testing work, and obtain an initial base classifier pool and the training precision of each base classifier.
3. The integrated pruning strategy based on the balanced binary tree according to claim 1, wherein in S2, the precision of the base classifier represented by the root node ranks in the middle of the precisions of the base classifiers in the whole integration pool, the base classifiers represented by the leaf nodes of the lower left branch rank last in the precision ordering of the integration pool, and the base classifiers represented by the leaf nodes of the lower right branch rank first.
4. The integrated pruning strategy based on the balanced binary tree as claimed in claim 3, wherein in the step S2, the number of left and right branch leaf nodes of the root node is counted, a pruning threshold is set, and node elimination is performed according to the pruning threshold.
5. The balanced binary tree-based integrated pruning strategy according to claim 4, characterized in that the average values of the nodes of the left branch and the right branch of the balanced binary tree are respectively used as the left pruning threshold and the right pruning threshold, and the left pruning threshold and the right pruning threshold are respectively as follows:
$$T_{\mathrm{left}} = \frac{1}{|N_L|}\sum_{i \in N_L} p_i \qquad\qquad T_{\mathrm{right}} = \frac{1}{|N_R|}\sum_{i \in N_R} p_i$$

where $N_L$ and $N_R$ denote the node sets of the left and right branches of the balanced binary tree and $p_i$ is the precision represented by node $i$.
6. The balanced binary tree-based integrated pruning strategy according to claim 1, wherein in S1, the large data set comprises a data set under a normal condition and data sets under abnormal conditions.
7. The balanced binary tree based integrated pruning strategy according to claim 6, wherein the ratio of the number of samples in the normal-condition data set to that in the abnormal-condition data sets ranges from 100:1 to 1000:1.
8. The integrated pruning strategy based on the balanced binary tree according to claim 6, wherein the splitting in S1 specifically comprises: segmenting the normal-condition data set to obtain normal-condition sub data sets, and merging each normal-condition sub data set with the data sets of the other abnormal conditions to form data sets covering all operating conditions.
CN202010458446.4A 2020-05-27 2020-05-27 Compressor fault classification method based on balanced binary tree integrated pruning strategy Active CN111626418B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010458446.4A CN111626418B (en) 2020-05-27 2020-05-27 Compressor fault classification method based on balanced binary tree integrated pruning strategy

Publications (2)

Publication Number Publication Date
CN111626418A 2020-09-04
CN111626418B 2022-04-22

Family

ID=72273131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010458446.4A Active CN111626418B (en) 2020-05-27 2020-05-27 Compressor fault classification method based on balanced binary tree integrated pruning strategy

Country Status (1)

Country Link
CN (1) CN111626418B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050050072A1 (en) * 2003-09-03 2005-03-03 Lucent Technologies, Inc. Highly parallel tree search architecture for multi-user detection
US20090324132A1 (en) * 2008-06-25 2009-12-31 Microsoft Corporation Fast approximate spatial representations for informal retrieval
CN106033617A (en) * 2015-03-16 2016-10-19 广州四三九九信息科技有限公司 Method for performing game picture intelligent compression by combining with visualization tool
CN109033632A (en) * 2018-07-26 2018-12-18 北京航空航天大学 A kind of trend forecasting method based on depth quantum nerve network
CN110569867A (en) * 2019-07-15 2019-12-13 山东电工电气集团有限公司 Decision tree algorithm-based power transmission line fault reason distinguishing method, medium and equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
RAJEEV RASTOGI 等: "PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning", 《DATA MINING AND KNOWLEDGE DISCOVERY》 *
XIAOHENG DENG 等: "An imbalanced data classification method based on automatic clustering under-sampling", 《2016 IEEE 35TH INTERNATIONAL PERFORMANCE COMPUTING AND COMMUNICATIONS CONFERENCE (IPCCC)》 *
孙卫祥 (SUN Weixiang): "Research on Fault Diagnosis Methods Based on Data Mining and Information Fusion" [基于数据挖掘与信息融合的故障诊断方法研究], Wanfang Data (《万方数据》) *

Also Published As

Publication number Publication date
CN111626418B (en) 2022-04-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant