CN111626418A - Integrated pruning strategy based on balanced binary tree - Google Patents

Info

Publication number
CN111626418A
CN111626418A (application CN202010458446.4A)
Authority
CN
China
Prior art keywords
binary tree
balanced binary
pool
integration
precision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010458446.4A
Other languages
Chinese (zh)
Other versions
CN111626418B (en)
Inventor
邓晓衡
蔚永
黑聪
刘梦杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202010458446.4A priority Critical patent/CN111626418B/en
Publication of CN111626418A publication Critical patent/CN111626418A/en
Application granted granted Critical
Publication of CN111626418B publication Critical patent/CN111626418B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Abstract

The invention provides an integrated pruning strategy based on a balanced binary tree, comprising the following steps: S1, initializing a base classifier integration pool: segmenting the large data set to form a plurality of sub data sets, and training and testing on each sub data set to form an initial complete classifier pool; S2, constructing a balanced binary tree to form a final sub-integration; and S3, predicting and classifying new data samples by using the retained optimal sub-integration. The invention solves the technical problems that the overfitting phenomenon is easily produced, that base classifiers with too high or too low test precision are difficult to remove, and that generalization performance is not high.

Description

Integrated pruning strategy based on balanced binary tree
Technical Field
The invention relates to the technical field of ensemble learning, in particular to an ensemble pruning strategy based on a balanced binary tree.
Background
Ensemble learning solves many of the problems a single classifier faces when training and learning on massive data. However, because ensemble learning completes a prediction or classification task by pooling many single classifiers into an integration pool, it places high demands on computer hardware resources. The common remedy is an ensemble pruning strategy: reduce the number of single classifiers used as far as possible while ensuring that the final prediction or classification precision of the ensemble does not fall, and may even improve.
Existing ensemble pruning strategies fall into several categories. The clustering-based strategy treats the test precision of each base classifier in the integration pool as a data point, completes a clustering task, and selects as the final integration the subset of base classifiers corresponding to the cluster containing most data points. The optimization-based strategy formulates the test results of all base classifiers in the pool as an optimization problem and searches for the optimal sub-integration. The reinforcement-learning-based strategy searches for the optimal sub-integration through repeated trials of a reinforcement algorithm. The sequence-based strategy obtains the optimal sub-integration by ranking the precision of all base classifiers.
Traditional sequence-based integrated pruning can lead to overfitting. The present strategy improves upon it by exploiting the properties of a balanced binary tree to remove from the integration pool the base classifiers whose test precision is too high or too low, finally retaining the base classifiers with better generalization performance as the final sub-integration.
Disclosure of Invention
The invention provides an integrated pruning strategy based on a balanced binary tree, and aims to solve the problems, identified in the background art, that the overfitting phenomenon is easily produced, that base classifiers with too high or too low test precision are difficult to eliminate, and that generalization performance is not high.
In order to achieve the above object, an integrated pruning strategy based on a balanced binary tree according to an embodiment of the present invention is characterized by comprising the following steps:
s1, initializing a base classifier integration pool: segmenting the large data set to form a plurality of sub data sets, and training and testing each sub data set to form an initial complete classifier pool;
s2, constructing a balanced binary tree to form a final sub-integration: constructing a balanced binary tree according to the precision of the base classifiers in the base classifier pool, wherein each node on the balanced binary tree represents the training precision of each base classifier in the integration pool, eliminating partial leaf nodes of the left lower branch and the right lower branch of the balanced binary tree by setting a boundary pruning function, and reserving partial nodes of the middle trunk to form final sub-integration;
and S3, predicting and classifying the new data sample by using the retained optimal sub-integration.
In S1, an artificial neural network (ANN) is used as the base classifier to complete the training and testing work and to obtain the initial base classifier pool together with the training precision of each base classifier.
In S2, the precision of the base classifier represented by the root node ranks in the middle of the precisions of all base classifiers in the integration pool; the base classifiers represented by the leaf nodes of the lower left branch rank last in the precision ordering of the pool, and those represented by the leaf nodes of the lower right branch rank first.
In S2, the number of left and right branch leaf nodes of the root node is counted, a pruning threshold is set, and node elimination is performed according to the pruning threshold.
The average value of the nodes of the left branch and the right branch of the balanced binary tree is respectively used as a left pruning threshold and a right pruning threshold, and the left branch pruning threshold and the right branch pruning threshold are respectively as follows:
$$T_{\mathrm{left}} = \frac{1}{|N_L|}\sum_{i \in N_L} p_i \qquad\qquad T_{\mathrm{right}} = \frac{1}{|N_R|}\sum_{i \in N_R} p_i$$

where $N_L$ and $N_R$ denote the node sets of the left and right branches of the balanced binary tree and $p_i$ is the precision represented by node $i$.
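As a concrete illustration (not part of the patent text), the two thresholds are plain arithmetic means over the precisions held in each branch; all names below are illustrative:

```python
def pruning_thresholds(left_accs, right_accs):
    """Arithmetic mean of the precisions stored in each branch of the
    balanced binary tree serves as that branch's pruning threshold."""
    t_left = sum(left_accs) / len(left_accs)     # threshold for the low-precision branch
    t_right = sum(right_accs) / len(right_accs)  # threshold for the high-precision branch
    return t_left, t_right

# Toy precisions: the left branch holds the lower half of the pool,
# the right branch the upper half.
t_l, t_r = pruning_thresholds([0.61, 0.65, 0.70], [0.80, 0.84, 0.88])
# Left-branch nodes below t_l and right-branch nodes above t_r are pruned.
```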
In S1, the large data set comprises a data set collected under the normal condition and data sets collected under abnormal conditions.
Wherein the ratio of the number of samples in the normal-condition data set to that in the abnormal-condition data sets ranges from 100:1 to 1000:1.
the integrated pruning strategy based on the balanced binary tree according to claim 6, wherein the splitting in S1 specifically comprises: and segmenting the data set under the normal condition to obtain subdata sets under the normal condition, and merging the subdata sets under each normal operation condition with the data sets under other abnormal conditions to form data sets under various operation conditions.
The scheme of the invention has the following beneficial effects:
the method has the advantages that the balanced binary tree is used for removing the partial base classifiers with too poor generalization capability and too poor precision in the integrated pool, in the removing process, the integral base classification pool is sequentially constructed, the base classifiers with too good training precision can be deleted conveniently, the partial base classifiers are often classifiers with too poor generalization capability, and on the other hand, the classifiers with poor training precision are deleted, so that the integral integrated pool is ensured to have higher precision finally. Therefore, the invention considers the whole integrated pool, and further realizes the elimination work of the bad base classifier.
Constructing the base classifiers of the integration pool into a balanced binary tree by precision gives the tree an ordered structure, so that trimming the lower left and lower right branches removes both the classifiers whose precision is too high and those whose precision is too low. Removing the former avoids overfitting; removing the latter raises the overall integration precision, so that the resulting subset achieves the best generalization performance. In addition, shrinking the integration pool reduces the algorithm model's dependence on computer hardware resources as far as possible.
Drawings
FIG. 1 is a flow chart of the integrated pruning strategy based on a balanced binary tree of the present invention;
Fig. 2 shows the test precision of the final sub-integration obtained under each of four different pruning strategies used to eliminate the poor base classifiers from the base classifier pool.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
Aiming at the existing problems, the invention provides an integrated pruning strategy based on a balanced binary tree, which comprises the following steps:
s1, initializing a base classifier integration pool: segmenting the large data set to form a plurality of sub data sets, and training and testing each sub data set to form an initial complete classifier pool; the big data comprises a data set under a normal condition and a data set under an abnormal condition. The number ratio of the data sets in the normal case to the data sets in the abnormal case ranges from 100: 1-1000: 1.
The segmentation work specifically comprises: segmenting the normal-condition data set to obtain normal-condition sub data sets, and merging each normal-condition sub data set with the data sets of the other abnormal conditions to form data sets covering all operating conditions.
The training and testing work specifically uses an artificial neural network (ANN) as the base classifier to complete training and testing and to obtain the initial base classifier pool together with the training precision of each base classifier.
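To illustrate S1, the sketch below initializes a pool by training one classifier per sub data set and recording its test precision. A trivial one-feature threshold rule stands in for the patent's ANN purely to keep the example self-contained; all names are illustrative:

```python
class ThresholdClassifier:
    """Illustrative stand-in for the ANN base classifier of the patent:
    a one-feature rule whose cut point is fitted midway between the class means."""
    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        self.cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, x):
        return 1 if x > self.cut else 0

    def accuracy(self, xs, ys):
        return sum(self.predict(x) == y for x, y in zip(xs, ys)) / len(ys)


def init_pool(sub_datasets, test_xs, test_ys):
    """S1: train one base classifier per sub data set and record its
    test precision, yielding the initial (classifier, precision) pool."""
    pool = []
    for xs, ys in sub_datasets:
        clf = ThresholdClassifier().fit(xs, ys)
        pool.append((clf, clf.accuracy(test_xs, test_ys)))
    return pool
```

The recorded precisions are exactly the node values used to build the balanced binary tree in S2.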
S2, constructing a balanced binary tree to form the final sub-integration: a balanced binary tree is constructed according to the precision of the base classifiers in the base classifier pool, each node representing the training precision of one base classifier in the integration pool. The precision of the base classifier represented by the root node ranks in the middle of the precisions of all base classifiers in the pool; the leaf nodes of the lower left branch represent the base classifiers ranking last in precision, and the leaf nodes of the lower right branch represent those ranking first. The numbers of leaf nodes on the left and right branches of the root are counted, pruning thresholds are set, and nodes are eliminated accordingly: by means of a boundary pruning function, part of the leaf nodes of the lower left and lower right branches are removed, and the nodes of the middle trunk are retained to form the final sub-integration. The average node values of the left and right branches serve as the left and right pruning thresholds, respectively:
$$T_{\mathrm{left}} = \frac{1}{|N_L|}\sum_{i \in N_L} p_i \qquad\qquad T_{\mathrm{right}} = \frac{1}{|N_R|}\sum_{i \in N_R} p_i$$

where $N_L$ and $N_R$ denote the node sets of the left and right branches of the balanced binary tree and $p_i$ is the precision represented by node $i$.
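A minimal sketch of S2 under the description above: build a balanced binary search tree from the sorted precisions (the median becomes the root, so the lower left branch holds the lowest precisions and the lower right branch the highest), then apply boundary pruning with the branch-mean thresholds. The nested-tuple representation and function names are illustrative, not from the patent:

```python
def build_balanced(precisions):
    """Build a balanced binary search tree (nested (value, left, right) tuples)
    from the base-classifier precisions; the median precision becomes the root."""
    precisions = sorted(precisions)
    if not precisions:
        return None
    mid = len(precisions) // 2
    return (precisions[mid],
            build_balanced(precisions[:mid]),
            build_balanced(precisions[mid + 1:]))

def branch_values(node, out=None):
    """Collect every precision stored in a subtree."""
    if out is None:
        out = []
    if node is not None:
        value, left, right = node
        out.append(value)
        branch_values(left, out)
        branch_values(right, out)
    return out

def boundary_prune(precisions):
    """S2: prune left-branch nodes below the left-branch mean and right-branch
    nodes above the right-branch mean; the middle trunk forms the sub-integration.
    Assumes the pool is large enough that both branches are non-empty."""
    root, left, right = build_balanced(precisions)
    lvals, rvals = branch_values(left), branch_values(right)
    t_left = sum(lvals) / len(lvals)
    t_right = sum(rvals) / len(rvals)
    kept = ([v for v in lvals if v >= t_left]
            + [root]
            + [v for v in rvals if v <= t_right])
    return sorted(kept)
```

For example, with precisions [0.5, 0.6, 0.7, 0.8, 0.9] the root is 0.7, the left threshold is 0.55 and the right threshold is 0.85, so 0.5 and 0.9 are pruned and [0.6, 0.7, 0.8] is retained.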
and S3, predicting and classifying the new data sample by using the retained optimal sub-integration.
The scheme is applied to fault diagnosis of a natural gas compressor and achieves good results. The selected compressors are equipped with a plurality of sensors, each of which acquires data every 3 seconds; a data set with a sample size in the tens of millions over a given time period is selected, since a large data set suits the proposed model well.
The first step: collect data under the normal operating condition of the compressor and under the 4 common fault conditions, and count the data volume under each operating condition. Each compressor yields 5 data sets in total: Data01 under the normal condition, Data02 under the low-inlet-pressure fault, Data03 under the exhaust-valve-leakage fault, Data04 under the high-recovery-tank-pressure fault, and Data05 under the shaft-swing fault, with sample counts in the ratio Data01 : Data02 : Data03 : Data04 : Data05 = 603 : 3 : 7 : 5 : 2.
The second step: divide the normal-condition data set evenly into 50 parts to obtain 50 normal-condition sub data sets, then merge each of them with the data sets of the other 4 fault conditions to form 50 data sets covering all 5 operating conditions.
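The second step can be sketched as follows; the data are placeholders and the function name is illustrative:

```python
def split_and_merge(normal_data, fault_datasets, n_parts=50):
    """Second step: split the normal-condition data into n_parts equal shares
    and merge each share with every fault-condition data set, producing
    n_parts sub data sets that each cover all operating conditions."""
    share = len(normal_data) // n_parts
    sub_datasets = []
    for i in range(n_parts):
        part = list(normal_data[i * share:(i + 1) * share])
        for fault in fault_datasets:
            part.extend(fault)  # every sub data set sees all fault samples
        sub_datasets.append(part)
    return sub_datasets
```

Because the fault data are far scarcer than the normal data (ratio on the order of 100:1 to 1000:1), repeating the full fault sets in every sub data set rebalances each training set.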
The third step: train an ANN on each of the 50 sub data sets obtained in the second step and obtain the prediction precision of the ANN model on each sub data set.
The fourth step: construct a balanced binary tree containing 50 nodes from the precision data of the ANN models obtained in the third step, each node in the tree representing one ANN model.
The fifth step: set the pruning thresholds, taking the average node values of the left and right branches of the balanced binary tree as the left and right pruning thresholds, respectively. Then traverse the nodes of the left branch and compare each with the left pruning threshold: if a node's value is below the threshold, delete that node from the balanced binary tree. Symmetrically, traverse the nodes of the right branch and delete any node whose value exceeds the right pruning threshold.
The left branch pruning threshold and the right branch pruning threshold are respectively as follows:
$$T_{\mathrm{left}} = \frac{1}{|N_L|}\sum_{i \in N_L} p_i \qquad\qquad T_{\mathrm{right}} = \frac{1}{|N_R|}\sum_{i \in N_R} p_i$$

where $N_L$ and $N_R$ denote the node sets of the left and right branches of the balanced binary tree and $p_i$ is the precision represented by node $i$.
The sixth step: pruning yields a balanced binary tree of smaller scale, that is, a smaller-scale ensemble learning model.
The seventh step: collect a new data set from the compressor, input it into the integrated model pruned via the balanced binary tree, and perform fault classification and diagnosis.
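The seventh step applies the retained sub-integration to new samples. The patent does not specify how the retained classifiers' outputs are combined; the sketch below assumes simple majority voting, with a stub classifier added only to keep the example self-contained:

```python
from collections import Counter

class StubClassifier:
    """Minimal stand-in for one retained base classifier."""
    def __init__(self, label):
        self.label = label

    def predict(self, sample):
        return self.label  # always votes its fixed label

def ensemble_predict(sub_integration, sample):
    """S3 / seventh step: each retained classifier votes on the new sample
    and the majority label is returned (assumed fusion rule)."""
    votes = [clf.predict(sample) for clf in sub_integration]
    return Counter(votes).most_common(1)[0][0]
```

For instance, a sub-integration voting ["normal", "normal", "valve_leak"] classifies the sample as "normal".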
Through unbalanced-data processing and integrated pruning of the data set, the model classifies more accurately while saving a large amount of computer hardware resources, which gives it strong application value in industrial production.
As shown in Fig. 2, four different pruning strategies, based respectively on fuzzy clustering, an optimization problem, a sequence, and the balanced binary tree, are used to eliminate the poor base classifiers from the base classifier pool, and the test precision of the final sub-integration under each strategy is obtained. Compared with the other pruning strategies, the integrated pruning strategy based on the balanced binary tree achieves better precision and has better practical value.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. An integrated pruning strategy based on a balanced binary tree is characterized by comprising the following steps:
s1, initializing a base classifier integration pool: segmenting the large data set to form a plurality of sub data sets, and training and testing each sub data set to form an initial complete classifier pool;
s2, constructing a balanced binary tree to form a final sub-integration: constructing a balanced binary tree according to the precision of the base classifiers in the base classifier pool, wherein each node on the balanced binary tree represents the training precision of each base classifier in the integration pool, eliminating partial leaf nodes of the left lower branch and the right lower branch of the balanced binary tree by setting a boundary pruning function, and reserving partial nodes of the middle trunk to form final sub-integration;
and S3, predicting and classifying the new data sample by using the retained optimal sub-integration.
2. The integrated pruning strategy based on balanced binary tree according to claim 1, wherein in the step S1, an artificial neural network ANN is used as a base classifier to complete training and testing work, and obtain an initial base classifier pool and the training precision of each base classifier.
3. The integrated pruning strategy based on the balanced binary tree according to claim 1, wherein in S2, the precision of the base classifier represented by the root node ranks in the middle of the precisions of the base classifiers in the whole integration pool, the base classifiers represented by the leaf nodes of the lower left branch rank last in the precision ordering of the integration pool, and the base classifiers represented by the leaf nodes of the lower right branch rank first.
4. The integrated pruning strategy based on the balanced binary tree as claimed in claim 3, wherein in the step S2, the number of left and right branch leaf nodes of the root node is counted, a pruning threshold is set, and node elimination is performed according to the pruning threshold.
5. The balanced binary tree-based integrated pruning strategy according to claim 4, characterized in that the average values of the nodes of the left branch and the right branch of the balanced binary tree are respectively used as the left pruning threshold and the right pruning threshold, and the left pruning threshold and the right pruning threshold are respectively as follows:
$$T_{\mathrm{left}} = \frac{1}{|N_L|}\sum_{i \in N_L} p_i \qquad\qquad T_{\mathrm{right}} = \frac{1}{|N_R|}\sum_{i \in N_R} p_i$$

where $N_L$ and $N_R$ denote the node sets of the left and right branches of the balanced binary tree and $p_i$ is the precision represented by node $i$.
6. The balanced binary tree-based integrated pruning strategy according to claim 1, wherein in S1, the large data set comprises a data set under a normal condition and data sets under abnormal conditions.
7. The balanced binary tree based integrated pruning strategy according to claim 6, wherein the ratio of the number of samples in the normal-condition data set to that in the abnormal-condition data sets ranges from 100:1 to 1000:1.
8. The integrated pruning strategy based on the balanced binary tree according to claim 6, wherein the splitting in S1 specifically comprises: segmenting the normal-condition data set to obtain normal-condition sub data sets, and merging each normal-condition sub data set with the data sets of the other abnormal conditions to form data sets covering all operating conditions.
CN202010458446.4A 2020-05-27 2020-05-27 Compressor fault classification method based on balanced binary tree integrated pruning strategy Active CN111626418B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010458446.4A CN111626418B (en) 2020-05-27 2020-05-27 Compressor fault classification method based on balanced binary tree integrated pruning strategy

Publications (2)

Publication Number Publication Date
CN111626418A 2020-09-04
CN111626418B 2022-04-22

Family

ID=72273131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010458446.4A Active CN111626418B (en) 2020-05-27 2020-05-27 Compressor fault classification method based on balanced binary tree integrated pruning strategy

Country Status (1)

Country Link
CN (1) CN111626418B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050050072A1 (en) * 2003-09-03 2005-03-03 Lucent Technologies, Inc. Highly parallel tree search architecture for multi-user detection
US20090324132A1 (en) * 2008-06-25 2009-12-31 Microsoft Corporation Fast approximate spatial representations for informal retrieval
CN106033617A (en) * 2015-03-16 2016-10-19 广州四三九九信息科技有限公司 Method for performing game picture intelligent compression by combining with visualization tool
CN109033632A (en) * 2018-07-26 2018-12-18 北京航空航天大学 A kind of trend forecasting method based on depth quantum nerve network
CN110569867A (en) * 2019-07-15 2019-12-13 山东电工电气集团有限公司 Decision tree algorithm-based power transmission line fault reason distinguishing method, medium and equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
RAJEEV RASTOGI 等: "PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning", 《DATA MINING AND KNOWLEDGE DISCOVERY》 *
XIAOHENG DENG 等: "An imbalanced data classification method based on automatic clustering under-sampling", 《2016 IEEE 35TH INTERNATIONAL PERFORMANCE COMPUTING AND COMMUNICATIONS CONFERENCE (IPCCC)》 *
孙卫祥 (SUN Weixiang): "Research on Fault Diagnosis Methods Based on Data Mining and Information Fusion" [基于数据挖掘与信息融合的故障诊断方法研究], Wanfang Data (《万方数据》) *

Also Published As

Publication number Publication date
CN111626418B (en) 2022-04-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant