CN104391970B - Attribute subspace weighted random forest data processing method - Google Patents

Attribute subspace weighted random forest data processing method

Info

Publication number
CN104391970B
CN104391970B (application CN201410734550.6A)
Authority
CN
China
Prior art keywords
node
decision
tree
random forest
built
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410734550.6A
Other languages
Chinese (zh)
Other versions
CN104391970A (en)
Inventor
赵鹤
黄哲学
姜青山
吴胤旭
陈会
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201410734550.6A priority Critical patent/CN104391970B/en
Publication of CN104391970A publication Critical patent/CN104391970A/en
Application granted granted Critical
Publication of CN104391970B publication Critical patent/CN104391970B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an attribute subspace weighted random forest data processing method, the method comprising: S1, extracting, from the data sample set to be trained, N sample subsets by sampling with replacement, N being equal to the number of decision trees to be built; S2, building an unpruned decision tree model on each sample subset, where, when constructing each node of a decision tree model, all candidate attributes are first weighted by information gain and the M highest-weighted attributes are selected to participate in node construction; S3, merging the N decision tree models into one large random forest model. By weighting the attribute subspace with information gain, the present invention ensures that useful information can be extracted, thereby improving classification accuracy.

Description

Attribute subspace weighted random forest data processing method
Technical field
The present invention relates to the technical field of data processing, and more particularly to an attribute subspace weighted random forest data processing method.
Background technology
With the continuous advance and wide adoption of computers, the Internet, and information technology across all industries, the data that people accumulate is becoming ever larger and more complex. For example, the attribute dimensionality of biological data, Internet text data, digital image data, and the like can reach tens of thousands, and data volumes keep growing, so that traditional data mining classification algorithms struggle to cope with ultra-high dimensionality and ever-increasing computational cost.
The random forest algorithm is an ensemble learning method for classification that uses decision trees as sub-classifiers. Compared with other classification algorithms it offers good classification performance, high accuracy, and strong generalization ability; it has become one of the most popular algorithms in classification research and is widely used throughout data mining. Its basic idea was first proposed by Ho in 1995 and was improved by Breiman in 2001 into the random forest algorithm as it is known today. However, when facing high-dimensional data, and especially sparse high-dimensional data, the random subspace sampling it uses makes the naturally few useful attributes hard to draw, severely affecting the final classification results. Meanwhile, as data volumes keep growing, existing single-machine implementations of the random forest algorithm cannot meet the needs of today's big data, so that this otherwise excellent algorithm cannot complete modeling within a reasonably short time, limiting its use.
The main flow of the existing random forest algorithm is as follows:
1) Draw N groups of samples from the original training data by sampling with replacement, then build decision trees in a loop:
a) each group of samples builds one decision tree;
b) when building each node of a decision tree, M attributes are randomly selected for the node computation;
c) during tree growing, no branches are pruned; each branch is grown until only a single sample remains;
2) The N constructed decision tree models are integrated into one random forest model.
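For orientation, the prior-art main flow above can be sketched in Python. This is a minimal illustrative outline, not the patented method: `build_tree` is replaced by a placeholder that just records its inputs, and the uniform random choice of M attributes in `random_subspace` is exactly the unweighted behavior of step b).

```python
import random

def bootstrap_sample(data, rng):
    """Step 1: draw a sample of the same size as `data`, with replacement."""
    return [rng.choice(data) for _ in data]

def random_subspace(attributes, m, rng):
    """Step 1b: uniformly pick M attributes for a node split (no weighting)."""
    return rng.sample(attributes, m)

def train_forest_outline(data, attributes, n_trees, m, seed=0):
    """Serial main flow: one bootstrap sample per tree. A real
    implementation would grow an unpruned decision tree per sample,
    calling random_subspace at every node; here each "tree" is a
    placeholder tuple of (sample, one node's attribute subset)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(data, rng)
        forest.append((sample, random_subspace(attributes, m, rng)))
    return forest  # step 2: the collection of trees is the forest

forest = train_forest_outline(list(range(10)), ["a", "b", "c", "d"],
                              n_trees=5, m=2)
print(len(forest))  # 5
```

The function and parameter names (`n_trees`, `m`) are assumptions made for this sketch.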
For high-dimensional data, in order to raise the selection probability of valuable attributes, Amaratunga proposed a subspace sampling method that weights attributes so that important attributes are more likely to be drawn during tree growing, thereby raising the mean strength of the decision trees and improving classification performance. However, that method only targets two-class problems.
In the existing R software ecosystem, randomForest and party are the two most commonly used packages for building random forests. The randomForest package was obtained directly by porting the Fortran source code of Breiman's random forest algorithm to C, and is maintained by his team. The party package implements the random forest algorithm over the conditional inference trees proposed by Hothorn. However, for high-dimensional big data, both packages are unsatisfactory in their consumption of time and memory. Among existing random-forest-related R packages, none allows the attribute selection to be modified, and all are standalone versions that cannot run in a distributed parallel computing environment.
In summary, the existing random forest algorithm has the following problems:
Because attributes are chosen by uniform random sampling when constructing decision tree nodes, attributes that have a material impact on the result are unlikely to be selected when facing ultra-high-dimensional data, so that the accuracy of the algorithm suffers severely;
Existing algorithms all build decision tree models serially, one model per loop iteration; on multi-core CPUs they cannot exploit parallel computation across multiple CPU cores to build the random forest model quickly;
When the data volume grows so large that a single machine cannot store the required data, the existing random forest algorithm cannot load all the data at once and therefore cannot build an accurate model;
Therefore, in view of the above technical problems, it is necessary to provide an attribute subspace weighted random forest data processing method.
Summary of the invention
In view of this, an object of the present invention is to provide an attribute subspace weighted random forest data processing method, so as to solve the problem of effectively processing ultra-high-dimensional big data.
To achieve the above object, the technical solution provided by an embodiment of the present invention is as follows:
An attribute subspace weighted random forest data processing method, the method comprising:
S1, extracting, from the data sample set to be trained, N sample subsets by sampling with replacement, N being equal to the number of decision trees to be built;
S2, building an unpruned decision tree model on each sample subset, where, when constructing each node of a decision tree model, all candidate attributes are first weighted by information gain and the M highest-weighted attributes are selected to participate in node construction;
S3, merging the N constructed decision tree models into one large random forest model.
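The weighted node construction of step S2 can be sketched as follows, assuming categorical attributes and using information gain (entropy reduction) as the attribute weight. The function names (`top_m_attributes`, `information_gain`) are illustrative, not taken from the patent.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting on one categorical attribute."""
    base = entropy(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return base - remainder

def top_m_attributes(rows, labels, candidate_attrs, m):
    """Step S2: weight every candidate attribute by information gain and
    keep the M highest-weighted ones for node construction."""
    weighted = sorted(candidate_attrs,
                      key=lambda a: information_gain(rows, labels, a),
                      reverse=True)
    return weighted[:m]

rows = [(0, 1), (0, 0), (1, 1), (1, 0)]
labels = [0, 0, 1, 1]  # attribute 0 perfectly predicts the label
print(top_m_attributes(rows, labels, [0, 1], m=1))  # [0]
```

Here attribute 0 has gain 1 bit and attribute 1 has gain 0, so the single selected attribute is the informative one, which is the effect the invention aims for.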
As a further improvement of the present invention, the decision tree models in step S2 are built in single-machine multi-core multi-threaded mode or in multi-machine parallel distributed mode.
As a further improvement of the present invention, the decision tree models in step S2 are built in single-machine multi-core multi-threaded mode, which specifically comprises:
As many threads as there are CPU cores are opened automatically; each thread fetches one tree-building task from the task list and starts growing a tree according to that information; whenever a tree is finished, the completed decision tree model is put into the random forest;
All threads carry out the tree-growing process simultaneously in parallel until all tree-building tasks have been distributed and completed; finally the random forest merges all decision trees to obtain the final random forest model.
As a further improvement of the present invention, step S2 further comprises:
The task list is responsible for distributing the tree-building tasks and, once all required trees have been distributed, notifies the forest of the completion status.
As a further improvement of the present invention, the decision tree models in step S2 are built in multi-machine parallel distributed mode, in which a master node is responsible for scheduling the overall modeling and slave nodes are responsible for the specific tree-growing processes, specifically comprising:
A process on the master node holds the information for all trees to be built and divides it into multiple task lists;
Slave nodes on other machines are started as needed to grow trees; each slave node obtains one task list from the master node, then independently builds decision trees on its own machine and generates a sub random forest;
Each slave node returns the sub random forest it has built to the master node, and the master node merges all sub random forests to obtain the final random forest model.
As a further improvement of the present invention, when the machine hosting a slave node is not a multi-core machine, modeling is carried out in the serial mode of the random forest algorithm.
As a further improvement of the present invention, step S2 further comprises:
When processing big data, the master node cannot store all the data; in that case the task lists record how each data block is distributed across the machines, and during tree growing a slave node fetches the data it needs from other machines according to that distribution.
The present invention has the following beneficial effects:
Weighting the attribute subspace by information gain ensures that useful information can be extracted, thereby improving classification accuracy;
Multiple decision trees are built in parallel by single-machine multi-threading, shortening the model construction time;
Sub random forest models are built on multiple machines in a distributed one-to-many master/slave arrangement, solving the problem that the data cannot be stored on a single node while also improving modeling efficiency.
Brief description of the drawings
In order to describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of an attribute subspace weighted random forest data processing method of the present invention.
Fig. 2 is a schematic flowchart of tree growing in single-machine multi-core multi-threaded mode in an embodiment of the invention.
Fig. 3 is a schematic flowchart of tree growing in multi-machine parallel distributed mode in an embodiment of the invention.
Detailed description of the embodiments
To enable those skilled in the art to better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The invention discloses an attribute subspace weighted random forest data processing method, to solve the problem of effectively processing ultra-high-dimensional big data. Its main parts include:
1) when building decision tree nodes, the selection rate of useful attributes is raised by attribute subspace weighting, strengthening the accuracy of the algorithm on ultra-high-dimensional data;
2) on a multi-core CPU machine, the algorithm builds decision trees with parallel multi-threading, so that multiple decision tree models are built at the same time, improving the time efficiency of the algorithm;
3) when more machines are available for computation, the algorithm automatically allocates the required decision tree models across the machines in a distributed parallel manner, improving the scalability of the algorithm.
Referring to Fig. 1, an attribute subspace weighted random forest data processing method of the present invention is characterized in that the method comprises:
S1, extracting, from the data sample set to be trained, N sample subsets by sampling with replacement, N being equal to the number of decision trees to be built;
S2, building an unpruned decision tree model on each sample subset, where, when constructing each node of a decision tree model, all candidate attributes are first weighted by information gain and the M highest-weighted attributes are selected to participate in node construction;
S3, merging the N constructed decision tree models into one large random forest model.
The specific implementation steps are as follows. First, as in the existing random forest algorithm, N sample subsets, as many as the decision trees to be built, are extracted from the training data sample set by sampling with replacement. Then an unpruned decision tree model is built on each sample subset: when constructing each node of a decision tree, the present invention first weights all candidate attributes by information gain and selects the M highest-weighted attributes to participate in node construction. Finally, the N constructed decision tree models are merged into one large random forest model.
During decision tree building, for the two different environments of a single multi-core machine and multiple machines, each decision tree model is built by parallel multi-threading or by parallel distribution, respectively:
1) Building decision tree models in single-machine multi-core multi-threaded mode
As shown in Fig. 2, the task list (Task list) contains the tree-building information, including the number of decision trees to be built and the sample subset corresponding to each decision tree. In a multi-core standalone environment, decision tree models are built in parallel by multi-threading. By default, the algorithm automatically opens as many threads (Thread) as there are CPU cores; each Thread fetches one tree-building task from the Task list and starts growing a tree according to that information; whenever a tree is finished, the completed decision tree model is put into the random forest (Forest).
In this embodiment the Task list is also responsible for distributing the tree-building tasks and, once all required trees have been distributed, notifies Forest of the completion status. In this way every Thread carries out the tree-growing process simultaneously in parallel until all tree-building tasks have been distributed; finally Forest merges all decision trees to obtain the final random forest model. The number of Threads is adjustable.
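The single-machine multi-threaded mode described above can be sketched with Python's standard library: a shared `queue.Queue` plays the role of the Task list, and as many threads as CPU cores drain it. `build_tree` is a placeholder stand-in for growing one unpruned decision tree; the names are assumptions for this sketch.

```python
import os
import queue
import threading

def build_tree(task):
    """Placeholder for growing one unpruned decision tree from task info."""
    tree_id, sample = task
    return ("tree", tree_id, len(sample))

def build_forest_multithreaded(tasks, n_threads=None):
    """Single-machine multi-core mode: a shared task list feeds as many
    threads as CPU cores (adjustable); each finished tree is appended to
    the shared forest, and the forest is complete when the queue drains."""
    n_threads = n_threads or os.cpu_count() or 1
    task_list = queue.Queue()
    for t in tasks:
        task_list.put(t)
    forest, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                task = task_list.get_nowait()  # fetch one tree-building task
            except queue.Empty:
                return                         # all tasks distributed
            tree = build_tree(task)
            with lock:
                forest.append(tree)            # put finished tree into forest

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return forest

tasks = [(i, list(range(8))) for i in range(6)]
print(len(build_forest_multithreaded(tasks, n_threads=3)))  # 6
```

The lock around `forest.append` keeps the merge step safe when several threads finish trees at once.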
2) Building decision tree models in multi-machine parallel distributed mode
As shown in Fig. 3, when decision tree models are built distributed over multiple machines, the modeling process is controlled by two main modules: the master node (Master node) and the slave nodes (Slave node). The Master node is responsible for scheduling the overall modeling; the Slave nodes are responsible for the specific tree-growing processes.
The specific steps are as follows. First, a process (Tasks) on the Master node holds the information for all trees to be built and divides it into multiple task lists (Task list1, Task list2, ...), whose role is the same as that of the Task list used in single-machine multi-core multi-threaded mode. Then Slave nodes on other machines are started as needed to grow trees; each Slave node obtains one Task list (for example Task list1) from the Master node and then independently builds decision trees on its own machine, generating a random forest Forest. Its building process is the same as tree building in single-machine multi-core multi-threaded mode; if the machine is not multi-core, then by default it models in the serial mode of the existing random forest algorithm. Finally, each Slave node returns the Forest it has built to the Master node, and the Master node merges all the random forests to obtain the final random forest model Forests.
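The master/slave scheduling can be sketched as follows. In the patent each slave runs on its own machine; to keep the sketch runnable in one process, a slave is simulated by an ordinary function call, and the trees themselves are placeholder tuples. All names here are illustrative assumptions.

```python
def slave_build(task_list):
    """One slave node: independently build a sub-forest from its task
    list. In the real system this runs on a remote machine (and, on a
    multi-core machine, uses the multi-threaded mode internally)."""
    return [("tree", tree_id) for tree_id in task_list]

def master_schedule(all_tree_ids, n_slaves):
    """Master node: split the tree-building info into one task list per
    slave, collect each returned sub-forest, and merge them into the
    final random forest."""
    task_lists = [all_tree_ids[i::n_slaves] for i in range(n_slaves)]
    sub_forests = [slave_build(tl) for tl in task_lists]  # conceptually remote calls
    return [tree for sub in sub_forests for tree in sub]  # merge sub-forests

forest = master_schedule(list(range(10)), n_slaves=3)
print(len(forest))  # 10 trees, built across 3 (simulated) slave nodes
```

The round-robin split (`all_tree_ids[i::n_slaves]`) is one simple way for the master to control how many trees each slave completes, as the text below describes.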
The number of Slave nodes, and the number of decision trees completed by each Slave node, are controlled by the Master node. In addition, when processing big data, a single node cannot store all the data; in that case the Task lists in Tasks record how each data block is distributed across the machines, and during tree growing a Slave node fetches the data it needs from other machines according to this information.
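The big-data case above amounts to a small location table carried in each task list: a slave looks up which machine holds each block it needs and fetches only the non-local ones. A minimal sketch, with entirely hypothetical block and machine names:

```python
def fetch_plan(needed_blocks, local_host, block_locations):
    """Big-data case: the master's task list records where every data
    block lives; a slave plans remote fetches only for blocks that are
    not already on its own machine."""
    return {b: block_locations[b] for b in needed_blocks
            if block_locations[b] != local_host}

# Hypothetical distribution of three blocks over two machines.
block_locations = {"block0": "machine-a",
                   "block1": "machine-b",
                   "block2": "machine-a"}
plan = fetch_plan({"block0", "block1"}, "machine-a", block_locations)
print(plan)  # {'block1': 'machine-b'} -- block0 is already local
```

The actual transfer mechanism is not specified by the patent; this only shows the bookkeeping that the distribution information enables.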
In summary, compared with the prior art, the present invention has the following beneficial effects:
Weighting the attribute subspace by information gain ensures that useful information can be extracted, thereby improving classification accuracy;
Multiple decision trees are built in parallel by single-machine multi-threading, shortening the model construction time;
Sub random forest models are built on multiple machines in a distributed one-to-many master/slave arrangement, solving the problem that the data cannot be stored on a single node while also improving modeling efficiency.
It is obvious to those skilled in the art that the invention is not restricted to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from the spirit or essential attributes of the invention. Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-restrictive; the scope of the present invention is defined by the appended claims rather than by the above description, and it is intended that all changes falling within the meaning and scope of equivalency of the claims be included in the present invention. Any reference sign in a claim should not be construed as limiting the claim concerned.
Moreover, it should be appreciated that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of narration is adopted only for clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the various embodiments may also be suitably combined to form other embodiments that those skilled in the art can understand.

Claims (4)

1. An attribute subspace weighted random forest data processing method, characterized in that the method comprises:
S1, extracting, from the data sample set to be trained, N sample subsets by sampling with replacement, N being equal to the number of decision trees to be built;
S2, building an unpruned decision tree model on each sample subset, where, when constructing each node of a decision tree model, all candidate attributes are first weighted by information gain and the M highest-weighted attributes are selected to participate in node construction;
S3, merging the N constructed decision tree models into one large random forest model;
the decision tree models in step S2 being built in single-machine multi-core multi-threaded mode or in multi-machine parallel distributed mode;
the decision tree models in step S2 being built in single-machine multi-core multi-threaded mode specifically comprising:
automatically opening as many threads as there are CPU cores, each thread fetching one tree-building task from the task list and starting to grow a tree according to that information, and, whenever a tree is finished, putting the completed decision tree model into the random forest;
all threads carrying out the tree-growing process simultaneously in parallel until all tree-building tasks have been distributed and completed, and finally the random forest merging all decision trees to obtain the final random forest model;
or, the decision tree models in step S2 being built in multi-machine parallel distributed mode, in which a master node is responsible for scheduling the overall modeling and slave nodes are responsible for the specific tree-growing processes, specifically comprising:
a process on the master node holding the information for all trees to be built and dividing it into multiple task lists;
starting slave nodes on other machines as needed to grow trees, each slave node obtaining one task list from the master node and then independently building decision trees on its own machine and generating a sub random forest;
each slave node returning the sub random forest it has built to the master node, and the master node merging all sub random forests to obtain the final random forest model.
2. The method according to claim 1, characterized in that step S2 further comprises:
the task list being responsible for distributing the tree-building tasks and, once all required trees have been distributed, notifying the forest of the completion status.
3. The method according to claim 1, characterized in that, when the machine hosting a slave node is not a multi-core machine, modeling is carried out in the serial mode of the random forest algorithm.
4. The method according to claim 1, characterized in that step S2 further comprises:
when processing big data, the master node cannot store all the data; in that case the task lists record how each data block is distributed across the machines, and during tree growing a slave node fetches the data it needs from other machines according to that distribution.
CN201410734550.6A 2014-12-04 2014-12-04 Attribute subspace weighted random forest data processing method Active CN104391970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410734550.6A CN104391970B (en) 2014-12-04 2014-12-04 Attribute subspace weighted random forest data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410734550.6A CN104391970B (en) 2014-12-04 2014-12-04 Attribute subspace weighted random forest data processing method

Publications (2)

Publication Number Publication Date
CN104391970A CN104391970A (en) 2015-03-04
CN104391970B true CN104391970B (en) 2017-11-24

Family

ID=52609874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410734550.6A Active CN104391970B (en) 2014-12-04 2014-12-04 Attribute subspace weighted random forest data processing method

Country Status (1)

Country Link
CN (1) CN104391970B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156786B (en) * 2015-04-19 2019-12-27 北京典赞科技有限公司 Random forest training method based on multiple GPUs
CN104915679A (en) * 2015-05-26 2015-09-16 浪潮电子信息产业股份有限公司 Large-scale high-dimensional data classification method based on random forest weighted distance
CN105046382A (en) * 2015-09-16 2015-11-11 浪潮(北京)电子信息产业有限公司 Heterogeneous system parallel random forest optimization method and system
CN105574544A (en) * 2015-12-16 2016-05-11 平安科技(深圳)有限公司 Data processing method and device
CN109829471B (en) * 2018-12-19 2021-10-15 东软集团股份有限公司 Training method and device for random forest, storage medium and electronic equipment
CN109726826B (en) * 2018-12-19 2021-08-13 东软集团股份有限公司 Training method and device for random forest, storage medium and electronic equipment
CN110108992B (en) * 2019-05-24 2021-07-23 国网湖南省电力有限公司 Cable partial discharge fault identification method and system based on improved random forest algorithm
CN111599477A (en) * 2020-07-10 2020-08-28 吾征智能技术(北京)有限公司 Model construction method and system for predicting diabetes based on eating habits

Citations (1)

Publication number Priority date Publication date Assignee Title
CN101923650A (en) * 2010-08-27 2010-12-22 北京大学 Random forest classification method and classifiers based on comparison mode

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
JP5521881B2 (en) * 2010-08-12 2014-06-18 富士ゼロックス株式会社 Image identification information addition program and image identification information addition device

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN101923650A (en) * 2010-08-27 2010-12-22 北京大学 Random forest classification method and classifiers based on comparison mode

Non-Patent Citations (2)

Title
"A hierarchical clustering algorithm incorporating information gain", Liu Yiming et al., Computer Engineering and Applications, 31 Dec. 2012, p. 143, right column, second paragraph *
"A survey of research on the random forest method", Fang Kuangnan et al., Statistics & Information Forum, 31 Mar. 2011, p. 33, left column lines 1-4 and right column lines 1-20 *

Also Published As

Publication number Publication date
CN104391970A (en) 2015-03-04

Similar Documents

Publication Publication Date Title
CN104391970B (en) Attribute subspace weighted random forest data processing method
CN106203395B (en) Face attribute recognition method based on multi-task deep learning
CN108154430A (en) Credit scoring construction method based on machine learning and big data technology
CN109635936A (en) Neural network pruning and quantization method based on retraining
CN109886397A (en) Neural network structure pruning and compression optimization method for convolutional layers
CN103838836B (en) Multi-modal data fusion method and system based on discriminative multi-modal deep belief networks
CN104317970B (en) Data stream processing method based on a data processing center
CN107273429A (en) Missing data filling method and system based on deep learning
CN104217015B (en) Hierarchical clustering method based on shared mutual nearest neighbors
CN106355192A (en) Support vector machine method based on chaos and grey wolf optimization
CN108614997A (en) Remote sensing image recognition method based on improved AlexNet
CN107506350A (en) Information identification method and apparatus
CN109657039B (en) Work history information extraction method based on a two-layer BiLSTM-CRF
CN109543899B (en) Two-dimensional contour layout sequencing method based on deep learning
CN106796533A (en) System and method for adaptively selecting an execution mode
CN107180053A (en) Data warehouse optimization method and apparatus
CN107330592A (en) Screening method, apparatus, and computing device for target enterprise objects
Gorjestani et al. A hybrid COA-DEA method for solving multi-objective problems
CN108364030B (en) Multi-classifier model building method based on a three-layer dynamic particle swarm algorithm
Sun et al. Evaluation method for innovation capability and efficiency of high technology enterprises with interval-valued intuitionistic fuzzy information
Wen et al. MapReduce-based BP neural network classification of aquaculture water quality
CN110750572A (en) Adaptive method and apparatus for heuristic evaluation of scientific and technological achievements
CN103761298B (en) Entity matching method based on a distributed architecture
CN104572868B (en) Method and apparatus for information matching based on a question answering system
CN107357851A (en) Information processing method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant