CN104391970B - Attribute subspace weighted random forest data processing method - Google Patents

Attribute subspace weighted random forest data processing method

Info

Publication number
CN104391970B
CN104391970B (application CN201410734550.6A)
Authority
CN
China
Prior art keywords
node
decision
tree
random forest
built
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410734550.6A
Other languages
Chinese (zh)
Other versions
CN104391970A (en)
Inventor
赵鹤
黄哲学
姜青山
吴胤旭
陈会
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201410734550.6A priority Critical patent/CN104391970B/en
Publication of CN104391970A publication Critical patent/CN104391970A/en
Application granted granted Critical
Publication of CN104391970B publication Critical patent/CN104391970B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an attribute subspace weighted random forest data processing method, the method comprising: S1, extracting, from the data sample set to be trained, N sample subsets by sampling with replacement, N being equal to the number of decision trees to be built; S2, building an unpruned decision tree model on each sample subset, where, when constructing each node of a decision tree model, all candidate attributes are first weighted by information gain and the M highest-weighted attributes are selected to participate in node construction; S3, merging the N decision tree models into one large random forest model. By weighting the attribute subspace with information gain, the present invention ensures that useful information can be extracted, thereby improving classification accuracy.

Description

Attribute subspace weighted random forest data processing method
Technical field
The present invention relates to the technical field of data processing, and more particularly to an attribute subspace weighted random forest data processing method.
Background technology
With the continuous advance and wide adoption of computers, the Internet, and information technology across all industries, the data that people accumulate is becoming ever larger and more complex. For example, the attribute dimensionality of biological data, Internet text data, digital image data, and the like can reach tens of thousands, and data volumes keep growing, so that traditional data mining classification algorithms struggle to cope with ultra-high dimensionality and ever-increasing computational cost.
The random forest algorithm is an ensemble learning method for classification that uses decision trees as sub-classifiers. Compared with other classification algorithms it offers good classification performance, high accuracy, and strong generalization ability; it has become one of the most popular algorithms in classification research and is widely used throughout data mining. Its basic idea was first proposed by Ho in 1995 and was improved by Breiman in 2001 into the random forest algorithm as it is known today. However, when facing high-dimensional data, and especially sparse high-dimensional data, the random subspace sampling it uses makes the naturally few useful attributes hard to draw, severely affecting the final classification results. Meanwhile, as data volumes keep growing, existing single-machine implementations of the random forest algorithm cannot meet the needs of today's big data, so that this otherwise excellent algorithm cannot complete modeling within a reasonably short time, limiting its use.
The main flow of the existing random forest algorithm is as follows:
1) Draw N groups of samples from the original training data by sampling with replacement, then build decision trees in a loop:
a) each group of samples builds one decision tree;
b) when building each node of a decision tree, M attributes are randomly selected for the node computation;
c) during tree growing, no branches are pruned; each branch is grown until only a single sample remains;
2) The N constructed decision tree models are integrated into one random forest model.
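For orientation, the prior-art main flow above can be sketched in Python. This is a minimal illustrative outline, not the patented method: `build_tree` is replaced by a placeholder that just records its inputs, and the uniform random choice of M attributes in `random_subspace` is exactly the unweighted behavior of step b).

```python
import random

def bootstrap_sample(data, rng):
    """Step 1: draw a sample of the same size as `data`, with replacement."""
    return [rng.choice(data) for _ in data]

def random_subspace(attributes, m, rng):
    """Step 1b: uniformly pick M attributes for a node split (no weighting)."""
    return rng.sample(attributes, m)

def train_forest_outline(data, attributes, n_trees, m, seed=0):
    """Serial main flow: one bootstrap sample per tree. A real
    implementation would grow an unpruned decision tree per sample,
    calling random_subspace at every node; here each "tree" is a
    placeholder tuple of (sample, one node's attribute subset)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(data, rng)
        forest.append((sample, random_subspace(attributes, m, rng)))
    return forest  # step 2: the collection of trees is the forest

forest = train_forest_outline(list(range(10)), ["a", "b", "c", "d"],
                              n_trees=5, m=2)
print(len(forest))  # 5
```

The function and parameter names (`n_trees`, `m`) are assumptions made for this sketch.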
For high-dimensional data, in order to raise the selection probability of valuable attributes, Amaratunga proposed a subspace sampling method that weights attributes so that important attributes are more likely to be drawn during tree growing, thereby raising the mean strength of the decision trees and improving classification performance. However, that method only targets two-class problems.
In the existing R software ecosystem, randomForest and party are the two most commonly used packages for building random forests. The randomForest package was obtained directly by porting the Fortran source code of Breiman's random forest algorithm to C, and is maintained by his team. The party package implements the random forest algorithm over the conditional inference trees proposed by Hothorn. However, for high-dimensional big data, both packages are unsatisfactory in their consumption of time and memory. Among existing random-forest-related R packages, none allows the attribute selection to be modified, and all are standalone versions that cannot run in a distributed parallel computing environment.
In summary, the existing random forest algorithm has the following problems:
Because attributes are chosen by uniform random sampling when constructing decision tree nodes, attributes that have a material impact on the result are unlikely to be selected when facing ultra-high-dimensional data, so that the accuracy of the algorithm suffers severely;
Existing algorithms all build decision tree models serially, one model per loop iteration; on multi-core CPUs they cannot exploit parallel computation across multiple CPU cores to build the random forest model quickly;
When the data volume grows so large that a single machine cannot store the required data, the existing random forest algorithm cannot load all the data at once and therefore cannot build an accurate model;
Therefore, in view of the above technical problems, it is necessary to provide an attribute subspace weighted random forest data processing method.
Summary of the invention
In view of this, an object of the present invention is to provide an attribute subspace weighted random forest data processing method, so as to solve the problem of effectively processing ultra-high-dimensional big data.
To achieve the above object, the technical solution provided by an embodiment of the present invention is as follows:
An attribute subspace weighted random forest data processing method, the method comprising:
S1, extracting, from the data sample set to be trained, N sample subsets by sampling with replacement, N being equal to the number of decision trees to be built;
S2, building an unpruned decision tree model on each sample subset, where, when constructing each node of a decision tree model, all candidate attributes are first weighted by information gain and the M highest-weighted attributes are selected to participate in node construction;
S3, merging the N constructed decision tree models into one large random forest model.
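The weighted node construction of step S2 can be sketched as follows, assuming categorical attributes and using information gain (entropy reduction) as the attribute weight. The function names (`top_m_attributes`, `information_gain`) are illustrative, not taken from the patent.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting on one categorical attribute."""
    base = entropy(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return base - remainder

def top_m_attributes(rows, labels, candidate_attrs, m):
    """Step S2: weight every candidate attribute by information gain and
    keep the M highest-weighted ones for node construction."""
    weighted = sorted(candidate_attrs,
                      key=lambda a: information_gain(rows, labels, a),
                      reverse=True)
    return weighted[:m]

rows = [(0, 1), (0, 0), (1, 1), (1, 0)]
labels = [0, 0, 1, 1]  # attribute 0 perfectly predicts the label
print(top_m_attributes(rows, labels, [0, 1], m=1))  # [0]
```

Here attribute 0 has gain 1 bit and attribute 1 has gain 0, so the single selected attribute is the informative one, which is the effect the invention aims for.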
As a further improvement of the present invention, the decision tree models in step S2 are built in single-machine multi-core multi-threaded mode or in multi-machine parallel distributed mode.
As a further improvement of the present invention, the decision tree models in step S2 are built in single-machine multi-core multi-threaded mode, which specifically comprises:
As many threads as there are CPU cores are opened automatically; each thread fetches one tree-building task from the task list and starts growing a tree according to that information; whenever a tree is finished, the completed decision tree model is put into the random forest;
All threads carry out the tree-growing process simultaneously in parallel until all tree-building tasks have been distributed and completed; finally the random forest merges all decision trees to obtain the final random forest model.
As a further improvement of the present invention, step S2 further comprises:
The task list is responsible for distributing the tree-building tasks and, once all required trees have been distributed, notifies the forest of the completion status.
As a further improvement of the present invention, the decision tree models in step S2 are built in multi-machine parallel distributed mode, in which a master node is responsible for scheduling the overall modeling and slave nodes are responsible for the specific tree-growing processes, specifically comprising:
A process on the master node holds the information for all trees to be built and divides it into multiple task lists;
Slave nodes on other machines are started as needed to grow trees; each slave node obtains one task list from the master node, then independently builds decision trees on its own machine and generates a sub random forest;
Each slave node returns the sub random forest it has built to the master node, and the master node merges all sub random forests to obtain the final random forest model.
As a further improvement of the present invention, when the machine hosting a slave node is not a multi-core machine, modeling is carried out in the serial mode of the random forest algorithm.
As a further improvement of the present invention, step S2 further comprises:
When processing big data, the master node cannot store all the data; in that case the task lists record how each data block is distributed across the machines, and during tree growing a slave node fetches the data it needs from other machines according to that distribution.
The present invention has the following beneficial effects:
Weighting the attribute subspace by information gain ensures that useful information can be extracted, thereby improving classification accuracy;
Multiple decision trees are built in parallel by single-machine multi-threading, shortening the model construction time;
Sub random forest models are built on multiple machines in a distributed one-to-many master/slave arrangement, solving the problem that the data cannot be stored on a single node while also improving modeling efficiency.
Brief description of the drawings
In order to describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of an attribute subspace weighted random forest data processing method of the present invention.
Fig. 2 is a schematic flowchart of tree growing in single-machine multi-core multi-threaded mode in an embodiment of the invention.
Fig. 3 is a schematic flowchart of tree growing in multi-machine parallel distributed mode in an embodiment of the invention.
Detailed description of the embodiments
To enable those skilled in the art to better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The invention discloses an attribute subspace weighted random forest data processing method, to solve the problem of effectively processing ultra-high-dimensional big data. Its main parts include:
1) when building decision tree nodes, the selection rate of useful attributes is raised by attribute subspace weighting, strengthening the accuracy of the algorithm on ultra-high-dimensional data;
2) on a multi-core CPU machine, the algorithm builds decision trees with parallel multi-threading, so that multiple decision tree models are built at the same time, improving the time efficiency of the algorithm;
3) when more machines are available for computation, the algorithm automatically allocates the required decision tree models across the machines in a distributed parallel manner, improving the scalability of the algorithm.
Referring to Fig. 1, an attribute subspace weighted random forest data processing method of the present invention is characterized in that the method comprises:
S1, extracting, from the data sample set to be trained, N sample subsets by sampling with replacement, N being equal to the number of decision trees to be built;
S2, building an unpruned decision tree model on each sample subset, where, when constructing each node of a decision tree model, all candidate attributes are first weighted by information gain and the M highest-weighted attributes are selected to participate in node construction;
S3, merging the N constructed decision tree models into one large random forest model.
The specific implementation steps are as follows. First, as in the existing random forest algorithm, N sample subsets, as many as the decision trees to be built, are extracted from the training data sample set by sampling with replacement. Then an unpruned decision tree model is built on each sample subset: when constructing each node of a decision tree, the present invention first weights all candidate attributes by information gain and selects the M highest-weighted attributes to participate in node construction. Finally, the N constructed decision tree models are merged into one large random forest model.
During decision tree building, for the two different environments of a single multi-core machine and multiple machines, each decision tree model is built by parallel multi-threading or by parallel distribution, respectively:
1) Building decision tree models in single-machine multi-core multi-threaded mode
As shown in Fig. 2, the task list (Task list) contains the tree-building information, including the number of decision trees to be built and the sample subset corresponding to each decision tree. In a multi-core standalone environment, decision tree models are built in parallel by multi-threading. By default, the algorithm automatically opens as many threads (Thread) as there are CPU cores; each Thread fetches one tree-building task from the Task list and starts growing a tree according to that information; whenever a tree is finished, the completed decision tree model is put into the random forest (Forest).
In this embodiment the Task list is also responsible for distributing the tree-building tasks and, once all required trees have been distributed, notifies Forest of the completion status. In this way every Thread carries out the tree-growing process simultaneously in parallel until all tree-building tasks have been distributed; finally Forest merges all decision trees to obtain the final random forest model. The number of Threads is adjustable.
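The single-machine multi-threaded mode described above can be sketched with Python's standard library: a shared `queue.Queue` plays the role of the Task list, and as many threads as CPU cores drain it. `build_tree` is a placeholder stand-in for growing one unpruned decision tree; the names are assumptions for this sketch.

```python
import os
import queue
import threading

def build_tree(task):
    """Placeholder for growing one unpruned decision tree from task info."""
    tree_id, sample = task
    return ("tree", tree_id, len(sample))

def build_forest_multithreaded(tasks, n_threads=None):
    """Single-machine multi-core mode: a shared task list feeds as many
    threads as CPU cores (adjustable); each finished tree is appended to
    the shared forest, and the forest is complete when the queue drains."""
    n_threads = n_threads or os.cpu_count() or 1
    task_list = queue.Queue()
    for t in tasks:
        task_list.put(t)
    forest, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                task = task_list.get_nowait()  # fetch one tree-building task
            except queue.Empty:
                return                         # all tasks distributed
            tree = build_tree(task)
            with lock:
                forest.append(tree)            # put finished tree into forest

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return forest

tasks = [(i, list(range(8))) for i in range(6)]
print(len(build_forest_multithreaded(tasks, n_threads=3)))  # 6
```

The lock around `forest.append` keeps the merge step safe when several threads finish trees at once.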
2) Building decision tree models in multi-machine parallel distributed mode
As shown in Fig. 3, when decision tree models are built distributed over multiple machines, the modeling process is controlled by two main modules: the master node (Master node) and the slave nodes (Slave node). The Master node is responsible for scheduling the overall modeling; the Slave nodes are responsible for the specific tree-growing processes.
The specific steps are as follows. First, a process (Tasks) on the Master node holds the information for all trees to be built and divides it into multiple task lists (Task list1, Task list2, ...), whose role is the same as that of the Task list used in single-machine multi-core multi-threaded mode. Then Slave nodes on other machines are started as needed to grow trees; each Slave node obtains one Task list (for example Task list1) from the Master node and then independently builds decision trees on its own machine, generating a random forest Forest. Its building process is the same as tree building in single-machine multi-core multi-threaded mode; if the machine is not multi-core, then by default it models in the serial mode of the existing random forest algorithm. Finally, each Slave node returns the Forest it has built to the Master node, and the Master node merges all the random forests to obtain the final random forest model Forests.
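The master/slave scheduling can be sketched as follows. In the patent each slave runs on its own machine; to keep the sketch runnable in one process, a slave is simulated by an ordinary function call, and the trees themselves are placeholder tuples. All names here are illustrative assumptions.

```python
def slave_build(task_list):
    """One slave node: independently build a sub-forest from its task
    list. In the real system this runs on a remote machine (and, on a
    multi-core machine, uses the multi-threaded mode internally)."""
    return [("tree", tree_id) for tree_id in task_list]

def master_schedule(all_tree_ids, n_slaves):
    """Master node: split the tree-building info into one task list per
    slave, collect each returned sub-forest, and merge them into the
    final random forest."""
    task_lists = [all_tree_ids[i::n_slaves] for i in range(n_slaves)]
    sub_forests = [slave_build(tl) for tl in task_lists]  # conceptually remote calls
    return [tree for sub in sub_forests for tree in sub]  # merge sub-forests

forest = master_schedule(list(range(10)), n_slaves=3)
print(len(forest))  # 10 trees, built across 3 (simulated) slave nodes
```

The round-robin split (`all_tree_ids[i::n_slaves]`) is one simple way for the master to control how many trees each slave completes, as the text below describes.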
The number of Slave nodes, and the number of decision trees completed by each Slave node, are controlled by the Master node. In addition, when processing big data, a single node cannot store all the data; in that case the Task lists in Tasks record how each data block is distributed across the machines, and during tree growing a Slave node fetches the data it needs from other machines according to this information.
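The big-data case above amounts to a small location table carried in each task list: a slave looks up which machine holds each block it needs and fetches only the non-local ones. A minimal sketch, with entirely hypothetical block and machine names:

```python
def fetch_plan(needed_blocks, local_host, block_locations):
    """Big-data case: the master's task list records where every data
    block lives; a slave plans remote fetches only for blocks that are
    not already on its own machine."""
    return {b: block_locations[b] for b in needed_blocks
            if block_locations[b] != local_host}

# Hypothetical distribution of three blocks over two machines.
block_locations = {"block0": "machine-a",
                   "block1": "machine-b",
                   "block2": "machine-a"}
plan = fetch_plan({"block0", "block1"}, "machine-a", block_locations)
print(plan)  # {'block1': 'machine-b'} -- block0 is already local
```

The actual transfer mechanism is not specified by the patent; this only shows the bookkeeping that the distribution information enables.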
In summary, compared with the prior art, the present invention has the following beneficial effects:
Weighting the attribute subspace by information gain ensures that useful information can be extracted, thereby improving classification accuracy;
Multiple decision trees are built in parallel by single-machine multi-threading, shortening the model construction time;
Sub random forest models are built on multiple machines in a distributed one-to-many master/slave arrangement, solving the problem that the data cannot be stored on a single node while also improving modeling efficiency.
It is obvious to those skilled in the art that the invention is not restricted to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from the spirit or essential attributes of the invention. Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-restrictive; the scope of the present invention is defined by the appended claims rather than by the above description, and it is intended that all changes falling within the meaning and scope of equivalency of the claims be included in the present invention. Any reference sign in a claim should not be construed as limiting the claim concerned.
Moreover, it should be appreciated that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of narration is adopted only for clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the various embodiments may also be suitably combined to form other embodiments that those skilled in the art can understand.

Claims (4)

1. An attribute subspace weighted random forest data processing method, characterized in that the method comprises:
S1, extracting, from the data sample set to be trained, N sample subsets by sampling with replacement, N being equal to the number of decision trees to be built;
S2, building an unpruned decision tree model on each sample subset, where, when constructing each node of a decision tree model, all candidate attributes are first weighted by information gain and the M highest-weighted attributes are selected to participate in node construction;
S3, merging the N constructed decision tree models into one large random forest model;
the decision tree models in step S2 being built in single-machine multi-core multi-threaded mode or in multi-machine parallel distributed mode;
the decision tree models in step S2 being built in single-machine multi-core multi-threaded mode specifically comprising:
automatically opening as many threads as there are CPU cores, each thread fetching one tree-building task from the task list and starting to grow a tree according to that information, and, whenever a tree is finished, putting the completed decision tree model into the random forest;
all threads carrying out the tree-growing process simultaneously in parallel until all tree-building tasks have been distributed and completed, and finally the random forest merging all decision trees to obtain the final random forest model;
or, the decision tree models in step S2 being built in multi-machine parallel distributed mode, in which a master node is responsible for scheduling the overall modeling and slave nodes are responsible for the specific tree-growing processes, specifically comprising:
a process on the master node holding the information for all trees to be built and dividing it into multiple task lists;
starting slave nodes on other machines as needed to grow trees, each slave node obtaining one task list from the master node and then independently building decision trees on its own machine and generating a sub random forest;
each slave node returning the sub random forest it has built to the master node, and the master node merging all sub random forests to obtain the final random forest model.
2. The method according to claim 1, characterized in that step S2 further comprises:
the task list being responsible for distributing the tree-building tasks and, once all required trees have been distributed, notifying the forest of the completion status.
3. The method according to claim 1, characterized in that, when the machine hosting a slave node is not a multi-core machine, modeling is carried out in the serial mode of the random forest algorithm.
4. The method according to claim 1, characterized in that step S2 further comprises:
when processing big data, the master node cannot store all the data; in that case the task lists record how each data block is distributed across the machines, and during tree growing a slave node fetches the data it needs from other machines according to that distribution.
CN201410734550.6A 2014-12-04 2014-12-04 Attribute subspace weighted random forest data processing method Active CN104391970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410734550.6A CN104391970B (en) 2014-12-04 2014-12-04 Attribute subspace weighted random forest data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410734550.6A CN104391970B (en) 2014-12-04 2014-12-04 Attribute subspace weighted random forest data processing method

Publications (2)

Publication Number Publication Date
CN104391970A CN104391970A (en) 2015-03-04
CN104391970B true CN104391970B (en) 2017-11-24

Family

ID=52609874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410734550.6A Active CN104391970B (en) 2014-12-04 2014-12-04 Attribute subspace weighted random forest data processing method

Country Status (1)

Country Link
CN (1) CN104391970B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156786B (en) * 2015-04-19 2019-12-27 北京典赞科技有限公司 Random forest training method based on multiple GPUs
CN104915679A (en) * 2015-05-26 2015-09-16 浪潮电子信息产业股份有限公司 Large-scale high-dimensional data classification method based on random forest weighted distance
CN105046382A (en) * 2015-09-16 2015-11-11 浪潮(北京)电子信息产业有限公司 Heterogeneous system parallel random forest optimization method and system
CN105574544A (en) * 2015-12-16 2016-05-11 平安科技(深圳)有限公司 Data processing method and device
CN109829471B (en) * 2018-12-19 2021-10-15 东软集团股份有限公司 Training method and device for random forest, storage medium and electronic equipment
CN109726826B (en) * 2018-12-19 2021-08-13 东软集团股份有限公司 Training method and device for random forest, storage medium and electronic equipment
CN110108992B (en) * 2019-05-24 2021-07-23 国网湖南省电力有限公司 Cable partial discharge fault identification method and system based on improved random forest algorithm
CN111599477A (en) * 2020-07-10 2020-08-28 吾征智能技术(北京)有限公司 Model construction method and system for predicting diabetes based on eating habits

Citations (1)

Publication number Priority date Publication date Assignee Title
CN101923650A (en) * 2010-08-27 2010-12-22 北京大学 Random forest classification method and classifiers based on comparison mode

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
JP5521881B2 (en) * 2010-08-12 2014-06-18 富士ゼロックス株式会社 Image identification information addition program and image identification information addition device

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN101923650A (en) * 2010-08-27 2010-12-22 北京大学 Random forest classification method and classifiers based on comparison mode

Non-Patent Citations (2)

Title
"A hierarchical clustering algorithm incorporating information gain", Liu Yiming et al., Computer Engineering and Applications, 31 Dec. 2012, p. 143, right column, second paragraph *
"A survey of research on the random forest method", Fang Kuangnan et al., Statistics & Information Forum, 31 Mar. 2011, p. 33, left column lines 1-4 and right column lines 1-20 *

Also Published As

Publication number Publication date
CN104391970A (en) 2015-03-04

Similar Documents

Publication Publication Date Title
CN104391970B (en) Attribute subspace weighted random forest data processing method
CN106203395B (en) Face attribute recognition method based on multi-task deep learning
CN108154430A (en) Credit scoring construction method based on machine learning and big data technology
CN109635936A (en) Neural network pruning and quantization method based on retraining
CN109886397A (en) Neural network structure pruning and compression optimization method for convolutional layers
CN103838836B (en) Multi-modal data fusion method and system based on discriminative multi-modal deep belief networks
CN104317970B (en) Data stream processing method based on a data processing center
CN107273429A (en) Missing data filling method and system based on deep learning
CN104217015B (en) Hierarchical clustering method based on shared mutual nearest neighbors
CN106355192A (en) Support vector machine method based on chaos and grey wolf optimization
CN108614997A (en) Remote sensing image recognition method based on improved AlexNet
CN107506350A (en) Information identification method and apparatus
CN109657039B (en) Work history information extraction method based on a two-layer BiLSTM-CRF
CN109543899B (en) Two-dimensional contour layout sequencing method based on deep learning
CN106796533A (en) System and method for adaptively selecting an execution mode
CN107180053A (en) Data warehouse optimization method and apparatus
CN107330592A (en) Screening method, apparatus, and computing device for target enterprise objects
Gorjestani et al. A hybrid COA-DEA method for solving multi-objective problems
CN108364030B (en) Multi-classifier model building method based on a three-layer dynamic particle swarm algorithm
Sun et al. Evaluation method for innovation capability and efficiency of high technology enterprises with interval-valued intuitionistic fuzzy information
Wen et al. MapReduce-based BP neural network classification of aquaculture water quality
CN110750572A (en) Adaptive method and apparatus for heuristic evaluation of scientific and technological achievements
CN103761298B (en) Entity matching method based on a distributed architecture
CN104572868B (en) Method and apparatus for information matching based on a question answering system
CN107357851A (en) Information processing method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant