CN104391970B - Attribute-subspace-weighted random forest data processing method - Google Patents
- Publication number: CN104391970B
- Application number: CN201410734550.6A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Abstract
The invention discloses an attribute-subspace-weighted random forest data processing method. The method includes: S1, extracting from the training data set, by sampling with replacement, N sample sets, as many as the number of decision trees to be built; S2, building an unpruned decision-tree model from each sample set, where, when constructing each node of the model, all candidate attributes are first weighted by information gain and the M attributes with the highest weights are selected to participate in node construction; S3, merging the N decision-tree models into one large random forest model. By weighting the attribute subspace with information gain, the invention ensures that useful attributes can be extracted, thereby improving classification accuracy.
Description
Technical field
The invention relates to the technical field of data processing, and in particular to an attribute-subspace-weighted random forest data processing method.
Background art
With the continuous advance and wide adoption of computers, the Internet, and information technology across all industries, the data that people accumulate are becoming ever larger and more complex. For example, the attribute dimensionality of biological data, Internet text data, and digital image data can reach into the thousands, and data volumes keep growing, so traditional data-mining classification algorithms struggle to cope with ultra-high dimensionality and ever-increasing computational cost.
The random forest algorithm is an ensemble learning method for classification that uses decision trees as sub-classifiers. Compared with other classification algorithms it offers good classification performance, high accuracy, and strong generalization ability, which has made it one of the most popular algorithms in classification research and widely used across the fields of data mining. Its basic idea was first proposed by Ho in 1995 and was refined by Breiman in 2001 into the random forest algorithm used today. However, on high-dimensional data, and especially on sparse high-dimensional data, the random subspace sampling it uses makes the few genuinely useful attributes hard to draw, which severely degrades the final classification results. Moreover, as data volumes grow, existing single-machine random forest implementations cannot meet the needs of today's big data, so even an excellent algorithm cannot finish modeling in a reasonably short time, limiting its use.
The main flow of the existing random forest algorithm is as follows:
1) Sample N groups with replacement from the original training data, then build decision trees in a loop:
a) each group of samples builds one decision tree;
b) when building each node of a decision tree, randomly select M attributes for the node computation;
c) during tree building no branches are pruned; splitting continues until a node holds only samples of one class;
2) Integrate the N constructed decision-tree models into one random forest model.
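The baseline flow above can be sketched in plain Python. This is a minimal illustrative stand-in rather than any production implementation: it assumes numeric attributes split by threshold, uses Gini impurity as the node-level split score (one common choice for the node computation), and represents trees as nested tuples; all function names are invented for this sketch.

```python
import random
from collections import Counter

def impurity(labels):
    # Gini impurity: 0 for a pure node
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(X, y, m):
    # Steps b/c: unpruned recursion, m randomly chosen attributes per node
    if len(set(y)) == 1:                        # pure node -> leaf
        return ("leaf", y[0])
    feats = random.sample(range(len(X[0])), m)  # random attribute subspace
    best = None
    for f in feats:
        for t in set(row[f] for row in X):      # candidate thresholds
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (impurity([y[i] for i in left]) * len(left)
                     + impurity([y[i] for i in right]) * len(right))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:                            # no valid split -> majority leaf
        return ("leaf", Counter(y).most_common(1)[0][0])
    _, f, t, left, right = best
    return ("node", f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], m),
            build_tree([X[i] for i in right], [y[i] for i in right], m))

def predict(tree, row):
    while tree[0] == "node":
        _, f, t, lo, hi = tree
        tree = lo if row[f] <= t else hi
    return tree[1]

def random_forest(X, y, n_trees, m):
    # Step 1: N bootstrap samples, one unpruned tree per sample
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(len(X)) for _ in range(len(X))]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], m))
    return forest

def forest_predict(forest, row):
    # Step 2: majority vote over all trees
    return Counter(predict(t, row) for t in forest).most_common(1)[0][0]
```

With `m` well below the total attribute count, each node sees only a uniformly random attribute subspace — exactly the step the invention replaces with information-gain weighting.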
To improve the selection probability of valuable attributes on high-dimensional data, Amaratunga proposed weighted subspace sampling of attributes during tree building, raising the probability that important attributes are drawn and thereby increasing the mean strength of the decision trees and improving classification performance. That method, however, is restricted to two-class problems.
In the existing R software ecosystem, randomForest and party are the two packages most commonly used to build random forests. The randomForest package was obtained directly by converting Breiman's Fortran source code for the random forest algorithm into C, and is maintained by his team. The party package implements the random forest algorithm over the conditional inference trees proposed by Hothorn. For high-dimensional big data, however, both packages are unsatisfactory in their consumption of time and memory. Among the existing random-forest-related R packages, none allows the attribute selection to be modified, and all are single-machine versions that cannot run in a distributed parallel computing environment.
In summary, the existing random forest algorithm has the following problems:
Because attributes are chosen by uniform random sampling when building decision tree nodes, on ultra-high-dimensional data the attributes that materially affect the result are unlikely to be selected, so accuracy suffers severely;
Existing algorithms all build decision-tree models serially, one model per loop iteration; on multi-core CPUs they cannot exploit parallel computation across cores to build the random forest model quickly;
When the data grow so large that a single machine cannot store them, the existing random forest algorithm cannot load all the data at once and therefore cannot build an accurate model.
Therefore, in view of the above technical problems, it is necessary to provide an attribute-subspace-weighted random forest data processing method.
Summary of the invention
In view of this, an object of the invention is to provide an attribute-subspace-weighted random forest data processing method, so as to solve the problem of effectively processing ultra-high-dimensional big data.
To achieve the above object, the technical solution provided by embodiments of the invention is as follows:
An attribute-subspace-weighted random forest data processing method, the method including:
S1, extracting from the training data set, by sampling with replacement, N sample sets, as many as the number of decision trees to be built;
S2, building an unpruned decision-tree model from each sample set, where, when constructing each node of a decision-tree model, all candidate attributes are first weighted by information gain and the M attributes with the highest weights are selected to participate in node construction;
S3, merging the N decision-tree models into one large random forest model.
As a further improvement of the invention, the decision-tree models in step S2 are built either in single-machine multi-core multithreaded mode or in multi-machine parallel distributed mode.
As a further improvement of the invention, building the decision-tree models in step S2 in single-machine multi-core multithreaded mode specifically includes:
automatically starting as many threads as there are CPU cores; each thread fetches one tree-building task from the task list and builds a tree according to it, and each finished decision-tree model is placed into the random forest;
the threads carry out the tree-building process in parallel until all tree-building tasks have been distributed, and finally the random forest merges all decision trees to obtain the final random forest model.
As a further improvement of the invention, step S2 also includes:
the task list is responsible for distributing the tree-building tasks and, once all required trees have been distributed, notifies the forest of completion.
As a further improvement of the invention, building the decision-tree models in step S2 in multi-machine parallel distributed mode, where a master node is responsible for scheduling the overall modeling and slave nodes are responsible for the actual tree building, specifically includes:
a process on the master node holds the information for all trees to be built and divides it into multiple task lists;
slave nodes are started on other machines as needed; each slave node obtains one task list from the master node, then independently builds decision trees on its own machine and generates a sub random forest;
each slave node returns its sub random forest to the master node, and the master node merges all sub random forests to obtain the final random forest model.
As a further improvement of the invention, when the machine hosting a slave node is not multi-core, modeling falls back to the serial mode of the random forest algorithm.
As a further improvement of the invention, step S2 also includes:
when processing big data, the master node cannot hold all the data; in that case the task lists record how each data block is distributed across the machines, and during tree building each slave node fetches the data it needs from the other machines according to that distribution.
The invention has the following advantages:
information gain weighting of the attribute subspace ensures that useful attributes can be extracted, improving classification accuracy;
multiple decision trees are built in parallel by single-machine multithreading, shortening the model construction time;
sub random forest models are built on multiple machines in a one-to-many master-slave distributed mode, solving the problem that the data cannot be stored on a single node while also improving modeling efficiency.
Brief description of the drawings
To explain the technical solutions in the embodiments of the invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of the attribute-subspace-weighted random forest data processing method of the invention.
Fig. 2 is a schematic flow chart of tree building in single-machine multi-core multithreaded mode in an embodiment of the invention.
Fig. 3 is a schematic flow chart of tree building in multi-machine parallel distributed mode in an embodiment of the invention.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions of the invention, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the scope of protection of the invention.
The invention discloses an attribute-subspace-weighted random forest data processing method, to solve the problem of effectively processing ultra-high-dimensional big data. Its main parts include:
1) when building decision tree nodes, attribute subspace weighting raises the selection rate of useful attributes, strengthening the algorithm's accuracy on ultra-high-dimensional data;
2) on multi-core CPU machines, the algorithm builds decision trees with parallel multithreading, so multiple decision-tree models can be built at the same time, improving the algorithm's time efficiency;
3) when more machines are available for computation, the algorithm automatically distributes the required decision-tree models across the machines in parallel, improving the algorithm's scalability.
As shown in Fig. 1, the attribute-subspace-weighted random forest data processing method of the invention includes:
S1, extracting from the training data set, by sampling with replacement, N sample sets, as many as the number of decision trees to be built;
S2, building an unpruned decision-tree model from each sample set, where, when constructing each node of a decision-tree model, all candidate attributes are first weighted by information gain and the M attributes with the highest weights are selected to participate in node construction;
S3, merging the N decision-tree models into one large random forest model.
The specific implementation steps are as follows. First, as in the existing random forest algorithm, N sample sets, one per decision tree to be built, are drawn from the training data set by sampling with replacement. Then an unpruned decision-tree model is built from each sample set; when constructing each node of a decision tree, the invention first weights all candidate attributes by information gain and selects the M attributes with the highest weights to participate in node construction. Finally, the N constructed decision-tree models are merged into one large random forest model.
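The change to node construction can be sketched as follows. This is an illustrative reading of the step, not the patent's code: every candidate attribute is scored by information gain over the samples that reached the node, and only the M highest-scoring attributes participate in node construction. The sketch assumes categorical attributes, and the helper names are invented here; a full implementation would also handle numeric attributes and fold this selection into the node-building loop.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a class-label list, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(X, y, f):
    # gain of splitting on (categorical) attribute f
    by_value = {}
    for row, label in zip(X, y):
        by_value.setdefault(row[f], []).append(label)
    n = len(y)
    conditional = sum(len(part) / n * entropy(part)
                      for part in by_value.values())
    return entropy(y) - conditional

def top_m_attributes(X, y, m):
    # weight every candidate attribute by information gain and keep
    # the M highest-weight attributes for node construction
    gains = {f: information_gain(X, y, f) for f in range(len(X[0]))}
    return sorted(gains, key=gains.get, reverse=True)[:m]
```

At each node, the builder would compute `top_m_attributes` over the samples that reached the node and evaluate candidate splits only on those M attributes, instead of on a uniformly random subset.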
During decision-tree construction, the two different environments of single-machine multi-core and multi-machine are handled by building the decision-tree models with parallel multithreading and parallel distribution respectively:
1) Building decision-tree models in single-machine multi-core multithreaded mode
As shown in Fig. 2, the task list (Task list) holds the tree-building information, including the number of decision trees to build and the sample set for each tree. In a multi-core single-machine environment, decision-tree models are built in parallel with multithreading. By default the algorithm automatically starts as many threads (Thread) as there are CPU cores; each Thread fetches one tree-building task from the Task list and builds a tree according to it, and every finished decision-tree model is placed into the random forest (Forest).
In this embodiment the Task list is also responsible for distributing the tree-building tasks and, once all required trees have been distributed, notifies Forest of completion. In this way the Threads carry out the tree-building process in parallel until all tasks have been distributed, and finally Forest merges all decision trees into the final random forest model. The number of Threads is adjustable.
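The Task list / Thread / Forest flow of Fig. 2 maps naturally onto Python's standard `threading` and `queue` modules; the sketch below is illustrative only, and its names are invented. One caveat of the sketch rather than of the method: CPython's GIL keeps pure-Python workers from running truly in parallel, so a real implementation would do the per-tree work in native code or in separate processes.

```python
import os
import queue
import threading

def build_forest_multithreaded(tasks, build_tree, n_threads=None):
    # One entry per decision tree to build goes into the shared task list.
    task_list = queue.Queue()
    for info in tasks:
        task_list.put(info)
    forest = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                info = task_list.get_nowait()   # fetch one build task
            except queue.Empty:
                return                          # all tasks distributed
            tree = build_tree(info)             # build one decision tree
            with lock:
                forest.append(tree)             # put it into the forest

    # Default: as many threads as CPU cores, as described above.
    n = n_threads or os.cpu_count() or 1
    threads = [threading.Thread(target=worker) for _ in range(n)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return forest                               # merged random forest
```

Here `build_tree` stands in for any per-tree builder (such as the node-level logic sketched earlier); the queue plays the role of the Task list and the joined result plays the role of Forest.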
2) Building decision-tree models in multi-machine parallel distributed mode
As shown in Fig. 3, when decision-tree models are built with multiple distributed machines, the modeling process is controlled by two major modules: the master node (Master node) and the slave nodes (Slave node). The Master node is responsible for scheduling the overall modeling; the Slave nodes are responsible for the actual tree building.
The specific steps are as follows. First, a process (Tasks) on the Master node holds the information for all trees to be built and divides it into multiple task lists (Task list1, Task list2, ...), which play the same role as the Task list used when building decision-tree models in single-machine multi-core mode. Then Slave nodes are started on other machines as needed; each Slave node obtains one Task list (for example Task list1) from the Master node and then independently builds decision trees on its own machine, generating a random forest (Forest). Its building process is the same as building decision trees in single-machine multi-core mode; if the machine is not multi-core, its building process by default falls back to the serial mode of the existing random forest algorithm. Finally, each Slave node returns the Forest it has built to the Master node, and the Master node merges all the random forests into the final random forest model (Forests).
The number of Slave nodes, and the number of decision trees each Slave node completes, are controlled by the Master node. In addition, when processing big data a single node cannot hold all the data; in that case the Task lists in Tasks record how each data block is distributed across the machines, and during tree building each Slave node fetches the data it needs from the other machines according to that information.
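The Fig. 3 flow reduces to two pieces of bookkeeping on the master: partitioning the tree-building information into per-slave task lists, and merging the returned sub-forests. The sketch below runs the "slaves" in-process purely for illustration; in the described system each task list would be dispatched to a Slave node on another machine, and the big-data variant would additionally record data-block locations in each task list. All names are invented for this sketch.

```python
def partition_tasks(n_trees, n_slaves):
    # Master side: split the build information for n_trees decision trees
    # into one task list per slave node, as evenly as possible.
    base, extra = divmod(n_trees, n_slaves)
    task_lists, start = [], 0
    for s in range(n_slaves):
        size = base + (1 if s < extra else 0)
        task_lists.append(list(range(start, start + size)))
        start += size
    return task_lists

def run_slave(task_list, build_tree):
    # Slave side: independently build every assigned tree into a sub-forest.
    return [build_tree(info) for info in task_list]

def distributed_forest(n_trees, n_slaves, build_tree):
    # Master schedules, slaves build, master merges the sub-forests.
    task_lists = partition_tasks(n_trees, n_slaves)
    sub_forests = [run_slave(tl, build_tree) for tl in task_lists]
    return [tree for sub in sub_forests for tree in sub]
```

On a real cluster, `run_slave` would execute remotely (and would itself use the multithreaded builder on multi-core machines, or the serial mode otherwise), while `partition_tasks` and the final merge remain on the Master node.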
In summary, compared with the prior art, the invention has the following advantages:
information gain weighting of the attribute subspace ensures that useful attributes can be extracted, improving classification accuracy;
multiple decision trees are built in parallel by single-machine multithreading, shortening the model construction time;
sub random forest models are built on multiple machines in a one-to-many master-slave distributed mode, solving the problem that the data cannot be stored on a single node while also improving modeling efficiency.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and that the invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and range of equivalency of the claims are intended to be embraced in the invention. No reference sign in a claim should be construed as limiting the claim concerned. Moreover, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of narration is adopted only for clarity. Those skilled in the art should treat the specification as a whole, and the technical solutions in the various embodiments may also be suitably combined to form other embodiments understandable to those skilled in the art.
Claims (4)
1. An attribute-subspace-weighted random forest data processing method, characterized in that the method includes:
S1, extracting from the training data set, by sampling with replacement, N sample sets, as many as the number of decision trees to be built;
S2, building an unpruned decision-tree model from each sample set, where, when constructing each node of a decision-tree model, all candidate attributes are first weighted by information gain and the M attributes with the highest weights are selected to participate in node construction;
S3, merging the N decision-tree models into one large random forest model;
the decision-tree models in step S2 being built either in single-machine multi-core multithreaded mode or in multi-machine parallel distributed mode;
building the decision-tree models in step S2 in single-machine multi-core multithreaded mode specifically including:
automatically starting as many threads as there are CPU cores, each thread fetching one tree-building task from the task list and building a tree according to it, and each finished decision-tree model being placed into the random forest;
the threads carrying out the tree-building process in parallel until all tree-building tasks have been distributed, and the random forest finally merging all decision trees into the final random forest model;
or building the decision-tree models in step S2 in multi-machine parallel distributed mode, a master node being responsible for scheduling the overall modeling and slave nodes being responsible for the actual tree building, specifically including:
a process on the master node holding the information for all trees to be built and dividing it into multiple task lists;
slave nodes being started on other machines as needed, each slave node obtaining one task list from the master node and then independently building decision trees on its own machine and generating a sub random forest;
each slave node returning its sub random forest to the master node, and the master node merging all sub random forests to obtain the final random forest model.
2. The method according to claim 1, characterized in that step S2 also includes:
the task list being responsible for distributing the tree-building tasks and, once all required trees have been distributed, notifying the forest of completion.
3. The method according to claim 1, characterized in that, when the machine hosting a slave node is not multi-core, modeling is performed in the serial mode of the random forest algorithm.
4. The method according to claim 1, characterized in that step S2 also includes:
when processing big data, the master node cannot hold all the data; in that case the task lists record how each data block is distributed across the machines, and during tree building each slave node fetches the data it needs from the other machines according to that distribution.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410734550.6A | 2014-12-04 | 2014-12-04 | Attribute-subspace-weighted random forest data processing method |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN104391970A | 2015-03-04 |
| CN104391970B | 2017-11-24 |