CN111798003A - Multi-view learning algorithm based on random forest - Google Patents
- Publication number
- CN111798003A (application CN202010629341.0A)
- Authority
- CN
- China
- Prior art keywords
- view
- decision tree
- sample
- random forest
- views
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Abstract
Random forests are among the most classical machine learning algorithms and have gained widespread use. However, although multi-view data are abundant and have been studied extensively, random forests designed for multi-view scenarios remain surprisingly scarce. The only existing approach to multi-view learning with random forests generates a separate random forest for each view and then fuses the multi-view information at decision time. A significant drawback of this approach is that the correlation among views is not exploited during the construction phase of the random forests, which wastes information resources. To make up for this deficiency, the invention provides an improved multi-view learning algorithm based on random forests. Specifically, view fusion is carried out during the generation of each decision tree: information interaction between views is integrated into the construction stage of the decision tree, so that complementary information between views is utilized throughout the generation of the random forest. In addition, the invention generates discriminative decision boundaries for the decision tree through discriminant analysis, making the tree better suited to classification.
Description
Technical Field
The invention belongs to the field of machine learning, and particularly relates to a multi-view learning method based on a random forest, which is used for solving the classification problem in a multi-view scene.
Background
The random forest, first proposed by Breiman in 2001, has become one of the most widely used ensemble learning algorithms. A random forest constructs multiple decision trees using random resampling and a random node-splitting strategy, and then obtains the final classification result by voting. Thanks to its advantages of high accuracy, good interpretability, low overfitting risk, and good noise tolerance, it has achieved impressive success in many fields including computer vision and data mining, has stimulated extensive follow-up research, and has given rise to variants such as dynamic random forests and deep forests.
However, existing random forests and their variants focus almost exclusively on single-view learning scenarios; random forests for multi-view learning are rare. As is well known, many real-world classification problems are inherently multi-view, because data can often be characterized from multiple complementary aspects, and single-view data generally cannot describe the full picture of the data information. For example, a picture can be represented jointly by its texture, shape, and color features, forming a set of multi-view data. Leveraging complementary information from different views can improve generalization performance, which has driven the widespread deployment of multi-view learning. Surprisingly, however, there is little random-forest-based multi-view learning. To our knowledge, there are only two multi-view works using random forests: a multi-view random forest proposed in 2015 for pedestrian detection, and a difference-based multi-view random forest proposed in 2019 and used to study radioactive groups. Both methods generate a separate random forest for each view and then fuse the multi-view information in the decision (back-end) stage. They therefore do not utilize the correlation among views during the construction stage of the random forest, which is undoubtedly a waste of information resources. To overcome this defect, the invention provides an improved multi-view learning algorithm based on random forests, integrating view-interaction information into the entire construction stage of the decision tree and thereby fully utilizing inter-view complementary information throughout random forest generation.
Disclosure of Invention
As analyzed above, the conventional way to build random forests for a multi-view scenario is to generate a corresponding random forest for each view and then decide the final prediction by voting across the forests. The disadvantage of these methods is that the complementary information between views is utilized only in the back-end decision phase. The invention aims to perform information interaction between views throughout the construction stage of the random forest, so that the correlation between views is fully exploited over the entire process.
In order to utilize multi-view information all the way through the decision tree construction stage, the following two problems need to be solved:
(1) How are the multi-view data fused during the construction phase of the decision tree?
(2) How can the fused data be used for classification?
Regarding the first problem, each view can be projected separately and the projections combined by an inner product. Taking two views as an example, a sample pair (x_i, y_i) is projected onto vectors w_x and w_y respectively, and the view data are fused by way of an inner product:

  z_i = (w_x^T x_i)(w_y^T y_i).

Regarding the second problem, discriminant analysis may be performed. For a sample set {(x_i, y_i)}, an optimal matrix W is found so that the fused values z_i = x_i^T W y_i (with W generalizing the rank-one product w_x w_y^T) have the largest between-class distance and the smallest within-class distance. For the fused samples, the hyperplane that currently best partitions the data space is computed using an impurity measure, and a subtree is generated in each partition created by the hyperplane. Recursing according to these steps finally yields a two-view decision tree.
Taking the two-view data as an example, the specific implementation process of the invention is as follows:
(1) fusing two-view data
For a set of aligned two-view sample sets {(x_i, y_i)}, an optimal matrix W is computed by discriminant analysis so that the fused data z_i = x_i^T W y_i have the largest between-class distance and the smallest within-class distance.
(2) Training a two-view decision tree
For the fused samples, the hyperplane that currently best partitions the data space is computed using an impurity measure, subtrees are generated in the two partitions created by the hyperplane respectively, and recursion proceeds in this manner to finally obtain a two-view decision tree.
(3) Training two-view random forest
K bootstrap sample sets are randomly drawn using the bootstrap resampling method, and a two-view decision tree is constructed on each of the K sample sets, where each tree grows freely to its maximum extent, i.e., without pruning. The final prediction of the forest is obtained by majority voting.
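The three steps above can be sketched as a training loop; `fit_two_view_tree` is a hypothetical stand-in for steps (1)-(2), i.e., fusing the views and growing one unpruned two-view decision tree:

```python
import numpy as np

def train_two_view_forest(X, Y, labels, K, fit_two_view_tree, seed=0):
    """Step (3): draw K bootstrap sample sets (sampling n indices with
    replacement) and grow one two-view decision tree on each."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    trees = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)   # bootstrap resample
        trees.append(fit_two_view_tree(X[idx], Y[idx], labels[idx]))
    return trees
```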
Drawings
FIG. 1 is a flow chart of a random forest based two-view learning algorithm;
FIG. 2 is a flow chart of a two-view based decision tree algorithm;
FIG. 3 is a multi-view random forest accuracy comparison graph.
Detailed Description
The technical content of the present invention is further explained below with reference to the drawings. The experimental data in this embodiment all come from real data sets in the UCI standard repository. To generate the two-view data, each D-dimensional sample is cut following the relevant practice in the literature: the first group of dimensions serves as the first view and the remaining dimensions serve as the second view.
FIG. 1 shows a flow chart of a random forest-based two-view learning algorithm provided by the invention, which specifically comprises the following steps:
step 1: fusing two-view data
Let {(x_i, y_i) ∈ R^p × R^q} be a two-view sample set, and let the data matrices

  X = [x_1, ..., x_n] ∈ R^{p×n},  Y = [y_1, ..., y_n] ∈ R^{q×n}

represent the data sets of the two views, respectively.
The optimal matrix W is computed by discriminant analysis so that the fused data have the maximum between-class distance and the minimum within-class distance. Writing w = vec(W) and a_i = x_i ⊗ y_i, so that the fused value is z_i = x_i^T W y_i = w^T a_i, the objective function is:

  min_w  sum_{c=1}^{2} sum_{i in class c} (z_i − m^(c))^2
  s.t.  (m^(1) − m^(2))^2 = 1,

where m^(c) = (1/n_c) sum_{i in class c} z_i represents the mean of the fused values of class c, and n_c indicates the size of the class-c sample count. Substituting z_i = w^T a_i, the following objective function is derived:

  min_w  sum_{c=1}^{2} sum_{i in class c} (w^T (a_i − mu_c))^2
  s.t.  (w^T (mu_1 − mu_2))^2 = 1,

where mu_c = (1/n_c) sum_{i in class c} a_i. The above function can be converted into matrix form:

  min_w  w^T S_w w   s.t.  w^T S_b w = 1,

where S_w = sum_{c=1}^{2} sum_{i in class c} (a_i − mu_c)(a_i − mu_c)^T is the within-class scatter matrix and S_b = (mu_1 − mu_2)(mu_1 − mu_2)^T is the between-class scatter matrix.
defining the Languge function of the optimization problem as follows by using the Languge multiplier method:
the deviation of w from the above equation can be obtained
The optimization problem can be characterized as solving the following generalized eigenvalue problem:
and matrixing the eigenvector W corresponding to the maximum eigenvalue to obtain the optimal matrix W.
Step 2: generating a two-view decision tree
For the obtained optimal matrix W, the discriminant value z_i = x_i^T W y_i of each sample pair (x_i, y_i) in the current node is computed. The values are sorted to form n−1 division points q_i = (z_i + z_{i+1})/2; each division point generates a hyperplane that divides the data space of the current node into two subspaces, partition1 and partition2:

  partition1 = {(x_i, y_i) ∈ N : z_i ≤ q_i}
  partition2 = {(x_i, y_i) ∈ N : z_i > q_i}
Here partition1 denotes the relatively pure one of the two subspaces. A purity measure (such as the information gain criterion) is used to select the purest one among all candidate partition1 sets, and the corresponding division point q_i is denoted q; the candidate hyperplane passing through q is the optimal hyperplane. The above operation is repeated on each partition to generate a subtree, until the stop-growing condition of the decision tree is satisfied.
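A sketch of this split-point search, scoring each of the n−1 midpoints by weighted Gini impurity (one possible impurity measure; the text also mentions the information gain criterion; `best_split` is a hypothetical helper name):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(z, labels):
    """Sort the fused values z, form the n-1 midpoints
    q_i = (z_i + z_{i+1}) / 2, and return the midpoint whose induced
    two-way partition has the lowest weighted impurity."""
    order = np.argsort(z)
    z, labels = z[order], labels[order]
    best_q, best_imp = None, np.inf
    for i in range(len(z) - 1):
        q = (z[i] + z[i + 1]) / 2.0            # candidate division point
        left, right = labels[z <= q], labels[z > q]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(z)
        if imp < best_imp:
            best_q, best_imp = q, imp
    return best_q, best_imp
```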
And step 3: generating two-view random forest
K bootstrap sample sets are randomly drawn using the bootstrap resampling technique, and a two-view decision tree is constructed on each sample set, with each decision tree growing freely without restriction. The random forest consists of the K decision trees generated in this way.
In the prediction phase, for an input two-view sample pair (x, y), the final prediction of the random forest is decided jointly by the votes of all decision trees in the forest:

  H(x, y) = argmax_c sum_{i=1}^{K} I(h_i(x, y) = c),

where I(·) is the indicator function and h_i is a single decision tree classifier in the forest.
To validate the effectiveness of the invention, experimental analysis was performed in conjunction with the embodiments of the invention and compared against existing multi-view random forests. The validation data are the UCI standard data sets shown in Table 1.
Table 1 UCI data set description
According to the results of the comparison experiments, the accuracy of the proposed method is significantly improved, verifying the effectiveness of the improved multi-view random forest.
Claims (4)
1. A multi-view learning algorithm based on random forests, taking two views as an example, characterized by comprising the following steps:
In the first step, for a set of aligned two-view sample sets {(x_i, y_i)}, K bootstrap sample sets are randomly drawn with replacement using the bootstrap resampling technique.
In the second step, a node N is generated on each bootstrap sample set, and an optimal matrix W is computed so that the discriminant function values z_i = x_i^T W y_i of the two-view sample pairs (x_i, y_i) in node N have the maximum between-class distance and the minimum within-class distance.
In the third step, the n discriminant function values z_i of the current node are sorted to obtain n−1 division points q_i = (z_i + z_{i+1})/2; all hyperplanes passing through the division points are candidate hyperplanes for partitioning the current data space. The optimal hyperplane of the current partitioned data space is computed using an impurity measure.
In the fourth step, subtrees are generated according to the second step in the partitions created by the hyperplane respectively, and recursion is performed to finally obtain a two-view decision tree.
In the fifth step, two-view decision trees are generated on the K bootstrap sample sets respectively to form the multi-view random forest. The final prediction of the forest is decided jointly by the votes of all decision trees:

  H(x, y) = argmax_c sum_{i=1}^{K} I(h_i(x, y) = c),

where I(·) is the indicator function and h_i represents a single decision tree classifier in the forest.
2. The method of claim 1, wherein in the second step the two-view information is fused through the function z_i = (w_x^T x_i)(w_y^T y_i), the view-information fusion being carried out in the construction stage of the decision tree. For a two-view sample pair (x_i, y_i), each view is first projected onto a vector w_x or w_y respectively to obtain the projection values w_x^T x_i and w_y^T y_i, and the view data are then fused by way of an inner product: z_i = (w_x^T x_i)(w_y^T y_i). The fused sample z_i makes full use of the correlation of the two views, so that the view data can achieve better accuracy in classification.
3. The method of claim 1, wherein in the second step the optimal matrix W is computed so that the discriminant function values z_i = x_i^T W y_i of the two-view sample set of the current node have the maximum between-class distance and the minimum within-class distance, characterized in that the objective function of the second step is:

  min_W  sum_{c=1}^{2} sum_{i in class c} (z_i − m^(c))^2
  s.t.  (m^(1) − m^(2))^2 = 1,

where m^(c) = (1/n_c) sum_{i in class c} z_i represents the mean of the fused values of class c, and n_c indicates the size of the class-c sample count.
4. The method of claim 1, wherein in the third step the subspaces are partitioned on the basis of the fused two-view sample values, characterized in that the discriminant function values z_i of the samples are sorted to obtain the division points, the impurity of the subspaces corresponding to every division point is calculated with an impurity measure, and the division point whose subspaces have the lowest impurity is the optimal division point of the currently partitioned data space.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010629341.0A (CN111798003A) | 2020-07-02 | 2020-07-02 | Multi-view learning algorithm based on random forest |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111798003A | 2020-10-20 |
Family
ID=72811093
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010629341.0A (pending) | CN111798003A (en) | 2020-07-02 | 2020-07-02 |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN111798003A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2023105348A1 | 2021-12-06 | 2023-06-15 | International Business Machines Corporation | Accelerating decision tree inferences based on tensor operations |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |