CN111798003A - Multi-view learning algorithm based on random forest - Google Patents

Multi-view learning algorithm based on random forest

Info

Publication number
CN111798003A
Authority
CN
China
Prior art keywords
view
decision tree
sample
random forest
views
Prior art date
Legal status
Pending
Application number
CN202010629341.0A
Other languages
Chinese (zh)
Inventor
陈松灿
夏笑秋
Current Assignee
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202010629341.0A
Publication of CN111798003A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

Random forests are among the most classical machine learning algorithms and have been widely applied. Surprisingly, however, although multi-view data are abundant and have been studied extensively, random forests designed for multi-view scenarios remain scarce. The only existing approach to multi-view learning with random forests generates a separate random forest for each view and fuses the multi-view information only at decision time. A significant drawback of this approach is that correlations between views are not exploited during the construction of the forests, which wastes information resources. To remedy this deficiency, the invention proposes an improved multi-view learning algorithm based on random forests. Specifically, the views are fused during the generation of each decision tree, so that the information interaction between views is built into the tree-construction stage and the complementary information between views is exploited throughout the generation of the random forest. In addition, the invention uses discriminant analysis to generate discriminative decision boundaries for the decision trees, making them better suited to classification.

Description

Multi-view learning algorithm based on random forest
Technical Field
The invention belongs to the field of machine learning, and particularly relates to a multi-view learning method based on a random forest, which is used for solving the classification problem in a multi-view scene.
Background
The random forest, first proposed by Breiman in 2001, has become one of the most widely used ensemble learning algorithms. A random forest constructs multiple decision trees using random resampling and random node-splitting strategies, and then obtains the final classification result by voting. Thanks to its advantages of high accuracy, good interpretability, low overfitting risk, and good noise tolerance, it has achieved impressive success in many fields, including computer vision and data mining, and has stimulated extensive follow-up research, producing variants such as dynamic random forests and deep forests.
However, existing random forests and their variants focus almost exclusively on single-view learning, and random forests for multi-view learning are rare. Many real-world classification problems are inherently multi-view, because data can often be characterized from several complementary aspects, while a single view generally cannot describe the full picture of the data. For example, a picture can be represented jointly by its texture, shape, and color features, forming a set of multi-view data. Leveraging the complementary information in different views can improve generalization performance, which has driven the wide deployment of multi-view learning. Surprisingly, however, there is little random-forest-based multi-view learning. To our knowledge, only two such works exist: a multi-view random forest proposed in 2015 for pedestrian detection, and a difference-based multi-view random forest proposed in 2019 and used to study radioactive groups. Both methods generate a separate random forest for each view and fuse the multi-view information only in the decision (back-end) stage; hence the correlation among views is never exploited during forest construction, which is undoubtedly a waste of information resources. To overcome this defect, the invention proposes an improved multi-view learning algorithm based on random forests that integrates view-interaction information into the entire construction stage of each decision tree, thereby fully exploiting inter-view complementary information throughout forest generation.
Disclosure of Invention
As analyzed above, the conventional way to build random forests for a multi-view scenario is to generate a separate random forest for each view and then decide the final prediction by voting over the forests. The disadvantage of these methods is that the complementary information between views is only exploited in the back-end decision phase. The aim of the invention is to perform information interaction between views throughout the construction stage of the random forest, so that the correlation between views is fully utilized during the entire forest-building process.
In order to utilize multi-view information throughout the decision-tree construction stage, the following two problems need to be solved:
(1) How are the multi-view data fused during the construction phase of the decision tree?
(2) How can the fused data be used for classification?
Regarding the first problem, each view can be projected separately and the projections combined by an inner product. Taking two views as an example, the sample pair (x_i, y_i) is projected onto vectors w_x and w_y respectively, and the view data are fused via the inner product:
z_i = (w_x^T x_i)(w_y^T y_i) = x_i^T W y_i,
wherein W = w_x w_y^T; more generally, the matrix W collects several such pairs of projection directions.
regarding the second problem, discriminant analysis may be performed. For sample sets
Figure BSA0000212840210000023
Finding an optimal matrix W so as to fuse
Figure BSA0000212840210000024
The distance between classes is the largest and the distance within the class is the smallest. And for the fused samples, calculating a hyperplane of the current optimal partition data space by using an impurity degree measurement method, and generating a sub-tree in each partition created by the hyperplane. And performing recursion according to the steps to finally obtain a two-view decision tree.
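The fusion step above can be sketched as follows. This is an illustrative example on synthetic data (the array shapes, variable names, and random values are assumptions for illustration, not taken from the patent); it also checks that a rank-one matrix W = w_x w_y^T reproduces the "project each view, then take the inner product" description:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, n = 4, 3, 5            # assumed view dimensions and sample count
X = rng.normal(size=(p, n))  # columns are view-1 samples x_i
Y = rng.normal(size=(q, n))  # columns are view-2 samples y_i

# Fuse each aligned pair (x_i, y_i) into a single scalar z_i = x_i^T W y_i.
W = rng.normal(size=(p, q))
z = np.einsum('pi,pq,qi->i', X, W, Y)

# For a rank-one W = w_x w_y^T the fused value equals the product of the
# two per-view projections (w_x^T x_i)(w_y^T y_i).
wx, wy = rng.normal(size=p), rng.normal(size=q)
z_proj = (wx @ X) * (wy @ Y)
z_rank1 = np.einsum('pi,pq,qi->i', X, np.outer(wx, wy), Y)
assert np.allclose(z_proj, z_rank1)
```
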
Taking two-view data as an example, the specific implementation of the invention is as follows:
(1) Fusing the two-view data
For a set of aligned two-view samples {(x_i, y_i)}_{i=1}^n, discriminant analysis is used to compute an optimal matrix W such that the fused data {z_i = x_i^T W y_i}_{i=1}^n have the largest between-class distance and the smallest within-class distance.
(2) Training a two-view decision tree
For the fused samples, an impurity measure is used to compute the hyperplane that currently best partitions the data space; subtrees are generated in the two partitions created by the hyperplane, and recursing in this manner finally yields a two-view decision tree.
(3) Training a two-view random forest
K bootstrap sample sets are drawn at random by bootstrap resampling, and a two-view decision tree is constructed on each of the K sample sets; every tree grows freely to its full extent, i.e. without pruning. The final prediction of the forest is obtained by majority voting.
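A minimal sketch of this bootstrap step, under the assumption (illustrative, not from the patent) that the two views are stored as row-aligned matrices; the key point is that the same resampled indices must index both views so the sample pairs stay aligned:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 10                   # assumed sample count and number of trees
X = rng.normal(size=(n, 4))      # view 1
Y = rng.normal(size=(n, 3))      # view 2, row-aligned with X

# Draw K bootstrap sample sets: sample n indices with replacement and
# index BOTH views with the same indices to keep the pairs aligned.
bootstraps = [rng.integers(0, n, size=n) for _ in range(K)]
resamples = [(X[idx], Y[idx]) for idx in bootstraps]
```

Each element of `resamples` would then train one two-view decision tree.
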
Drawings
FIG. 1 is a flow chart of a random forest based two-view learning algorithm;
FIG. 2 is a flow chart of a two-view based decision tree algorithm;
FIG. 3 is a multi-view random forest accuracy comparison graph.
Detailed Description
The technical content of the invention is further explained below with reference to the drawings. The experimental data in this embodiment all come from real data sets in the UCI standard repository. To generate two-view data, each D-dimensional sample is split following the practice of related work: the first ⌈D/2⌉ features serve as the first view and the remaining features as the second view.
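The view construction just described can be sketched as follows (synthetic stand-in data; the dimensionality and sample count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 7, 10                      # assumed dimensionality and sample count
data = rng.normal(size=(n, D))    # stand-in for a UCI data matrix

d1 = -(-D // 2)                   # ceil(D/2) using integer arithmetic
view1 = data[:, :d1]              # first ceil(D/2) features -> view 1
view2 = data[:, d1:]              # remaining features       -> view 2
```
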
FIG. 1 shows a flow chart of a random forest-based two-view learning algorithm provided by the invention, which specifically comprises the following steps:
step 1: fusing two-view data
Let {(x_i, y_i) ∈ R^p × R^q}_{i=1}^n be a set of two-view sample pairs, and let the data matrices
X = [x_1, ..., x_n] ∈ R^{p×n}, Y = [y_1, ..., y_n] ∈ R^{q×n}
represent the data sets of the two views, respectively.
Discriminant analysis is used to compute an optimal matrix W such that the fused data have the largest between-class distance and the smallest within-class distance. Writing z_i = x_i^T W y_i for the fused value of the i-th pair, the objective function is:
min_W Σ_{c=1,2} Σ_{i ∈ class c} (z_i - m^(c))^2
s.t. (m^(1) - m^(2))^2 = 1,
wherein m^(c) = (1/n_c) Σ_{i ∈ class c} x_i^T W y_i represents the mean of the fused values of class c, and n_c denotes the number of samples in class c. The following objective function is derived:
min_W Σ_{c=1,2} Σ_{i ∈ class c} (x_i^T W y_i - m^(c))^2
s.t. (tr(W^T A) - tr(W^T B))^2 = 1,
wherein A = (1/n_1) Σ_{i ∈ class 1} x_i y_i^T and B = (1/n_2) Σ_{i ∈ class 2} x_i y_i^T are the class-mean matrices, so that m^(1) = tr(W^T A) and m^(2) = tr(W^T B). The above function can be converted into:
min_w w^T S_w w
s.t. (w^T (a - b))^2 = 1,
wherein
w = vec(W^T), a = vec(A^T), b = vec(B^T),
S_w = Σ_{c=1,2} Σ_{i ∈ class c} (c_i - μ^(c))(c_i - μ^(c))^T, with c_i = vec(y_i x_i^T) (so that z_i = w^T c_i), μ^(1) = a and μ^(2) = b.
defining the Languge function of the optimization problem as follows by using the Languge multiplier method:
Figure BSA0000212840210000041
the deviation of w from the above equation can be obtained
Figure BSA0000212840210000042
The optimization problem can be characterized as solving the following generalized eigenvalue problem:
Figure BSA0000212840210000043
and matrixing the eigenvector W corresponding to the maximum eigenvalue to obtain the optimal matrix W.
Step 2: generating a two-view decision tree
For the obtained optimal matrix W, a sample pair (x) is calculatedi,yi) Is determined, and is sorted to form n-1 division points qi=(zi+zi+1) And 2, generating a hyperplane by each partition point, and dividing the data space of the current node into two subspaces of partition1 and partition 2:
partition1={(xi,yi)∈N,s.t.pi≤qi}
partition2={(xi,yi)∈N,s.t.pi>qi}
where partition1 represents the relatively pure one of the two partitions. The method for measuring the purity (such as information gain criterion) is used to select the purest sample information from all partitions 1, and the corresponding dividing point q is usediDenoted as q. And q-passing candidate hyperplanes are the optimal hyperplanes. Repeating the above operation for each partition to generate a subtree until the stop growing condition of the decision tree is satisfied.
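The search for the optimal division point can be sketched as follows; the Gini index stands in for the unspecified impurity measure, and all data are synthetic (this is an illustrative sketch, not the patented code):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(z, labels):
    """Scan the n-1 midpoints q_i = (z_i + z_{i+1})/2 of the sorted fused
    values and return the threshold with the lowest weighted impurity."""
    order = np.argsort(z)
    z_sorted, y_sorted = z[order], labels[order]
    best_q, best_imp = None, np.inf
    for i in range(len(z) - 1):
        q = (z_sorted[i] + z_sorted[i + 1]) / 2.0
        left, right = y_sorted[:i + 1], y_sorted[i + 1:]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(z)
        if imp < best_imp:
            best_q, best_imp = q, imp
    return best_q, best_imp

z = np.array([0.1, 0.2, 0.9, 1.1, 1.3])   # fused values z_i
y = np.array([0, 0, 1, 1, 1])             # class labels
q, imp = best_split(z, y)                  # perfectly separable at q = 0.55
```
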
Step 3: Generating a two-view random forest
K bootstrap sample sets are drawn at random by bootstrap resampling, and a two-view decision tree is constructed on each sample set; every decision tree grows freely without restriction. The random forest consists of the K decision trees generated in this way.
In the prediction phase, for an input two-view sample pair (x, y), the final prediction of the random forest is decided jointly by the votes of all decision trees in the forest:
H(x, y) = argmax_c Σ_{i=1}^{K} I(h_i(x, y) = c),
where I(·) is the indicator function and h_i is a single decision tree classifier in the forest.
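The voting rule H(x, y) = argmax_c Σ_i I(h_i(x, y) = c) reduces, for a single sample pair, to a majority vote over the K tree outputs. A sketch with stand-in predictions (the vote values are hypothetical, not experimental results):

```python
import numpy as np

def forest_vote(tree_predictions):
    """Majority vote: the class predicted by the most trees wins."""
    classes, counts = np.unique(np.asarray(tree_predictions), return_counts=True)
    return classes[np.argmax(counts)]

votes = [1, 0, 1, 1, 2]   # hypothetical outputs of K = 5 trees for one pair
winner = forest_vote(votes)
```
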
To validate the effectiveness of the invention, experiments were carried out following the embodiment above and compared with the existing multi-view random forests. The validation data are the UCI standard data sets described in Table 1.
Table 1 UCI data set description
According to the accuracy results of the comparative experiments, the accuracy of the proposed method is clearly improved, which verifies the effectiveness of the improved multi-view random forest.

Claims (4)

1. A multi-view learning algorithm based on random forests, described here for two views, characterized by comprising the following steps:
In the first step, for a set of aligned two-view samples {(x_i, y_i)}_{i=1}^n, K bootstrap sample sets are drawn at random, with replacement, by bootstrap resampling.
In the second step, a node N is generated on each bootstrap sample set, and an optimal matrix W is computed such that the discriminant function values z_i = x_i^T W y_i of the two-view sample pairs (x_i, y_i) in node N have the largest between-class distance and the smallest within-class distance.
In the third step, the n discriminant function values z_1 ≤ ... ≤ z_n of the current node are sorted to obtain n-1 division points q_i = (z_i + z_{i+1})/2; all hyperplanes passing through the division points are candidate hyperplanes for partitioning the current data space, and the optimal hyperplane is computed using an impurity measure.
In the fourth step, subtrees are generated in each partition created by the hyperplane by recursing from the second step, finally yielding a two-view decision tree.
In the fifth step, two-view decision trees are generated on the K bootstrap sample sets respectively, forming the multi-view random forest. The final prediction of the forest is decided jointly by the votes of all decision trees:
H(x, y) = argmax_c Σ_{i=1}^{K} I(h_i(x, y) = c),
where I(·) is the indicator function and h_i represents a single decision tree classifier in the forest.
2. The algorithm of claim 1, wherein in the second step the two-view information is fused by the function z_i = x_i^T W y_i, the view-information fusion being carried out in the construction stage of the decision tree. For a two-view sample pair (x_i, y_i), each view is first projected onto a vector, w_x or w_y, to obtain the projection values w_x^T x_i and w_y^T y_i; the view data are then fused via the inner product:
z_i = (w_x^T x_i)(w_y^T y_i).
The fused samples z_i fully exploit the correlation of the two views, so that the view data achieve better accuracy in classification.
3. The algorithm of claim 1, wherein in the second step the optimal matrix W is computed such that the discriminant function values {z_i = x_i^T W y_i} of the two-view sample set {(x_i, y_i)} of the current node have the largest between-class distance and the smallest within-class distance, characterized in that the objective function of the second step is:
min_W Σ_{c=1,2} Σ_{i ∈ class c} (z_i - m^(c))^2
s.t. (m^(1) - m^(2))^2 = 1,
wherein m^(c) = (1/n_c) Σ_{i ∈ class c} x_i^T W y_i represents the mean of the fused values of class c, and n_c denotes the number of samples in class c.
4. The algorithm of claim 1, wherein in the third step the subspaces are partitioned on the basis of the fused two-view sample values, characterized in that the discriminant function values z_i of the samples are sorted to obtain the division points, the impurity of the subspaces corresponding to every division point is computed using an impurity measure, and the division point whose subspaces have the lowest impurity is the optimal division point of the current partitioned data space.
CN202010629341.0A 2020-07-02 2020-07-02 Multi-view learning algorithm based on random forest Pending CN111798003A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010629341.0A CN111798003A (en) 2020-07-02 2020-07-02 Multi-view learning algorithm based on random forest


Publications (1)

Publication Number Publication Date
CN111798003A (en) 2020-10-20

Family

ID=72811093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010629341.0A Pending CN111798003A (en) 2020-07-02 2020-07-02 Multi-view learning algorithm based on random forest

Country Status (1)

Country Link
CN (1) CN111798003A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023105348A1 (en) * 2021-12-06 2023-06-15 International Business Machines Corporation Accelerating decision tree inferences based on tensor operations



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination