CN111798003A - Multi-view learning algorithm based on random forest - Google Patents
- Publication number
- CN111798003A (application CN202010629341.0A)
- Authority
- CN
- China
- Prior art keywords
- view
- decision tree
- sample
- random forest
- views
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Abstract
Random forests are among the most classical machine learning algorithms and have gained widespread use. However, although multi-view data are abundant and have been studied extensively, random forests designed for multi-view scenarios remain surprisingly scarce. The only existing approach to multi-view learning with random forests generates a separate random forest for each view and then fuses the multi-view information at decision time. A significant drawback of this approach is that the correlation among views is not exploited during the construction phase of the random forests, which wastes information resources. To make up for this deficiency, the invention provides an improved multi-view learning algorithm based on random forests. Specifically, view fusion is carried out during the generation of each decision tree: information interaction between views is integrated into the construction stage of the decision tree, so that complementary information between views is utilized throughout the generation of the random forest. In addition, the invention generates discriminative decision boundaries for the decision tree through discriminant analysis, making the tree better suited to classification.
Description
Technical Field
The invention belongs to the field of machine learning, and particularly relates to a multi-view learning method based on a random forest, which is used for solving the classification problem in a multi-view scene.
Background
The random forest, first proposed by Breiman in 2001, has become one of the most widely used ensemble learning algorithms. A random forest constructs multiple decision trees using random resampling and a random node-splitting strategy, and then obtains the final classification result by voting. Thanks to its advantages of high accuracy, good interpretability, low overfitting risk, and good noise tolerance, it has achieved impressive success in many fields including computer vision and data mining, has stimulated extensive follow-up research, and has given rise to variants such as dynamic random forests and deep forests.
However, existing random forests and their variants focus almost exclusively on single-view learning scenarios; random forests for multi-view learning are rare. As is well known, many real-world classification problems are inherently multi-view, because data can often be characterized from multiple complementary aspects, and single-view data generally cannot describe the full picture of the data information. For example, a picture can be represented jointly by its texture, shape, and color features, forming a set of multi-view data. Leveraging complementary information from different views can improve generalization performance, which has driven the widespread deployment of multi-view learning. Surprisingly, however, there is little random-forest-based multi-view learning. To our knowledge, there are only two multi-view works using random forests: a multi-view random forest proposed in 2015 for pedestrian detection, and a difference-based multi-view random forest proposed in 2019 and used to study radioactive groups. Both methods generate a separate random forest for each view and then fuse the multi-view information in the decision (back-end) stage. They therefore do not utilize the correlation among views during the construction stage of the random forest, which is undoubtedly a waste of information resources. To overcome this defect, the invention provides an improved multi-view learning algorithm based on random forests, integrating view-interaction information into the entire construction stage of the decision tree and thereby fully utilizing inter-view complementary information throughout random forest generation.
Disclosure of Invention
As analyzed above, the conventional way to build random forests for a multi-view scenario is to generate a corresponding random forest for each view and then decide the final prediction by voting across the forests. The disadvantage of these methods is that the complementary information between views is utilized only in the back-end decision phase. The invention aims to perform information interaction between views throughout the construction stage of the random forest, so that the correlation between views is fully exploited over the entire process.
In order to utilize multi-view information all the way through the decision tree construction stage, the following two problems need to be solved:
(1) How are the multi-view data fused during the construction phase of the decision tree?
(2) How can the fused data be used for classification?
Regarding the first problem, each view can be projected separately and the projections combined by an inner product. Taking two views as an example, a sample pair (x_i, y_i) is projected onto vectors w_x and w_y respectively, and the view data are fused by way of an inner product:

  z_i = (w_x^T x_i)(w_y^T y_i).

Regarding the second problem, discriminant analysis may be performed. For a sample set {(x_i, y_i)}, an optimal matrix W is found so that the fused values z_i = x_i^T W y_i (with W generalizing the rank-one product w_x w_y^T) have the largest between-class distance and the smallest within-class distance. For the fused samples, the hyperplane that currently best partitions the data space is computed using an impurity measure, and a subtree is generated in each partition created by the hyperplane. Recursing according to these steps finally yields a two-view decision tree.
Taking the two-view data as an example, the specific implementation process of the invention is as follows:
(1) fusing two-view data
For a set of aligned two-view sample sets {(x_i, y_i)}, an optimal matrix W is computed by discriminant analysis so that the fused data z_i = x_i^T W y_i have the largest between-class distance and the smallest within-class distance.
(2) Training a two-view decision tree
For the fused samples, the hyperplane that currently best partitions the data space is computed using an impurity measure, subtrees are generated in the two partitions created by the hyperplane respectively, and recursion proceeds in this manner to finally obtain a two-view decision tree.
(3) Training two-view random forest
K bootstrap sample sets are randomly drawn using the bootstrap resampling method, and a two-view decision tree is constructed on each of the K sample sets, where each tree grows freely to its maximum extent, i.e., without pruning. The final prediction of the forest is obtained by majority voting.
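The three steps above can be sketched as a training loop; `fit_two_view_tree` is a hypothetical stand-in for steps (1)-(2), i.e., fusing the views and growing one unpruned two-view decision tree:

```python
import numpy as np

def train_two_view_forest(X, Y, labels, K, fit_two_view_tree, seed=0):
    """Step (3): draw K bootstrap sample sets (sampling n indices with
    replacement) and grow one two-view decision tree on each."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    trees = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)   # bootstrap resample
        trees.append(fit_two_view_tree(X[idx], Y[idx], labels[idx]))
    return trees
```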
Drawings
FIG. 1 is a flow chart of a random forest based two-view learning algorithm;
FIG. 2 is a flow chart of a two-view based decision tree algorithm;
FIG. 3 is a multi-view random forest accuracy comparison graph.
Detailed Description
The technical content of the present invention is further explained below with reference to the drawings. The experimental data in this embodiment all come from real data sets in the UCI standard repository. To generate the two-view data, each D-dimensional sample is cut following the relevant practice in the literature: the first group of dimensions serves as the first view and the remaining dimensions serve as the second view.
FIG. 1 shows a flow chart of a random forest-based two-view learning algorithm provided by the invention, which specifically comprises the following steps:
step 1: fusing two-view data
Let {(x_i, y_i) ∈ R^p × R^q} be a two-view sample set, and let the data matrices

  X = [x_1, ..., x_n] ∈ R^{p×n},  Y = [y_1, ..., y_n] ∈ R^{q×n}

represent the data sets of the two views, respectively.
The optimal matrix W is computed by discriminant analysis so that the fused data have the maximum between-class distance and the minimum within-class distance. Writing w = vec(W) and a_i = x_i ⊗ y_i, so that the fused value is z_i = x_i^T W y_i = w^T a_i, the objective function is:

  min_w  sum_{c=1}^{2} sum_{i in class c} (z_i − m^(c))^2
  s.t.  (m^(1) − m^(2))^2 = 1,

where m^(c) = (1/n_c) sum_{i in class c} z_i represents the mean of the fused values of class c, and n_c indicates the size of the class-c sample count. Substituting z_i = w^T a_i, the following objective function is derived:

  min_w  sum_{c=1}^{2} sum_{i in class c} (w^T (a_i − mu_c))^2
  s.t.  (w^T (mu_1 − mu_2))^2 = 1,

where mu_c = (1/n_c) sum_{i in class c} a_i. The above function can be converted into matrix form:

  min_w  w^T S_w w   s.t.  w^T S_b w = 1,

where S_w = sum_{c=1}^{2} sum_{i in class c} (a_i − mu_c)(a_i − mu_c)^T is the within-class scatter matrix and S_b = (mu_1 − mu_2)(mu_1 − mu_2)^T is the between-class scatter matrix.
defining the Languge function of the optimization problem as follows by using the Languge multiplier method:
the deviation of w from the above equation can be obtained
The optimization problem can be characterized as solving the following generalized eigenvalue problem:
and matrixing the eigenvector W corresponding to the maximum eigenvalue to obtain the optimal matrix W.
Step 2: generating a two-view decision tree
For the obtained optimal matrix W, the discriminant value z_i = x_i^T W y_i of each sample pair (x_i, y_i) in the current node is computed. The values are sorted to form n−1 division points q_i = (z_i + z_{i+1})/2; each division point generates a hyperplane that divides the data space of the current node into two subspaces, partition1 and partition2:

  partition1 = {(x_i, y_i) ∈ N : z_i ≤ q_i}
  partition2 = {(x_i, y_i) ∈ N : z_i > q_i}
Here partition1 denotes the relatively pure one of the two subspaces. A purity measure (such as the information gain criterion) is used to select the purest one among all candidate partition1 sets, and the corresponding division point q_i is denoted q; the candidate hyperplane passing through q is the optimal hyperplane. The above operation is repeated on each partition to generate a subtree, until the stop-growing condition of the decision tree is satisfied.
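A sketch of this split-point search, scoring each of the n−1 midpoints by weighted Gini impurity (one possible impurity measure; the text also mentions the information gain criterion; `best_split` is a hypothetical helper name):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(z, labels):
    """Sort the fused values z, form the n-1 midpoints
    q_i = (z_i + z_{i+1}) / 2, and return the midpoint whose induced
    two-way partition has the lowest weighted impurity."""
    order = np.argsort(z)
    z, labels = z[order], labels[order]
    best_q, best_imp = None, np.inf
    for i in range(len(z) - 1):
        q = (z[i] + z[i + 1]) / 2.0            # candidate division point
        left, right = labels[z <= q], labels[z > q]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(z)
        if imp < best_imp:
            best_q, best_imp = q, imp
    return best_q, best_imp
```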
And step 3: generating two-view random forest
K bootstrap sample sets are randomly drawn using the bootstrap resampling technique, and a two-view decision tree is constructed on each sample set, with each decision tree growing freely without restriction. The random forest consists of the K decision trees generated in this way.
In the prediction phase, for an input two-view sample pair (x, y), the final prediction of the random forest is decided jointly by the votes of all decision trees in the forest:

  H(x, y) = argmax_c sum_{i=1}^{K} I(h_i(x, y) = c),

where I(·) is the indicator function and h_i is a single decision tree classifier in the forest.
To validate the effectiveness of the invention, experimental analysis was performed in conjunction with the embodiments of the invention and compared against existing multi-view random forests. The validation data are the UCI standard data sets shown in Table 1.
Table 1 UCI data set description
According to the results of the comparison experiments, the accuracy of the proposed method is significantly improved, verifying the effectiveness of the improved multi-view random forest.
Claims (4)
1. A multi-view learning algorithm based on random forests, taking two views as an example, characterized by comprising the following steps:
In the first step, for a set of aligned two-view sample sets {(x_i, y_i)}, K bootstrap sample sets are randomly drawn with replacement using the bootstrap resampling technique.
In the second step, a node N is generated on each bootstrap sample set, and an optimal matrix W is computed so that the discriminant function values z_i = x_i^T W y_i of the two-view sample pairs (x_i, y_i) in node N have the maximum between-class distance and the minimum within-class distance.
In the third step, the n discriminant function values z_i of the current node are sorted to obtain n−1 division points q_i = (z_i + z_{i+1})/2; all hyperplanes passing through the division points are candidate hyperplanes for partitioning the current data space. The optimal hyperplane of the current partitioned data space is computed using an impurity measure.
In the fourth step, subtrees are generated according to the second step in the partitions created by the hyperplane respectively, and recursion is performed to finally obtain a two-view decision tree.
In the fifth step, two-view decision trees are generated on the K bootstrap sample sets respectively to form the multi-view random forest. The final prediction of the forest is decided jointly by the votes of all decision trees:

  H(x, y) = argmax_c sum_{i=1}^{K} I(h_i(x, y) = c),

where I(·) is the indicator function and h_i represents a single decision tree classifier in the forest.
2. The method of claim 1, wherein in the second step the two-view information is fused through the function z_i = (w_x^T x_i)(w_y^T y_i), the view-information fusion being carried out in the construction stage of the decision tree. For a two-view sample pair (x_i, y_i), each view is first projected onto a vector w_x or w_y respectively to obtain the projection values w_x^T x_i and w_y^T y_i, and the view data are then fused by way of an inner product: z_i = (w_x^T x_i)(w_y^T y_i). The fused sample z_i makes full use of the correlation of the two views, so that the view data can achieve better accuracy in classification.
3. The method of claim 1, wherein in the second step the optimal matrix W is computed so that the discriminant function values z_i = x_i^T W y_i of the two-view sample set of the current node have the maximum between-class distance and the minimum within-class distance, characterized in that the objective function of the second step is:

  min_W  sum_{c=1}^{2} sum_{i in class c} (z_i − m^(c))^2
  s.t.  (m^(1) − m^(2))^2 = 1,

where m^(c) = (1/n_c) sum_{i in class c} z_i represents the mean of the fused values of class c, and n_c indicates the size of the class-c sample count.
4. The method of claim 1, wherein in the third step the subspaces are partitioned on the basis of the fused two-view sample values, characterized in that the discriminant function values z_i of the samples are sorted to obtain the division points, the impurity of the subspaces corresponding to every division point is calculated with an impurity measure, and the division point whose subspaces have the lowest impurity is the optimal division point of the currently partitioned data space.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010629341.0A (CN111798003A) | 2020-07-02 | 2020-07-02 | Multi-view learning algorithm based on random forest |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111798003A | 2020-10-20 |
Family
ID=72811093
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010629341.0A (pending) | CN111798003A (en) | 2020-07-02 | 2020-07-02 |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN111798003A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2023105348A1 | 2021-12-06 | 2023-06-15 | International Business Machines Corporation | Accelerating decision tree inferences based on tensor operations |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |