CN108280236B - Method for analyzing random forest visual data based on LargeVis - Google Patents

Method for analyzing random forest visual data based on LargeVis

Info

Publication number
CN108280236B
CN108280236B (application CN201810170150.5A)
Authority
CN
China
Prior art keywords
largevis
data
random forest
points
random
Prior art date
Legal status
Active
Application number
CN201810170150.5A
Other languages
Chinese (zh)
Other versions
CN108280236A (en)
Inventor
黄立勤
陈宋
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201810170150.5A priority Critical patent/CN108280236B/en
Publication of CN108280236A publication Critical patent/CN108280236A/en
Application granted granted Critical
Publication of CN108280236B publication Critical patent/CN108280236B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26 Visual data mining; Browsing structured data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method for analyzing random forest visualization data based on LargeVis: a training data set is preprocessed; important features of the training data set are extracted through a random forest; dimensionality reduction is performed with LargeVis; and the random forest is visualized based on LargeVis. Aimed at high-dimensional data, the method uses the feature importance learned by the random forest to form new secondary high-dimensional data and feeds the data reduced by LargeVis into the random forest for predictive analysis and visualization, so that classification accuracy is improved, visualization time is reduced, and different data sets can be accommodated.

Description

Method for analyzing random forest visual data based on LargeVis
Technical Field
The invention relates to pattern recognition, machine learning and big data analysis, in particular to a method for analyzing random forest visual data based on LargeVis.
Background
In the big data era, the dimensionality of data features keeps growing, so analyzing data with a suitable dimensionality reduction method has become particularly important; at the same time, how to visualize high-dimensional data is also a research focus in the current environment. The most classical dimensionality reduction method is PCA (Principal Component Analysis), which not only reduces the dimensionality of high-dimensional data but also removes noise and discovers patterns in the data. PCA replaces the original n features with a smaller number of m features; the new features are linear combinations of the old ones that maximize the sample variance and are made as uncorrelated with each other as possible, so the mapping from old features to new features captures the inherent variability in the data. Researchers later proposed manifold learning, the main family of nonlinear dimensionality reduction algorithms, to which visualization research has been added. The main manifold learning algorithms are ISOMap (isometric mapping), LE (Laplacian eigenmaps) and LLE (locally linear embedding), under the assumption that the data are sampled from some manifold. ISOMap is a non-iterative global optimization algorithm: it modifies MDS (Multidimensional Scaling) by using the geodesic distance (distance along the manifold) instead of the original Euclidean distance between two points in space, so that data lying on a manifold of some dimension can be mapped to a Euclidean space. ISOMap connects the data points into an adjacency graph that discretely approximates the original manifold, and geodesic distances are correspondingly approximated by shortest paths on the graph. On this basis, Maaten recently published work improving the t-SNE algorithm with various tree-based algorithms, which contains two parts: first, a kNN graph is adopted to represent the similarity of points in the high-dimensional space; second, the gradient computation is optimized by splitting it into an attractive force and a repulsive force and applying several optimization techniques. In general, these dimensionality reduction algorithms reduce the number of predictor variables and can provide a framework for interpreting the final result.
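For context only, a minimal sketch of the PCA projection just described (assuming scikit-learn is available; the data and the choice of m = 2 components are illustrative):

import numpy as np
from sklearn.decomposition import PCA

# replace the original n features with m = 2 uncorrelated linear combinations
# that capture the largest sample variance
X = np.random.rand(500, 30)                 # placeholder high-dimensional data
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)                      # (500, 2)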
At present, the t-SNE algorithm in manifold learning is widely applied, but it has the following shortcomings: when processing large-scale high-dimensional data, the efficiency of t-SNE drops markedly (even for the improved algorithms); the parameters of t-SNE are sensitive to the data set, so parameters tuned to give a good visualization on one data set are often found to be unsuitable for another, a large amount of time is spent searching for suitable parameters, and this strongly limits the overall classification model; and when purely raw high-dimensional data enter a model for training and classification after dimensionality reduction, the accuracy is low and the training time is long. In addition, current dimensionality reduction practice generally reduces the raw data directly and classifies it with an existing model, which can suffer from low accuracy and a lack of interpretability of the reduced data.
The invention provides a random forest visualization data analysis algorithm based on LargeVis: for high-dimensional data, the feature importance trained by the random forest is used to form new secondary high-dimensional data, and the data reduced by LargeVis is fed into the random forest for predictive analysis and visualization. The invention thereby also provides a new solution to the problem of feature extraction, classification and visualization of fetal heart rate data.
Disclosure of Invention
The invention aims to provide a method for analyzing random forest visual data based on LargeVis, which is used for overcoming the defects in the prior art.
In order to achieve the purpose, the technical scheme of the invention is as follows: a method for analyzing random forest visual data based on LargeVis is realized according to the following steps:
step S1: preprocessing a training data set;
step S2: extracting, through a random forest, the sample features in the training data set whose importance weight is greater than a preset weight threshold;
step S3: performing dimensionality reduction by adopting LargeVis;
step S4: performing visualization processing of the random forest based on LargeVis.
In an embodiment of the present invention, in the step S1, the SMOTE method is adopted to handle data imbalance, and data outliers are handled by replacing them with the median or with a value that does not appear in the data.
In an embodiment of the present invention, in the step S2, the method further includes the following steps:
step S21: preliminary estimation and sorting;
step S211: sorting the feature variables in the random forest in descending order of VI (variable importance);
step S212: determining a deletion proportion, and removing the 20% of feature variables whose importance is smallest (below the preset weight threshold) from the currently descending-sorted feature variables, thereby obtaining a new feature set;
step S213: establishing a new random forest with the new feature set, calculating the VI of each feature in the feature set, and sorting again;
step S214: repeating the above steps until m features remain;
step S22: calculating the corresponding out-of-bag error rate for each feature set obtained in step S21 and the random forest built from it, and taking the feature set with the lowest out-of-bag error rate as the finally selected feature set.
In an embodiment of the present invention, in the step S3, according to the result obtained in step S2, a space partition is obtained through random projection trees, and on this basis the K nearest neighbors of each sample point are searched to obtain a preliminary kNN graph; then, based on the idea that a neighbor of a neighbor is likely also a neighbor, a neighbor search algorithm is used to find potential neighbors, the distances between these candidate neighbors and the current point are calculated and placed into a min-heap, and the k nodes with the smallest distances are taken as the k neighbors, giving the final kNN graph.
In one embodiment of the invention, for an unweighted network, let $y_i$ and $y_j$ denote two points in the low-dimensional space; the probability that the two points have a binary edge $e_{ij}$ in the kNN graph is

$$P(e_{ij}=1)=f\bigl(\|y_i-y_j\|^2\bigr)$$

where $f(\cdot)$ takes a heavy-tailed form similar to the t distribution used in t-SNE, e.g.

$$f(x)=\frac{1}{1+x}$$

The smaller the distance between $y_i$ and $y_j$, the higher the probability that the two points have a binary edge in the kNN graph; conversely, the larger the distance between $y_i$ and $y_j$, the smaller that probability.

For a weighted network, the probability that the edge has weight $w_{ij}$ is

$$P(e_{ij}=w_{ij})=P(e_{ij}=1)^{w_{ij}}$$

The overall optimization objective is to maximize the probability that positive-sample node pairs have connecting edges in the kNN graph and to minimize the probability that negative-sample node pairs have connecting edges in the kNN graph. Denoting by $\gamma$ the weight assigned to the negative-sample edges and taking logarithms, the optimization objective becomes

$$O=\sum_{(i,j)\in E} w_{ij}\log P(e_{ij}=1)+\sum_{(i,j)\in \bar{E}} \gamma\,\log\bigl(1-P(e_{ij}=1)\bigr)$$

For each point $i$, $M$ points are randomly selected according to a noise distribution $P_n(j)$ to form negative samples with $i$; the noise distribution adopted is

$$P_n(j)\propto d_j^{0.75}$$

where $d_j$ is the degree of point $j$, and the objective function becomes

$$O=\sum_{(i,j)\in E} w_{ij}\Bigl(\log P(e_{ij}=1)+\sum_{k=1}^{M}\mathbb{E}_{j_k\sim P_n(j)}\bigl[\gamma\,\log\bigl(1-P(e_{ij_k}=1)\bigr)\bigr]\Bigr)$$
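Under the illustrative kernel $f(x)=\frac{1}{1+x}$ written above, and writing $d_{ij}^2=\|y_i-y_j\|^2$, the per-edge gradients that drive the optimization can be sketched as

$$\frac{\partial}{\partial y_i}\,w_{ij}\log P(e_{ij}=1)=-\frac{2\,w_{ij}\,(y_i-y_j)}{1+d_{ij}^2}\qquad\text{(attraction along a positive edge)}$$

$$\frac{\partial}{\partial y_i}\,\gamma\log\bigl(1-P(e_{ij}=1)\bigr)=\frac{2\,\gamma\,(y_i-y_j)}{d_{ij}^2\,(1+d_{ij}^2)}\qquad\text{(repulsion along a sampled negative edge)}$$

These closed forms follow from that particular choice of $f$ and are the quantities updated by the stochastic gradient training described below; a different kernel would change them accordingly.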
In an embodiment of the invention, after the negative sampling and edge sampling optimizations are completed, asynchronous stochastic gradient descent is adopted for training.
In an embodiment of the present invention, the time complexity of LargeVis is linear in the number of nodes in the network.
In an embodiment of the invention, in the step S4, a distribution map of the low dimensional data is drawn according to the obtained low dimensional spatial data.
Compared with the prior art, the invention has the following beneficial effects:
(1) The LargeVis-based method improves the running speed, adapts well to different data sets, and effectively improves the performance of the whole model.
(2) The invention adopts the random forest, an interpretable model, to perform a first round of feature extraction on the data, discarding unnecessary features and keeping the important ones to form new feature samples; dimensionality reduction is then performed and the reduced data are input into the random forest for classification. On the one hand this improves the performance of the whole model; on the other hand the reduced data can be visualized more intuitively and are more interpretable for the user.
(3) The model of the invention contains only two basic models, yet it can realize classification, visualization, dimensionality reduction, data preprocessing and feature extraction, and is more usable than other algorithms.
Drawings
FIG. 1 is a flow chart of a method for analyzing random forest visual data based on LargeVis in the invention.
Detailed Description
The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.
The invention relates to a method for analyzing random forest visual data based on LargeVis, which is realized according to the following steps:
step S1: preprocessing a training data set;
step S2: extracting, through a random forest, the sample features in the training data set whose importance weight is greater than a preset weight threshold;
step S3: performing dimensionality reduction by adopting LargeVis;
step S4: performing visualization processing of the random forest based on LargeVis.
In this embodiment, problems of sample imbalance and abnormal values may occur in practical applications and can lead to poor classification results. An unbalanced training data set causes many problems in pattern recognition: for example, the classifier tends to "learn" the class with the largest proportion of samples, i.e., in order to maximize its overall accuracy it becomes biased towards the majority class. In practical applications this bias is not acceptable. To obtain a more uniform distribution of sample data, this embodiment solves the problem with the synthetic minority oversampling technique (SMOTE), creating "synthetic" instances for minority classes with few samples, and solves the problem of outliers with the median.
In this embodiment, the synthetic minority oversampling algorithm is as follows:
1. For each sample x in the minority class, the distances from x to all samples in the minority class sample set D are calculated using the Euclidean distance, giving the k neighbors of x.
2. A sampling ratio is set according to the sample imbalance ratio to determine a sampling multiplier N; for each minority-class sample x, several samples are randomly selected from its k neighbors, a selected neighbor being denoted y.
3. For each randomly selected neighbor y, a new sample is constructed together with the original sample according to the following formula:

$$x_{\text{new}} = x + \mathrm{rand}(0,1)\cdot|x-y|$$
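A minimal sketch of this synthesis in Python (NumPy only; the function name and parameters are illustrative, and the interpolation follows the formula as written above, whereas the commonly cited SMOTE formula uses (y - x) instead of |x - y|):

import numpy as np

def smote_synthesize(minority, n_new, k=5, seed=0):
    # create n_new "synthetic" minority-class samples following steps 1-3 above
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        # 1. pick a minority sample x and find its k nearest minority neighbors (Euclidean)
        i = rng.integers(len(minority))
        x = minority[i]
        dists = np.linalg.norm(minority - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]           # skip x itself
        # 2. randomly choose one neighbor y among the k
        y = minority[rng.choice(neighbors)]
        # 3. interpolate: x_new = x + rand(0,1) * |x - y|
        synthetic.append(x + rng.random() * np.abs(x - y))
    return np.vstack(synthetic)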
In this embodiment, an unbalanced training data set at the preprocessing stage causes many problems in pattern recognition, so the SMOTE method described above is adopted to solve it. Outliers often appear in the data and bias the precision of the trained model, so in this embodiment they are replaced with the median or with a value that does not occur in the data.
Further, in the stage of extracting important features with the random forest, i.e., after training of the random forest is completed, the sample features are sorted by importance weight; this stage extracts the features with higher weight in the samples and further comprises the following steps:
1: preliminary estimation and ranking
a) The feature variables in the random forest are sorted in descending order by VI (variable import).
b) Determining the deletion proportion, and removing 20% of the characteristic variables smaller than a preset specific gravity threshold value from the current characteristic variables which are arranged in a descending order, thereby obtaining a new characteristic set.
c) And establishing a new random forest by using the new feature set, and calculating and sequencing the VI of each feature in the feature set.
d) The above steps are repeated until m features remain. The m value is determined by the entire model, preferably the set of features with the lowest error rate.
2: and (3) calculating a corresponding out-of-bag error rate (OOB err) according to each feature set obtained in the step (1) and the random forest established by the feature sets, and taking the feature set with the lowest out-of-bag error rate as a final selected feature set.
In this embodiment, if high-dimensional data is directly input to the dimensionality reduction model for dimensionality reduction, the data processing time is too long, the calculation parameters are too many, and performance may be reduced.
Further, in the LargeVis dimensionality reduction processing stage:
inputting: data samples of new features selected by the random forest obtained through step S2.
Firstly, a space division is obtained by utilizing a random projection tree, and a K nearest neighbor (kNN, K-nearest neighbor) graph which does not require complete accuracy is obtained on the basis of searching the K nearest neighbor of each sample point. And then searching potential neighbors by using a neighbor search algorithm according to the idea of neighbor direct, calculating the distances between the neighbors and the current point and between the neighbors and the current point, putting the distances into a small root heap, and taking k nodes with the minimum distances as k neighbors to finally obtain an accurate kNN graph.
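A simplified, single-machine sketch of this two-stage construction (the single random projection split stands in for the full random projection trees, and the function name and parameters are illustrative):

import heapq
import numpy as np

def knn_graph(X, k, n_iter=3, seed=0):
    # approximate kNN graph: rough candidates from one random projection split,
    # then refinement based on the idea that a neighbor of a neighbor is likely a neighbor
    rng = np.random.default_rng(seed)
    n = len(X)
    proj = X @ rng.normal(size=X.shape[1])
    side = proj > np.median(proj)
    neighbors = []
    for i in range(n):
        pool = np.flatnonzero(side == side[i])
        pool = pool[pool != i]
        if len(pool) < k:                       # fall back to all other points
            pool = np.delete(np.arange(n), i)
        neighbors.append(list(rng.choice(pool, size=k, replace=False)))
    for _ in range(n_iter):                     # neighbor-of-neighbor exploration
        refined = []
        for i in range(n):
            cand = set(neighbors[i])
            for j in neighbors[i]:
                cand.update(neighbors[j])
            cand.discard(i)
            dists = [(np.linalg.norm(X[i] - X[j]), j) for j in cand]
            refined.append([j for _, j in heapq.nsmallest(k, dists)])
        neighbors = refined
    return neighbors                            # neighbors[i]: k nearest points to i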
1. For the case of an unweighted network, let $y_i$ and $y_j$ denote two points in the low-dimensional space; the probability that the two points have a binary edge $e_{ij}$ (an edge with weight 1) in the kNN graph is

$$P(e_{ij}=1)=f\bigl(\|y_i-y_j\|^2\bigr)$$

where $f(\cdot)$ takes a heavy-tailed form similar to the t distribution used in t-SNE, e.g.

$$f(x)=\frac{1}{1+x}$$

The smaller the distance between $y_i$ and $y_j$, the higher the probability that the two points have a binary edge in the kNN graph; conversely, the larger the distance between $y_i$ and $y_j$, the smaller that probability.

2. For the case of a weighted network, the probability that the edge has weight $w_{ij}$ is defined as

$$P(e_{ij}=w_{ij})=P(e_{ij}=1)^{w_{ij}}$$

The overall optimization goal is to maximize the probability that positive-sample node pairs have connecting edges in the kNN graph and to minimize the probability that negative-sample node pairs have connecting edges, where $\gamma$ is a weight assigned uniformly to the negative-sample edges. Taking logarithms, the optimization objective becomes

$$O=\sum_{(i,j)\in E} w_{ij}\log P(e_{ij}=1)+\sum_{(i,j)\in \bar{E}} \gamma\,\log\bigl(1-P(e_{ij}=1)\bigr)$$

The negative-sample term

$$\sum_{(i,j)\in \bar{E}} \gamma\,\log\bigl(1-P(e_{ij}=1)\bigr)$$

sums over all negative examples and is too expensive to compute directly, so it is handled with a negative sampling algorithm. For each point $i$, $M$ points are randomly selected according to a noise distribution $P_n(j)$ to form negative samples with $i$; the noise distribution takes a form similar to that used by Mikolov et al., i.e.

$$P_n(j)\propto d_j^{0.75}$$

where $d_j$ is the degree of point $j$. The objective function can then be redefined as

$$O=\sum_{(i,j)\in E} w_{ij}\Bigl(\log P(e_{ij}=1)+\sum_{k=1}^{M}\mathbb{E}_{j_k\sim P_n(j)}\bigl[\gamma\,\log\bigl(1-P(e_{ij_k}=1)\bigr)\bigr]\Bigr)$$
In this embodiment, LargeVis is likewise trained with asynchronous stochastic gradient descent after the negative sampling and edge sampling optimizations. This technique is very effective on sparse graphs, because the edges sampled by different threads rarely share nodes and there is little conflict between threads. In terms of time complexity, each stochastic gradient step costs O(sM), where M is the number of negative samples and s is the dimension of the low-dimensional space (2 or 3), and the number of stochastic gradient steps is generally proportional to the number of nodes N. Thus, the total time complexity is O(sMN), and it can be concluded that the time complexity of LargeVis is linear in the number of nodes in the network.
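A single-threaded sketch of one such stochastic gradient step, assuming the illustrative kernel f(x) = 1/(1 + x) written above; the asynchronous multi-threaded scheduling, edge sampling and learning-rate decay of the full method are omitted, and the names and default values are illustrative:

import numpy as np

def sgd_edge_step(Y, i, j, neg_nodes, gamma=7.0, lr=1.0):
    # one ascent step for the positive edge (i, j) and its sampled negatives,
    # with P(e_ij = 1) = 1 / (1 + ||y_i - y_j||^2); the edge weight w_ij is omitted
    d = Y[i] - Y[j]
    grad_i = -2.0 * d / (1.0 + d @ d)      # attraction: pulls y_i towards y_j
    Y[j] -= lr * grad_i                    # dO/dy_j is the negative of this term
    for m in neg_nodes:                    # repulsion from M sampled negative nodes
        d = Y[i] - Y[m]
        sq = max(d @ d, 1e-8)              # guard against division by zero
        g = 2.0 * gamma * d / (sq * (1.0 + sq))
        grad_i += g
        Y[m] -= lr * g
    Y[i] += lr * grad_i                    # gradient ascent on the log-objective
    return Y

In the full method, positive edges (i, j) are sampled with probability proportional to w_ij and this step is executed asynchronously by several threads, which is what keeps the overall cost at O(sMN).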
Further, in a random forest visualization stage based on the LargeVis, a distribution diagram of low-dimensional data is drawn.
In this embodiment, given a data set, feature extraction is performed on the original ultrasonic data to obtain data that has not yet been reduced in dimension; the result is still a high-dimensional data space, and low-dimensional data for visualization is then obtained through the LargeVis manifold-learning algorithm, so that the overall behaviour of the data can be observed.
In this embodiment, the algorithm process mainly includes the steps of:
inputting: data set { a1,a2,…an}, random forest parameters ntree, mtry
And (3) outputting: distribution map of low dimensional spatial data
In this embodiment, as shown in fig. 1, the specific process is as follows:
1. initialization
2. Read-in feature matrix
3. A space partition is obtained with random projection trees, and the k neighbors of each point are searched on this basis; then a neighbor search algorithm is used to find potential neighbors, the distances between these candidate neighbors and the current point are calculated and placed into a min-heap, and the k nodes with the smallest distances are taken as the k neighbors, finally giving an accurate kNN graph.
4. For (i in 1:k)
4.1. Train with asynchronous stochastic gradient descent
4.2. The time complexity is linear in the number of nodes in the network
5. The computed local optimum gives the low-dimensional space representation of the data, and the distribution map of the low-dimensional data is drawn
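A minimal sketch of steps 4-5, drawing the distribution map of the low-dimensional embedding and feeding it to the random forest for predictive analysis (assuming matplotlib and scikit-learn; the variable names, ntree and mtry defaults are illustrative):

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

def visualize_and_predict(Y_train, y_train, Y_test, ntree=500, mtry="sqrt"):
    # Y_train / Y_test are the 2-D LargeVis embeddings of the selected features
    plt.scatter(Y_train[:, 0], Y_train[:, 1], c=y_train, s=5, cmap="tab10")
    plt.title("Distribution map of the low-dimensional space data")
    plt.savefig("largevis_embedding.png", dpi=150)
    # predictive analysis on the reduced data with a random forest
    rf = RandomForestClassifier(n_estimators=ntree, max_features=mtry)
    rf.fit(Y_train, y_train)
    return rf.predict(Y_test)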
The above are preferred embodiments of the present invention, and all changes made according to the technical scheme of the present invention that produce functional effects do not exceed the scope of the technical scheme of the present invention belong to the protection scope of the present invention.

Claims (6)

1. A method for analyzing random forest visual data based on LargeVis is characterized by comprising the following steps:
step S1: preprocessing a training data set;
step S2: extracting, through a random forest, the sample features in the training data set whose importance weight is greater than a preset weight threshold;
in step S2, the method further includes the steps of:
step S21: preliminary estimation and sorting;
step S211: sorting the feature variables in the random forest in descending order of VI (variable importance);
step S212: determining a deletion proportion, and removing the 20% of feature variables whose importance is smallest (below the preset weight threshold) from the currently descending-sorted feature variables, thereby obtaining a new feature set;
step S213: establishing a new random forest with the new feature set, calculating the VI of each feature in the feature set, and sorting again;
step S214: repeating the above steps until m features remain;
step S22: calculating the corresponding out-of-bag error rate for each feature set obtained in step S21 and the random forest built from it, and taking the feature set with the lowest out-of-bag error rate as the finally selected feature set;
step S3: performing dimensionality reduction treatment by adopting LargeVis;
in step S3, according to the result obtained in step S2, a space partition is obtained through random projection trees, and on this basis the K nearest neighbors of each sample point are searched to obtain a preliminary kNN graph; then, based on the idea that a neighbor of a neighbor is likely also a neighbor, a neighbor search algorithm is used to find potential neighbors, the distances between these candidate neighbors and the current point are calculated and placed into a min-heap, and the k nodes with the smallest distances are taken as the k neighbors, giving the final kNN graph;
step S4: performing visualization processing of the random forest based on LargeVis.
2. The method as claimed in claim 1, wherein in step S1, the SMOTE method is used to handle data imbalance, and data outliers are handled by replacing them with the median or with a value that does not appear in the data.
3. The method for analyzing random forest visual data based on LargeVis according to claim 1,
for an unweighted network, let $y_i$ and $y_j$ denote two points in the low-dimensional space; the probability that the two points have a binary edge $e_{ij}$ in the kNN graph is

$$P(e_{ij}=1)=f\bigl(\|y_i-y_j\|^2\bigr)$$

where $f(\cdot)$ takes a heavy-tailed form similar to the t distribution used in t-SNE, e.g.

$$f(x)=\frac{1}{1+x}$$

the smaller the distance between $y_i$ and $y_j$, the higher the probability that the two points have a binary edge in the kNN graph; conversely, the larger the distance between $y_i$ and $y_j$, the smaller that probability;

for a weighted network, the probability that the edge has weight $w_{ij}$ is

$$P(e_{ij}=w_{ij})=P(e_{ij}=1)^{w_{ij}}$$

the overall optimization objective is to maximize the probability that positive-sample node pairs have connecting edges in the kNN graph and to minimize the probability that negative-sample node pairs have connecting edges in the kNN graph; denoting by $\gamma$ the weight assigned to the negative-sample edges and taking logarithms, the optimization objective becomes

$$O=\sum_{(i,j)\in E} w_{ij}\log P(e_{ij}=1)+\sum_{(i,j)\in \bar{E}} \gamma\,\log\bigl(1-P(e_{ij}=1)\bigr)$$

for each point $i$, $M$ points are randomly selected according to a noise distribution $P_n(j)$ to form negative samples with $i$, the noise distribution adopted being

$$P_n(j)\propto d_j^{0.75}$$

where $d_j$ is the degree of point $j$; the objective function is then

$$O=\sum_{(i,j)\in E} w_{ij}\Bigl(\log P(e_{ij}=1)+\sum_{k=1}^{M}\mathbb{E}_{j_k\sim P_n(j)}\bigl[\gamma\,\log\bigl(1-P(e_{ij_k}=1)\bigr)\bigr]\Bigr)$$
4. The method as claimed in claim 3, wherein the training is performed by asynchronous stochastic gradient descent after the negative sampling and edge sampling optimizations are completed.
5. A method for analyzing random forest visual data based on LargeVis as claimed in claim 3, wherein the time complexity of LargeVis is linear in the number of nodes in the network.
6. The method for analyzing random forest visual data based on LargeVis as claimed in claim 1, wherein in step S4, a distribution map of low dimensional data is drawn according to the obtained low dimensional spatial data.
CN201810170150.5A 2018-02-28 2018-02-28 Method for analyzing random forest visual data based on LargeVis Active CN108280236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810170150.5A CN108280236B (en) 2018-02-28 2018-02-28 Method for analyzing random forest visual data based on LargeVis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810170150.5A CN108280236B (en) 2018-02-28 2018-02-28 Method for analyzing random forest visual data based on LargeVis

Publications (2)

Publication Number Publication Date
CN108280236A CN108280236A (en) 2018-07-13
CN108280236B true CN108280236B (en) 2022-03-15

Family

ID=62808852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810170150.5A Active CN108280236B (en) 2018-02-28 2018-02-28 Method for analyzing random forest visual data based on LargeVis

Country Status (1)

Country Link
CN (1) CN108280236B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110491121B (en) * 2019-07-26 2022-04-05 同济大学 Heterogeneous traffic accident cause analysis method and equipment
CN111458145A (en) * 2020-03-30 2020-07-28 南京机电职业技术学院 Cable car rolling bearing fault diagnosis method based on road map characteristics
CN111783840A (en) * 2020-06-09 2020-10-16 苏宁金融科技(南京)有限公司 Visualization method and device for random forest model and storage medium
CN111815209A (en) * 2020-09-10 2020-10-23 上海冰鉴信息科技有限公司 Data dimension reduction method and device applied to wind control model
CN113792610B (en) * 2020-11-26 2024-05-31 上海智能制造功能平台有限公司 Health assessment method and device for harmonic reducer
CN112397146B (en) * 2020-12-02 2021-08-24 广东美格基因科技有限公司 Microbial omics data interaction analysis system based on cloud platform
CN113537281B (en) * 2021-05-26 2024-03-19 山东大学 Dimension reduction method for performing visual comparison on multiple high-dimension data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106955097A (en) * 2017-03-31 2017-07-18 福州大学 A kind of fetal heart frequency state classification method
CN107169284A (en) * 2017-05-12 2017-09-15 北京理工大学 A kind of biomedical determinant attribute system of selection
CN107301331A (en) * 2017-07-20 2017-10-27 北京大学 A kind of method for digging of the sickness influence factor based on microarray data
CN107395590A (en) * 2017-07-19 2017-11-24 福州大学 A kind of intrusion detection method classified based on PCA and random forest
CN107607723A (en) * 2017-08-02 2018-01-19 兰州交通大学 A kind of protein-protein interaction assay method based on accidental projection Ensemble classifier

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9740957B2 (en) * 2014-08-29 2017-08-22 Definiens Ag Learning pixel visual context from object characteristics to generate rich semantic images

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106955097A (en) * 2017-03-31 2017-07-18 福州大学 A kind of fetal heart frequency state classification method
CN107169284A (en) * 2017-05-12 2017-09-15 北京理工大学 A kind of biomedical determinant attribute system of selection
CN107395590A (en) * 2017-07-19 2017-11-24 福州大学 A kind of intrusion detection method classified based on PCA and random forest
CN107301331A (en) * 2017-07-20 2017-10-27 北京大学 A kind of method for digging of the sickness influence factor based on microarray data
CN107607723A (en) * 2017-08-02 2018-01-19 兰州交通大学 A kind of protein-protein interaction assay method based on accidental projection Ensemble classifier

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jian Tang et al., "Visualizing Large-scale and High-dimensional Data", WWW '16: Proceedings of the 25th International Conference on World Wide Web, April 2016, pp. 287-297 *

Also Published As

Publication number Publication date
CN108280236A (en) 2018-07-13

Similar Documents

Publication Publication Date Title
CN108280236B (en) Method for analyzing random forest visual data based on LargeVis
CN103559504B (en) Image target category identification method and device
CN107229757B (en) Video retrieval method based on deep learning and Hash coding
Fogel et al. Clustering-driven deep embedding with pairwise constraints
CN106648654A (en) Data sensing-based Spark configuration parameter automatic optimization method
CN109753589A (en) A kind of figure method for visualizing based on figure convolutional network
Kpotufe et al. A tree-based regressor that adapts to intrinsic dimension
CN103942571B (en) Graphic image sorting method based on genetic programming algorithm
CN102799614B (en) Image search method based on space symbiosis of visual words
CN111125469B (en) User clustering method and device of social network and computer equipment
Li et al. Feature statistics guided efficient filter pruning
CN106886569A (en) A kind of ML KNN multi-tag Chinese Text Categorizations based on MPI
Nayini et al. A novel threshold-based clustering method to solve K-means weaknesses
CN114299362A (en) Small sample image classification method based on k-means clustering
Ahmed et al. Branchconnect: Image categorization with learned branch connections
Dhoot et al. Efficient Dimensionality Reduction for Big Data Using Clustering Technique
Kouzani et al. Face classification by a random forest
Purnawansyah et al. K-Means clustering implementation in network traffic activities
CN112465054B (en) FCN-based multivariate time series data classification method
Zhang et al. Color clustering using self-organizing maps
JP6230501B2 (en) Reduced feature generation apparatus, information processing apparatus, method, and program
Jia A study on the improvement of K-means algorithm based on community discovery
Pullissery et al. Application of Feature Selection Methods for Improving Classifcation Accuracy and Run-Time: A Comparison of Performance on Real-World Datasets
Nonaka et al. Graph-based Deep Learning Analysis and Instance Selection
Mikhailov An indexing-based approach to pattern and video clip recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant