CN108280236A - Random forest visualization data analysis method based on LargeVis - Google Patents

Random forest visualization data analysis method based on LargeVis

Info

Publication number
CN108280236A
Authority
CN
China
Prior art keywords
random forest
largevis
data
feature
analysing method
Prior art date
Legal status
Granted
Application number
CN201810170150.5A
Other languages
Chinese (zh)
Other versions
CN108280236B (en)
Inventor
黄立勤
陈宋
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201810170150.5A priority Critical patent/CN108280236B/en
Publication of CN108280236A publication Critical patent/CN108280236A/en
Application granted granted Critical
Publication of CN108280236B publication Critical patent/CN108280236B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/26 - Visual data mining; Browsing structured data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to a LargeVis-based random forest visualization data analysis method, comprising: preprocessing the training data set; extracting the important features of the training data set with a random forest; performing dimensionality reduction with LargeVis; and performing visualization with the random forest based on LargeVis. For high-dimensional data, the invention uses the feature importances learned by a random forest to form a new, reduced high-dimensional data set, then reduces the dimensionality of these data with LargeVis and feeds the result into a random forest for predictive analysis and visualization. The method improves classification accuracy, shortens the time needed to produce the visualization, and adapts to different data sets.

Description

Random forest visualization data analysis method based on LargeVis
Technical field
The present invention relates to pattern recognition, machine learning and big data analysis, and in particular to a LargeVis-based random forest visualization data analysis method.
Background technology
In the era of big data, the dimensionality of data features grows ever higher, and analyzing data through some form of dimensionality reduction becomes all the more important; at the same time, visualizing high-dimensional data is a research focus in the current environment. At present, the most classical dimensionality reduction method is PCA (Principal Component Analysis), which not only reduces the dimensionality of high-dimensional data but, more importantly, removes noise through the reduction and reveals the patterns in the data. PCA replaces the n original features with a smaller number m of new features; the new features are linear combinations of the old ones, chosen so as to maximize the sample variance and to make the m new features as orthogonal to each other as possible. The mapping from the old features to the new features captures the intrinsic variability of the data. Later, researchers proposed manifold learning and expanded the research on visualization. The main manifold learning algorithms, i.e. nonlinear dimensionality reduction algorithms, are ISOMap (isometric mapping), LE (Laplacian eigenmaps) and LLE (locally linear embedding). Manifold learning assumes that the data are sampled from some manifold. ISOMap is a non-iterative global optimization algorithm: it modifies MDS (Multidimensional Scaling) by using the geodesic distance (the distance along the manifold) instead of the original Euclidean distance between two points in space, so that data lying on a manifold in a high-dimensional space can be mapped into a Euclidean space. ISOMap connects the data points into a neighbourhood graph to discretely approximate the original manifold, and the geodesic distance is then approximated by the shortest path on the graph. On this basis, Maaten recently published a further improvement of the t-SNE algorithm that uses various tree-based algorithms and contains two parts: first, a kNN graph is used to represent the similarity between points in the high-dimensional space; second, the solution of the gradient is optimized by splitting the gradient into an attractive part and a repulsive part, together with some further optimization tricks. From the above it can be seen that the various dimensionality reduction algorithms can reduce the number of predictor variables and provide an interpretable framework for the final result.
At present, the t-SNE algorithm is widely applied in manifold learning, but it has the following drawbacks. When processing large-scale high-dimensional data, the efficiency of t-SNE drops significantly (including its improved variants). The parameters of t-SNE are sensitive to the data set: after tuning the parameters on one data set and obtaining a good visualization effect, one finds that they cannot be applied to another data set, and a great deal of time must be spent searching for suitable parameters, which is a severe limitation for the overall classification model. Simply reducing the dimensionality of the raw high-dimensional data and feeding it directly into a model for training and classification yields relatively low accuracy and requires more training time. In addition, current data dimensionality reduction methods essentially perform the reduction on the raw data and classify with an existing model, which may suffer from low accuracy and from the reduced data lacking interpretability.
The present invention proposes a LargeVis-based random forest visualization data analysis algorithm: for high-dimensional data, the feature importances learned by a random forest are used to form a new, reduced high-dimensional data set; the data are then reduced in dimension by LargeVis and fed into a random forest for predictive analysis and visualization. The present invention therefore provides a new solution to the problems of feature extraction and visualization for fetal heart rate classification.
Invention content
The purpose of the present invention is to provide a LargeVis-based random forest visualization data analysis method that overcomes the defects of the prior art.
To achieve the above object, the technical scheme of the present invention is a LargeVis-based random forest visualization data analysis method realized according to the following steps:
Step S1: preprocess the training data set;
Step S2: use a random forest to extract the sample features whose importance in the training data set exceeds a preset importance threshold;
Step S3: perform dimensionality reduction with LargeVis;
Step S4: perform visualization with the random forest based on LargeVis.
In an embodiment of the present invention, in step S1, class imbalance in the data is handled with the SMOTE method, and outliers are handled by replacing them with the median or with values that do not occur in the data.
In an embodiment of the present invention, step S2 further comprises the following steps:
Step S21: preliminary estimation and ranking;
Step S211: sort the feature variables in the random forest in descending order of VI;
Step S212: determine a deletion ratio: remove from the currently sorted feature variables the 20% whose importance is below the preset importance threshold, obtaining a new feature set;
Step S213: build a new random forest with the new feature set, compute the VI of every feature in the feature set, and sort them;
Step S214: repeat the above steps until m features remain;
Step S22: for each feature set obtained in step S21 and the random forest built from it, compute the corresponding out-of-bag error rate, and take the feature set with the smallest out-of-bag error as the finally selected feature set.
In an embodiment of the present invention, in step S3, based on the result obtained in step S2, a space partition is first obtained with a random projection tree, on the basis of which the k nearest neighbours of each sample point are found, giving a preliminary k-nearest-neighbour graph; then, following the idea that a neighbour of a neighbour is likely also to be a neighbour, a neighbour-exploring algorithm searches for potential neighbours, computes the distances from the current point to its neighbours and to its neighbours' neighbours, pushes them into a min-heap, and takes the k nodes with the smallest distances as the k nearest neighbours, yielding the final kNN graph.
In an embodiment of the present invention, for an unweighted network, let y_i and y_j denote two points in the low-dimensional space; the probability that the two points have a binary edge e_ij in the kNN graph is:

P(e_{ij}=1) = f(\lVert y_i - y_j \rVert)

where f(·) is similar to the Student-t distribution used in t-SNE, e.g. f(x) = 1/(1+x^2). If the distance between y_i and y_j is smaller, the probability that the two points have a binary edge in the kNN graph is larger; conversely, if the distance between y_i and y_j is larger, the probability that the two points have a binary edge in the kNN graph is smaller.
For a weighted network, the probability that the edge weight equals w_ij is:

P(e_{ij}=w_{ij}) = P(e_{ij}=1)^{w_{ij}}

The overall optimization objective is to maximize the probability that node pairs in the positive samples have a connecting edge in the kNN graph and to minimize the probability that node pairs in the negative samples have a connecting edge; denoting by γ the weight assigned to the negative edges and taking the logarithm, the optimization objective becomes:

O = \sum_{(i,j) \in E} w_{ij} \log P(e_{ij}=1) + \sum_{(i,j) \in \bar{E}} \gamma \log(1 - P(e_{ij}=1))

For each point i, M points are randomly selected according to a noise distribution P_n(j) to form negative samples with i; the noise distribution is P_n(j) ∝ d_j^{0.75}, where d_j is the degree of node j, and the objective function becomes:

O = \sum_{(i,j) \in E} w_{ij} \Big( \log P(e_{ij}=1) + \sum_{k=1}^{M} \mathbb{E}_{j_k \sim P_n(j)} \, \gamma \log(1 - P(e_{ij_k}=1)) \Big)
In an embodiment of the present invention, after the negative sampling and edge sampling optimizations are completed, training is performed with asynchronous stochastic gradient descent.
In an embodiment of the present invention, the time complexity of LargeVis is linear in the number of nodes in the network.
In an embodiment of the present invention, in step S4, a distribution map of the low-dimensional data is drawn from the obtained low-dimensional space data.
Compared with the prior art, the present invention has the following advantages:
(1) The present invention uses a LargeVis-based method, which firstly improves the running speed and secondly adapts well to different data sets, effectively improving the performance of the overall model.
(2) The present invention uses the interpretable random forest model: a round of feature extraction is first performed on the data to remove unnecessary features and keep the important ones, forming new feature samples; dimensionality reduction is then performed and the reduced data are fed into a random forest for classification. On the one hand this improves the performance of the overall model; on the other hand the reduced data can be visualized, which is more intuitive and more interpretable for the user.
(3) The present invention uses only two basic models, yet it can realize classification, visualization, dimensionality reduction, data preprocessing and feature extraction, and is therefore more widely applicable than other algorithms.
Description of the drawings
Fig. 1 is a flow chart of the LargeVis-based random forest visualization data analysis method of the present invention.
Specific implementation mode
The technical scheme of the present invention is described in detail below with reference to the accompanying drawings.
The LargeVis-based random forest visualization data analysis method of the present invention is realized according to the following steps:
Step S1: preprocess the training data set;
Step S2: use a random forest to extract the sample features whose importance in the training data set exceeds a preset importance threshold;
Step S3: perform dimensionality reduction with LargeVis;
Step S4: perform visualization with the random forest based on LargeVis.
In the present embodiment, the problems of imbalanced data samples and outliers may appear in practical applications and lead to poor classification results. An imbalanced training data set causes many problems in pattern recognition: for example, the classifier tends to "learn" the class with the largest proportion of samples, that is, it achieves its highest accuracy by biasing its predictions toward the over-represented class. In practical applications this bias is unacceptable. To obtain a more uniform distribution of the sample data, this embodiment uses the synthetic minority oversampling technique to create "synthetic" examples for each minority class from the few available samples, and uses the median to handle outliers.
In the present embodiment, the synthetic minority oversampling algorithm is as follows:
1. For each sample x in the minority class, compute its distance to every sample in the minority class sample set D using the Euclidean distance, and obtain its k nearest neighbours.
2. Set an oversampling multiplier N according to the class imbalance ratio; for each minority-class sample x, randomly choose several samples from its k nearest neighbours; let a chosen neighbour be y.
3. For each randomly chosen neighbour y, construct a new sample from the original sample according to the following formula:
new sample = x + rand(0, 1) * |x - y|
In the present embodiment, the SMOTE method is therefore used in the preprocessing stage to address the problems caused by an imbalanced training data set. Outliers also frequently occur in the data and cause the accuracy of the trained model to deviate; therefore, in the present embodiment outliers are replaced with the median or with values that do not occur in the data.
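For illustration only, the following Python sketch implements the preprocessing described above. The synthetic-sample formula follows the text; the outlier-detection rule (a z-score test), the number of synthetic samples per minority point, and the use of scikit-learn's NearestNeighbors are assumptions not specified in the patent.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, k=5, n_new_per_sample=1, rng=None):
    # Generate synthetic minority samples: new = x + rand(0,1) * |x - y|,
    # where y is one of x's k nearest minority-class neighbours (step 3 above).
    rng = rng or np.random.default_rng(0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1: each point is its own neighbour
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for i, x in enumerate(X_min):
        for _ in range(n_new_per_sample):
            y = X_min[rng.choice(idx[i][1:])]              # skip the point itself at position 0
            synthetic.append(x + rng.random(x.shape) * np.abs(x - y))
    return np.vstack(synthetic)

def replace_outliers_with_median(X, z=3.0):
    # Replace outliers with the column median; the z-score rule used to flag
    # outliers is an assumed choice, only the replacement value comes from the text.
    X = X.copy()
    median = np.median(X, axis=0)
    mean, std = X.mean(axis=0), X.std(axis=0) + 1e-12
    mask = np.abs(X - mean) > z * std
    X[mask] = np.broadcast_to(median, X.shape)[mask]
    return X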
Further, in the random-forest important-feature extraction stage, that is, after the random forest has been trained, the sample features are re-ranked by their importance and the features with higher importance are extracted; this stage comprises the following steps (a minimal code sketch is given after the steps):
1. Preliminary estimation and ranking:
a) Sort the feature variables in the random forest in descending order of VI (Variable Importance).
b) Determine a deletion ratio: remove from the currently sorted feature variables the 20% whose importance is below the preset importance threshold, obtaining a new feature set.
c) Build a new random forest with the new feature set, compute the VI of every feature in the feature set, and sort them.
d) Repeat the above steps until m features remain. The value of m is determined by the overall model, preferably by the feature set with the lowest error rate.
2. For each feature set obtained in step 1 and the random forest built from it, compute the corresponding out-of-bag error rate (OOB error), and take the feature set with the smallest out-of-bag error as the finally selected feature set.
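A minimal sketch of this recursive, importance-based elimination, assuming scikit-learn's RandomForestClassifier with its feature_importances_ and oob_score_ attributes; the fixed 20% deletion ratio follows the text, while the stopping size and the forest hyperparameters are illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_features_by_oob(X, y, drop_frac=0.20, min_features=2, seed=0):
    # Iteratively drop the least-important fraction of features, rebuild the
    # forest each round, and keep the feature set with the lowest OOB error.
    features = np.arange(X.shape[1])
    candidates = []                                          # (oob_error, feature indices)
    while len(features) >= min_features:
        rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                    random_state=seed).fit(X[:, features], y)
        candidates.append((1.0 - rf.oob_score_, features.copy()))
        order = np.argsort(rf.feature_importances_)[::-1]    # descending VI
        keep = order[: max(min_features, int(len(order) * (1 - drop_frac)))]
        if len(keep) == len(features):                       # nothing more to drop
            break
        features = features[keep]
    best_error, best_features = min(candidates, key=lambda t: t[0])
    return best_features, best_error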
In the present embodiment, if the high-dimensional data were fed directly into the dimensionality reduction model, the data processing time would be long and too many parameters would have to be computed, which may degrade performance. Instead, the weighted voting of the many decision trees in the random forest is used to extract a new feature set, the resulting data samples are reduced in dimension, and classification is performed with a random forest, which improves the accuracy and speed of the overall model.
Further, in the LargeVis dimensionality reduction stage:
Input: the data samples with the new features selected by the random forest in step S2.
First, a space partition is obtained with random projection trees, on the basis of which the k nearest neighbours of each sample point are found, giving a preliminary k-nearest-neighbour (kNN) graph that is not required to be fully accurate. Then, following the idea that a neighbour of a neighbour is likely also to be a neighbour, a neighbour-exploring algorithm searches for potential neighbours: the distances from the current point to its neighbours and to its neighbours' neighbours are computed and pushed into a min-heap, and the k nodes with the smallest distances are taken as the k nearest neighbours, finally yielding an accurate kNN graph.
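As an illustration, the sketch below builds such a kNN graph using exact nearest-neighbour search from scikit-learn as a stand-in for the random-projection-tree search and min-heap refinement described above; the Gaussian conversion of distances to edge weights is an assumption borrowed from the t-SNE/LargeVis setting rather than a detail stated in the patent.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_knn_graph(X, k=10):
    # Return (neighbour indices, edge weights) of a kNN graph over the samples X.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]                 # drop the self-neighbour
    sigma = dist.mean(axis=1, keepdims=True) + 1e-12    # per-point bandwidth (assumed choice)
    weights = np.exp(-dist**2 / (2.0 * sigma**2))       # similarity weights w_ij
    return idx, weights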
1. For the case of an unweighted network, let y_i and y_j denote two points in the low-dimensional space; the probability that the two points have a binary edge e_ij (an edge with weight 1) in the kNN graph is:

P(e_{ij}=1) = f(\lVert y_i - y_j \rVert)

where f(·) is similar to the Student-t distribution used in t-SNE, e.g. f(x) = 1/(1+x^2). If the distance between y_i and y_j is smaller, the probability that the two points have a binary edge in the kNN graph is larger; conversely, if the distance between y_i and y_j is larger, the probability that the two points have a binary edge in the kNN graph is smaller.
2. For the case of a weighted network, the probability that the edge weight equals w_ij is defined as:

P(e_{ij}=w_{ij}) = P(e_{ij}=1)^{w_{ij}}

The overall optimization objective is to maximize the probability that node pairs in the positive samples have a connecting edge in the kNN graph and to minimize the probability that node pairs in the negative samples have a connecting edge, where γ is the unified weight assigned to the negative edges. Taking the logarithm, the optimization objective becomes:

O = \sum_{(i,j) \in E} w_{ij} \log P(e_{ij}=1) + \sum_{(i,j) \in \bar{E}} \gamma \log(1 - P(e_{ij}=1))

Using all the negative samples in the formula above is computationally too expensive, so a negative sampling algorithm is used for the solution. For each point i, M points are randomly selected according to a noise distribution P_n(j) to form negative samples with i; the noise distribution takes a form similar to the one used by Mikolov et al., i.e. P_n(j) ∝ d_j^{0.75}, where d_j is the degree of node j. The objective function can then be redefined as:

O = \sum_{(i,j) \in E} w_{ij} \Big( \log P(e_{ij}=1) + \sum_{k=1}^{M} \mathbb{E}_{j_k \sim P_n(j)} \, \gamma \log(1 - P(e_{ij_k}=1)) \Big)
In the present embodiment, after the negative sampling and edge sampling optimizations, LargeVis is trained with asynchronous stochastic gradient descent. This technique is very effective on sparse graphs, because the two nodes connected by edges sampled in different threads rarely coincide, so conflicts between threads hardly ever arise. In terms of time complexity, each step of stochastic gradient descent costs O(sM), where M is the number of negative samples and s is the dimensionality of the low-dimensional space (2 or 3), and the number of gradient steps is proportional to the number of nodes N; the total time complexity is therefore O(sMN). It follows that the time complexity of LargeVis is linear in the number of nodes in the network.
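The following sketch illustrates a single-threaded, synchronous version of one training epoch on the objective above, with f(x) = 1/(1+x^2); it is a simplified stand-in for the asynchronous multi-threaded optimizer used by LargeVis, and the learning rate, the default γ and the uniform epoch length are illustrative assumptions.

import numpy as np

def largevis_sgd_epoch(Y, edges, weights, degrees, M=5, gamma=7.0, lr=0.5, rng=None):
    # One epoch of edge-sampled stochastic gradient ascent on the LargeVis
    # objective with negative sampling. Y is the (n, s) low-dimensional
    # embedding (updated in place); edges is an (E, 2) array of node pairs,
    # weights an (E,) array of w_ij, degrees an (n,) array of node degrees.
    rng = rng or np.random.default_rng(0)
    p_neg = degrees ** 0.75                      # noise distribution P_n(j) ∝ d_j^0.75
    p_neg = p_neg / p_neg.sum()
    p_edge = weights / weights.sum()             # sample edges in proportion to w_ij
    for e in rng.choice(len(edges), size=len(edges), p=p_edge):
        i, j = edges[e]
        u = Y[i] - Y[j]
        d2 = u @ u
        g_attr = -2.0 * u / (1.0 + d2)           # gradient of log f(||y_i - y_j||) w.r.t. y_i
        Y[i] += lr * g_attr                      # attractive update (gradient ascent)
        Y[j] -= lr * g_attr
        for n in rng.choice(len(degrees), size=M, p=p_neg):   # M negative samples
            if n == i or n == j:
                continue
            v = Y[i] - Y[n]
            dn2 = v @ v + 1e-8
            g_rep = 2.0 * gamma * v / (dn2 * (1.0 + dn2))     # gradient of γ log(1 - f)
            Y[i] += lr * g_rep                   # repulsive update
            Y[n] -= lr * g_rep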
Further, in the LargeVis-based random forest visualization stage, the distribution map of the low-dimensional data is drawn.
In the present embodiment, given a data set, feature extraction is first carried out on the raw ultrasound data to obtain un-reduced data, which still form a high-dimensional data space; the LargeVis manifold learning algorithm then yields low-dimensional space data that can be visualized, so that the behaviour of the overall data can be observed.
In the present embodiment, the algorithm mainly comprises the following steps:
Input: data set {a_1, a_2, ..., a_n}, random forest parameters n_tree and m_try
Output: the distribution map of the low-dimensional space data
In the present embodiment, as shown in Fig. 1, the detailed process is as follows (a plotting sketch for the final step is given after the list):
1. Initialize.
2. Read in the feature matrix.
3. Obtain a space partition with random projection trees and find the k nearest neighbours of each point on this basis; then use the neighbour-exploring algorithm to find potential neighbours, compute the distances from the current point to its neighbours and to its neighbours' neighbours, push them into a min-heap, and take the k nodes with the smallest distances as the k nearest neighbours, finally obtaining an accurate kNN graph.
4. For (i in 1:k):
4.1. Train with asynchronous stochastic gradient descent.
4.2. The time complexity is linear in the number of nodes in the network.
5. From the computed local optimum, obtain the low-dimensional representation of the data and draw the distribution map of the low-dimensional data.
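To illustrate step 5, a minimal plotting sketch, assuming a two-dimensional embedding Y produced by the LargeVis stage, a selected-feature matrix X_selected and class labels y_labels; the variable names and the choice of colouring points by the random forest's predictions are illustrative.

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

def plot_low_dim_distribution(Y, X_selected, y_labels):
    # Draw the distribution map of the low-dimensional data, coloured by the
    # predictions of a random forest trained on the selected features.
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_selected, y_labels)
    predictions = rf.predict(X_selected)
    plt.figure(figsize=(6, 5))
    plt.scatter(Y[:, 0], Y[:, 1], c=predictions, s=8, cmap="tab10")
    plt.xlabel("LargeVis dimension 1")
    plt.ylabel("LargeVis dimension 2")
    plt.title("Distribution map of the low-dimensional data")
    plt.colorbar(label="predicted class")
    plt.tight_layout()
    plt.show()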
The above are preferred embodiments of the present invention; any modification made according to the technical solution of the present invention, whose resulting function and scope do not go beyond the technical solution of the present invention, falls within the protection scope of the present invention.

Claims (8)

1. A LargeVis-based random forest visualization data analysis method, characterized in that it is realized according to the following steps:
Step S1: preprocess the training data set;
Step S2: use a random forest to extract the sample features whose importance in the training data set exceeds a preset importance threshold;
Step S3: perform dimensionality reduction with LargeVis;
Step S4: perform visualization with the random forest based on LargeVis.
2. The LargeVis-based random forest visualization data analysis method according to claim 1, characterized in that, in step S1, class imbalance in the data is handled with the SMOTE method, and outliers are handled by replacing them with the median or with values that do not occur in the data.
3. The LargeVis-based random forest visualization data analysis method according to claim 1, characterized in that step S2 further comprises the following steps:
Step S21: preliminary estimation and ranking;
Step S211: sort the feature variables in the random forest in descending order of VI;
Step S212: determine a deletion ratio: remove from the currently sorted feature variables the 20% whose importance is below the preset importance threshold, obtaining a new feature set;
Step S213: build a new random forest with the new feature set, compute the VI of every feature in the feature set, and sort them;
Step S214: repeat the above steps until m features remain;
Step S22: for each feature set obtained in step S21 and the random forest built from it, compute the corresponding out-of-bag error rate, and take the feature set with the smallest out-of-bag error as the finally selected feature set.
4. The LargeVis-based random forest visualization data analysis method according to claim 1, characterized in that, in step S3, based on the result obtained in step S2, a space partition is obtained with a random projection tree, on the basis of which the k nearest neighbours of each sample point are found, giving a preliminary k-nearest-neighbour graph; then, following the idea that a neighbour of a neighbour is likely also to be a neighbour, a neighbour-exploring algorithm searches for potential neighbours, computes the distances from the current point to its neighbours and to its neighbours' neighbours, pushes them into a min-heap, and takes the k nodes with the smallest distances as the k nearest neighbours, giving the final kNN graph.
5. The LargeVis-based random forest visualization data analysis method according to claim 4, characterized in that,
for an unweighted network, let y_i and y_j denote two points in the low-dimensional space; the probability that the two points have a binary edge e_ij in the kNN graph is:

P(e_{ij}=1) = f(\lVert y_i - y_j \rVert)

where f(·) is similar to the Student-t distribution used in t-SNE, e.g. f(x) = 1/(1+x^2); if the distance between y_i and y_j is smaller, the probability that the two points have a binary edge in the kNN graph is larger; conversely, if the distance between y_i and y_j is larger, the probability that the two points have a binary edge in the kNN graph is smaller;
for a weighted network, the probability that the edge weight equals w_ij is:

P(e_{ij}=w_{ij}) = P(e_{ij}=1)^{w_{ij}}

the overall optimization objective is to maximize the probability that node pairs in the positive samples have a connecting edge in the kNN graph and to minimize the probability that node pairs in the negative samples have a connecting edge; denoting by γ the weight assigned to the negative edges and taking the logarithm, the optimization objective becomes:

O = \sum_{(i,j) \in E} w_{ij} \log P(e_{ij}=1) + \sum_{(i,j) \in \bar{E}} \gamma \log(1 - P(e_{ij}=1))

for each point i, M points are randomly selected according to a noise distribution P_n(j) to form negative samples with i, the noise distribution being P_n(j) ∝ d_j^{0.75}, where d_j is the degree of node j; the objective function then becomes:

O = \sum_{(i,j) \in E} w_{ij} \Big( \log P(e_{ij}=1) + \sum_{k=1}^{M} \mathbb{E}_{j_k \sim P_n(j)} \, \gamma \log(1 - P(e_{ij_k}=1)) \Big)
6. The LargeVis-based random forest visualization data analysis method according to claim 5, characterized in that, after the negative sampling and edge sampling optimizations are completed, training is performed with asynchronous stochastic gradient descent.
7. The LargeVis-based random forest visualization data analysis method according to claim 5, characterized in that the time complexity of LargeVis is linear in the number of nodes in the network.
8. The LargeVis-based random forest visualization data analysis method according to claim 1, characterized in that, in step S4, a distribution map of the low-dimensional data is drawn from the obtained low-dimensional space data.
CN201810170150.5A 2018-02-28 2018-02-28 Method for analyzing random forest visual data based on LargeVis Active CN108280236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810170150.5A CN108280236B (en) 2018-02-28 2018-02-28 Method for analyzing random forest visual data based on LargeVis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810170150.5A CN108280236B (en) 2018-02-28 2018-02-28 Method for analyzing random forest visual data based on LargeVis

Publications (2)

Publication Number Publication Date
CN108280236A true CN108280236A (en) 2018-07-13
CN108280236B CN108280236B (en) 2022-03-15

Family

ID=62808852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810170150.5A Active CN108280236B (en) 2018-02-28 2018-02-28 Method for analyzing random forest visual data based on LargeVis

Country Status (1)

Country Link
CN (1) CN108280236B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160063308A1 (en) * 2014-08-29 2016-03-03 Definiens Ag Learning Pixel Visual Context from Object Characteristics to Generate Rich Semantic Images
CN106955097A (en) * 2017-03-31 2017-07-18 福州大学 A kind of fetal heart frequency state classification method
CN107169284A (en) * 2017-05-12 2017-09-15 北京理工大学 A kind of biomedical determinant attribute system of selection
CN107395590A (en) * 2017-07-19 2017-11-24 福州大学 A kind of intrusion detection method classified based on PCA and random forest
CN107301331A (en) * 2017-07-20 2017-10-27 北京大学 A kind of method for digging of the sickness influence factor based on microarray data
CN107607723A (en) * 2017-08-02 2018-01-19 兰州交通大学 A kind of protein-protein interaction assay method based on accidental projection Ensemble classifier

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIAN TANG, et al.: "Visualizing Large-scale and High-dimensional Data", WWW '16: Proceedings of the 25th International Conference on World Wide Web, April 2016 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110491121A (en) * 2019-07-26 2019-11-22 同济大学 A kind of heterogeneity traffic accident causation analysis method and apparatus
CN110491121B (en) * 2019-07-26 2022-04-05 同济大学 Heterogeneous traffic accident cause analysis method and equipment
CN111458145A (en) * 2020-03-30 2020-07-28 南京机电职业技术学院 Cable car rolling bearing fault diagnosis method based on road map characteristics
CN111783840A (en) * 2020-06-09 2020-10-16 苏宁金融科技(南京)有限公司 Visualization method and device for random forest model and storage medium
CN111815209A (en) * 2020-09-10 2020-10-23 上海冰鉴信息科技有限公司 Data dimension reduction method and device applied to wind control model
CN113792610A (en) * 2020-11-26 2021-12-14 上海智能制造功能平台有限公司 Harmonic reducer health assessment method and device
CN113792610B (en) * 2020-11-26 2024-05-31 上海智能制造功能平台有限公司 Health assessment method and device for harmonic reducer
CN112397146A (en) * 2020-12-02 2021-02-23 广东美格基因科技有限公司 Microbial omics data interaction analysis system based on cloud platform
CN112397146B (en) * 2020-12-02 2021-08-24 广东美格基因科技有限公司 Microbial omics data interaction analysis system based on cloud platform
CN113537281A (en) * 2021-05-26 2021-10-22 山东大学 Dimension reduction method for carrying out visual comparison on multiple high-dimensional data
CN113537281B (en) * 2021-05-26 2024-03-19 山东大学 Dimension reduction method for performing visual comparison on multiple high-dimension data

Also Published As

Publication number Publication date
CN108280236B (en) 2022-03-15

Similar Documents

Publication Publication Date Title
CN108280236A (en) A kind of random forest visualization data analysing method based on LargeVis
CN106096727B (en) A kind of network model building method and device based on machine learning
CN103559504B (en) Image target category identification method and device
CN109615014B (en) KL divergence optimization-based 3D object data classification system and method
CN107292350A (en) The method for detecting abnormality of large-scale data
CN110135494A (en) Feature selection method based on maximum information coefficient and Gini index
CN107066555B (en) On-line theme detection method for professional field
CN108280472A (en) A kind of density peak clustering method optimized based on local density and cluster centre
CN107832456B (en) Parallel KNN text classification method based on critical value data division
CN112861752B (en) DCGAN and RDN-based crop disease identification method and system
CN110598061A (en) Multi-element graph fused heterogeneous information network embedding method
CN111209939A (en) SVM classification prediction method with intelligent parameter optimization module
CN111259964A (en) Over-sampling method for unbalanced data set
CN111814979B (en) Fuzzy set automatic dividing method based on dynamic programming
Ibrahim et al. On feature selection methods for accurate classification and analysis of emphysema ct images
CN116759067A (en) Liver disease diagnosis method based on reconstruction and Tabular data
CN116030231A (en) Multistage classification BIM model intelligent light-weight processing method
CN113537339B (en) Method and system for identifying symbiotic or associated minerals based on multi-label image classification
CN109871894A (en) A kind of Method of Data Discretization of combination forest optimization and rough set
CN115017988A (en) Competitive clustering method for state anomaly diagnosis
Ma The Research of Stock Predictive Model based on the Combination of CART and DBSCAN
CN114626485A (en) Data tag classification method and device based on improved KNN algorithm
Mishra et al. Efficient intelligent framework for selection of initial cluster centers
CN108932550B (en) Method for classifying images based on fuzzy dense sparse dense algorithm
CN112308160A (en) K-means clustering artificial intelligence optimization algorithm

Legal Events

PB01 - Publication
SE01 - Entry into force of request for substantive examination
GR01 - Patent grant