CN107844461A - Gaussian process regression calculation method based on the generalized N-body problem - Google Patents

Gaussian process regression calculation method based on the generalized N-body problem

Info

Publication number
CN107844461A
CN107844461A
Authority
CN
China
Prior art keywords
kernel function
data set
nodes
tree
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710966946.7A
Other languages
Chinese (zh)
Inventor
何克晶
李智博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201710966946.7A priority Critical patent/CN107844461A/en
Publication of CN107844461A publication Critical patent/CN107844461A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a Gaussian process regression calculation method based on the generalized N-body problem, comprising: a data set partitioning method based on dual kd-trees, a dual kd-tree traversal pruning method, a kernel function matrix solving method based on high-order divide-and-conquer, and the Cholesky decomposition algorithm. The main steps are: spatially partition the input data set with kd-trees, then obtain and store simplified Euclidean distances with the dual kd-tree traversal pruning method; from the simplified Euclidean distances and their indices, use a squared exponential kernel function to obtain the kernel function matrix K* between the training data set and the test data set, and the kernel function matrix K of the training data set itself; quickly obtain the inverse matrix K^-1 of the kernel function matrix K with the Cholesky decomposition algorithm; finally, from the inverse matrix K^-1, the kernel function matrix K*, and the input objective function values, the prediction result vector can be obtained. The method improves the prediction efficiency of Gaussian process regression and its ability to process big data, promoting the wide application of Gaussian process regression in big data analysis.

Description

Gaussian process regression calculation method based on generalized N-body problem
Technical Field
The invention relates to big-data regression technology, in particular to a Gaussian process regression calculation method based on a general solution method of the generalized N-body problem, intended to improve the efficiency of big data processing and analysis.
Background
In the big data era, the collection, access, management, analysis and utilization of massive data have become a global research and application hotspot, and big data analysis is an important component of this research and application. Gaussian process regression is a machine learning algorithm developed from Bayesian theory and statistical learning theory, suitable for high-dimensional nonlinear regression problems. Compared with other classical big data analysis algorithms such as SVMs and neural networks, Gaussian process regression has the advantages of flexible nonparametric inference and probabilistically meaningful output. However, the general calculation method of Gaussian process regression suffers from a large computational load, the high-dimensional data trap, and similar drawbacks, and cannot adapt well to the massive high-dimensional data of the big data era. The generalized N-body problem is the problem of computing relations such as distances, kernels, and similarities between point pairs; it encompasses Gaussian process regression, N-point correlation functions, all-nearest-neighbors, nonparametric Bayesian classification, kernel density estimation, and related problems, and is widely applied in big data analysis. The high-order divide-and-conquer algorithm based on a space-partitioning data structure is a general calculation method for the generalized N-body problem. On this basis, the general solution method of the generalized N-body problem is improved and applied to the calculation of Gaussian process regression, overcoming the limitations of low computational efficiency and weak big data processing capability of the general Gaussian process regression algorithm, and promoting the wide application of Gaussian process regression in big data analysis.
Disclosure of Invention
In order to solve the problems, the invention provides a Gaussian process regression calculation method based on the generalized N-body problem, which can accelerate the model fitting and prediction efficiency and improve the big data adaptability of Gaussian process regression. In order to achieve the above purpose, the invention adopts the following technical scheme:
the invention discloses a Gaussian process regression calculation method based on a generalized N-body problem, which is characterized in that the Gaussian process regression calculation is cooperatively realized by utilizing a data set partition method based on a double-kd tree, a traversal pruning method based on the double-kd tree, a kernel function matrix solution based on high-order partition and a Cholesky decomposition algorithm;
the formula for the Gaussian process regression prediction is:

f* = K*^T (K + σ_n^2 I)^(-1) y

where K*^T is the transpose of the kernel function matrix K* between the training data set and the test data set, K is the kernel function matrix of the training data set itself, σ_n^2 is the noise variance, I represents the identity matrix, y is the input training objective function value vector, and f* is the predicted target value vector;
the method specifically comprises the following steps:
performing spatial division on the input data set with the dual kd-tree data set partitioning method, then obtaining and storing the simplified Euclidean distances with the dual kd-tree traversal pruning method; from the simplified Euclidean distances and related indices, using the squared exponential kernel function to obtain the kernel function matrix K* between the training data set and the test data set, and the kernel function matrix K of the training data set itself; quickly obtaining the inverse matrix K^-1 of the kernel function matrix K with the Cholesky decomposition algorithm; finally, obtaining the prediction result vector from the inverse matrix K^-1, the kernel function matrix K*, and the input objective function values;
the data set partitioning method based on the double-kd tree is used for spatially partitioning data, and in the calculation of Gaussian process regression, for a training data set R and a testing data set Q, the kd tree T of each R and Q is respectively constructed according to the same rule R 、T Q (ii) a Traversing, calculating and pruning the two kd trees simultaneously to finally obtain a kernel function matrix;
the traversal pruning method of the double-kd tree is used for traversal, calculation and pruning of the double-kd tree; respectively calculating the distance between the corresponding nodes of the two kd-trees from the respective root nodes of the two kd-trees; setting a threshold value epsilon, when the distance between the nodes obtained by calculation is larger than the value epsilon, considering the distance between the two nodes to be infinite according to the calculation property of the kernel function, pruning the two nodes, and not recursing the two nodes and all child nodes; for the nodes with the distance value smaller than the epsilon, continuing the recursive computation until all the nodes are recurred and each leaf node is computed at least once;
the kernel function matrix solving method based on the high-order division is used for solving a kernel function matrix; according to the calculation result of the double kd-tree, the value of a kernel function matrix is obtained; the kernel function is a function defining the similarity or distance between data points, so that a corresponding kernel function value is calculated according to the distance x-x' between two data points;
the Cholesky decomposition algorithm is used for inverting the kernel function matrix; the fast inversion of the kernel function matrix by using Cholesky decomposition is adopted, and for the matrix inversion process, the Cholesky decomposition algorithm is used for accelerating the inversion efficiency and accelerating the prediction efficiency of Gaussian process regression.
As a preferred technical scheme, a general solution method of the generalized N-body problem, namely the high-order divide-and-conquer algorithm based on a space-partitioning data structure, is used to respectively solve the kernel function matrix K of the training data set itself and the kernel function matrix K* between the training data set and the test data set.
As a preferred technical scheme, in the data set partitioning method based on the dual kd-tree, the same partitioning rule is applied to spatially partition the training data set and the test data set: the whole data set serves as the root node of its kd-tree; at each step the points are sorted by their value in one dimension, the partitioned object is equally divided into two parts which become the left and right child nodes of the kd-tree, and the child nodes are then partitioned in the same way; this partitioning process recurses until the leaf nodes can no longer be divided.
As a preferred technical scheme, in the traversal pruning method of the dual kd-tree, the simplified Euclidean distance between two nodes is calculated during traversal, and whether the two nodes should be pruned is judged from the distance value; the simplified Euclidean distance is defined as the Euclidean distance without the square-root operation:

dist(X, X*) = ||X - X*||^2
As a preferred technical scheme, in the dual kd-tree traversal pruning method with the simplified Euclidean distance: if the calculated simplified Euclidean distance between two nodes is smaller than the set threshold ε, the recursive search continues downward until both nodes are leaf nodes, at which point the simplified Euclidean distances between all pairs of data points in the two leaf nodes are calculated pair by pair and stored; if the calculated simplified Euclidean distance between two nodes is larger than or equal to the set threshold ε, the two nodes are pruned and none of their child nodes are recursed into. As a preferred technical scheme, during the traversal pruning of the dual kd-tree, all nodes are traversed with a depth-first strategy starting from the root node.
As a preferred technical scheme, in the kernel function matrix solving method based on high-order divide-and-conquer, the squared exponential kernel function

k(x, x') = exp(-(x - x')^2 / (2 l^2))

is selected as the kernel function of the Gaussian process regression, where l represents a bandwidth parameter.
As a preferred technical scheme, in the squared exponential kernel function, the value of (x - x')^2 comes from the aforementioned simplified Euclidean distance; when calculating the kernel function matrix, the corresponding simplified Euclidean distance value is read and the kernel function matrix is obtained through the exponential operation.
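As a sketch of this step (the function name and the bandwidth default are illustrative assumptions, not the patent's code), the kernel value can be evaluated directly from a stored squared distance, with no square root ever taken:

```python
import math

def se_kernel_from_sqdist(sq_dist, l=1.0):
    # Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2)),
    # evaluated from a precomputed *squared* Euclidean distance
    # (the "simplified" distance), so the square root is never needed.
    return math.exp(-sq_dist / (2.0 * l * l))
```

At zero distance the kernel is exactly 1 and it decays toward 0 as the squared distance grows, which is why pruned (effectively infinite-distance) pairs can later be filled with a constant.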
As a preferred technical scheme, in the Cholesky decomposition algorithm, fast inversion of the kernel function matrix via Cholesky decomposition is applied: for the matrix inversion part of the method, the inverse of the matrix is obtained with the Cholesky decomposition algorithm, accelerating the operation efficiency of the whole process; under the same conditions, inversion via Cholesky decomposition is about 2 times faster than inversion via LU decomposition.
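A minimal NumPy sketch of this preferred scheme (an illustration under the assumption that NumPy is available; note that in practice one solves linear systems through the Cholesky factor rather than forming the explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
K = A @ A.T + 5.0 * np.eye(5)      # a symmetric positive definite "kernel" matrix
y = rng.standard_normal(5)

L = np.linalg.cholesky(K)          # K = L L^T, with L lower triangular
# Solving K x = y via two triangular solves is equivalent to x = K^{-1} y,
# but cheaper and numerically more stable than explicitly inverting K.
x = np.linalg.solve(L.T, np.linalg.solve(L, y))
```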
As a preferred technical scheme, for the kernel function matrix K of the training data set itself and the kernel function matrix K* between the training data set and the test data set, the processes of dual kd-tree construction, traversal pruning, and calculation of simplified Euclidean distances and kernel function values are carried out twice, once for each matrix.
Compared with the prior art, the invention has the following advantages and effects:
the invention adopts the technical scheme of combining the high-order divide-and-conquer algorithm based on the space division data structure with the Gaussian process regression calculation method, and the high-order divide-and-conquer algorithm based on the space division data structure is a general calculation method for the generalized N-body problem. Based on the method, the general solution method of the generalized N-body problem is improved and applied to the calculation of the Gaussian process regression, so that the limitations of low calculation efficiency and low big data processing capability of the general algorithm of the Gaussian process regression are overcome, and the wide application of the Gaussian process regression in big data analysis is promoted.
Drawings
FIG. 1 is a diagram of an initial situation in which data is partitioned using a kd-Tree.
FIG. 2 is an exemplary diagram of 10 two-dimensional data points partitioned using a kd-Tree.
Fig. 3 is a diagram of a binary tree structure partitioned in fig. 2.
Fig. 4 is a flow chart of the dual tree algorithm therein.
FIG. 5 is an operation diagram of a two-tree algorithm recursive to two nodes.
FIG. 6 is an operation diagram of the dual-tree algorithm when recursion occurs to the child nodes of the node.
FIG. 7 is a flow chart of a Gaussian process regression calculation method based on the generalized N-body problem.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the embodiments of the present invention are not limited thereto.
Examples
Gaussian process regression is a nonparametric regression method with the advantages of flexible nonparametric inference and probabilistically meaningful output. However, its general calculation method suffers from a large computational load, the high-dimensional data trap, and similar drawbacks, and is therefore limited when analyzing big data. The generalized N-body problem is the problem of computing relations such as distances, kernels, and similarities between point pairs, and the high-order divide-and-conquer algorithm based on a space-partitioning data structure is a general solution method for it. Gaussian process regression belongs to the generalized N-body problems, so its kernel function matrix calculation can be optimized with this general solution method, improving the prediction efficiency of Gaussian process regression; fig. 7 shows the flowchart of the Gaussian process regression calculation method based on the generalized N-body problem. The kd-tree is a space-partitioning data structure that is easy to implement and efficient. The invention uses the kd-tree as the space-partitioning data structure, partitions the training data set and the test data set, and traverses and prunes the two trees to obtain the kernel function matrix. A predicted value vector is then calculated from the computed kernel function matrix and the input training set targets.
The kd-tree, i.e. the k-dimensional tree, is a binary tree whose nodes store k-dimensional data. As shown in fig. 1, the kd-tree bisects a whole set of data points into two parts according to a distance measure on the data points. A kd-tree built on a data set of dimension k is a partition of the k-dimensional space spanned by the data set; any node in the tree corresponds to a k-dimensional hyper-rectangular region. FIG. 2 is an example of partitioning 10 two-dimensional data points with a kd-tree, and FIG. 3 is the binary tree hierarchy of the partition in FIG. 2. Binary search trees are commonly used for interval searching over one-dimensional data; since the kd-tree is structurally similar to the binary search tree, their search rules are closely related. In the invention, a depth-first strategy is applied to traverse each node of the kd-tree.
For the training data set and the test data set, respective kd-trees are constructed according to the same rule. In the construction process, a whole data set serves as the root of its kd-tree; at each step the component of the current node's data points along one dimension is used as the comparison key, the points are equally divided into two parts forming the left and right child nodes, and the newly created child nodes are partitioned recursively until the leaf nodes can no longer be divided. If T(n) is the time to construct the tree on n data points, the construction satisfies the recurrence

T(n) = 2 T(n/2) + O(n)

Solving this recurrence gives O(n log n), i.e. the time complexity of constructing a kd-tree over data points occupying O(n) storage space is O(n log n).
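The construction rule above can be sketched in Python (a hypothetical illustration, not the patent's implementation; the `KDNode` class and the `leaf_size` parameter are assumptions):

```python
# Minimal sketch of the kd-tree construction rule described above:
# at each level, sort by one coordinate, split at the median, and
# recurse until a node can no longer be divided.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class KDNode:
    points: List[Tuple[float, ...]]        # all points under this node
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None

def build_kd_tree(points, depth=0, leaf_size=1):
    if len(points) <= leaf_size:
        return KDNode(list(points))        # leaf: no further split
    axis = depth % len(points[0])          # cycle through the k dimensions
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                    # equal split into two halves
    return KDNode(pts,
                  build_kd_tree(pts[:mid], depth + 1, leaf_size),
                  build_kd_tree(pts[mid:], depth + 1, leaf_size))
```

Since each level sorts and splits the points it receives, the construction cost follows the T(n) = 2T(n/2) + O(n log n)-style recurrence discussed above (a plain O(n) median selection would match the stated bound exactly).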
After the kd-trees are constructed, the dual-tree pruning and traversal can be carried out. Fig. 4 is a flowchart of the dual-tree algorithm. Taking the calculation of the kernel function matrix between the training data set and the test data set as an example, let the kd-tree built on the training data set be T_R and the kd-tree built on the test data set be T_Q. As shown in fig. 5, starting from the root nodes of the two trees, the simplified Euclidean distance between two nodes is calculated:
dist(X, X*) = ||X - X*||^2
This reduces, to some degree, the influence of the curse of dimensionality on computing the distance between two data points, and at the same time omits the square-root operation, saving a considerable amount of computation without losing accuracy in distance comparisons. Starting from the root nodes, a depth-first search strategy recurses to lower nodes, as shown in fig. 6: if the simplified Euclidean distance between two nodes is less than ε, recursion continues to the lower nodes; if the simplified Euclidean distance between the two nodes is greater than or equal to ε, the pair of nodes is pruned. Here ε is a user-set parameter that controls the degree of pruning. Let N_q be a node in the kd-tree of the test data set and N_r a node in the kd-tree of the training data set. During the traversal pruning of the two trees, the following four cases arise:
(1) The distance between the two nodes is larger than the set ε: the two nodes are pruned, and the nodes below them are not recursed into;
(2) Both nodes are leaf nodes: the simplified Euclidean distances between all point pairs in the two nodes are calculated and stored, and the algorithm does not recurse further;
(3) N_q is a leaf node and N_r is an intermediate node: recurse on the N_r node;
(4) N_q is an intermediate node and N_r is a leaf node: recurse on the N_q node.
In cases (3) and (4), the pseudocode of the dual-tree traversal algorithm is as follows (N_rc1 and N_rc2 denote the children of N_r, and N_qc1 and N_qc2 the children of N_q):

{case 3: N_q is a leaf node and N_r is an intermediate node}
if Distance(N_q, N_rc1) < Distance(N_q, N_rc2)
    DualTreeTraversal(N_q, N_rc1)
    DualTreeTraversal(N_q, N_rc2)
else
    DualTreeTraversal(N_q, N_rc2)
    DualTreeTraversal(N_q, N_rc1)

{case 4: N_q is an intermediate node and N_r is a leaf node}
if Distance(N_qc1, N_r) < Distance(N_qc2, N_r)
    DualTreeTraversal(N_qc1, N_r)
    DualTreeTraversal(N_qc2, N_r)
else
    DualTreeTraversal(N_qc2, N_r)
    DualTreeTraversal(N_qc1, N_r)
The Distance() function calculates the simplified Euclidean distance between two nodes, and the DualTreeTraversal() function performs the recursive traversal. The simplified Euclidean distance values obtained from the dual-tree traversal and pruning, together with their indices, are stored in a heap heapK and serve as input parameters for the subsequent kernel function matrix calculation.
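The whole traversal, including the ε pruning rule and the leaf-leaf base case, can be sketched as runnable Python. This is an illustrative reconstruction, not the patent's code: the `Node`/`build` helpers are hypothetical, and the node-to-node distance is taken here to be the minimum squared distance between bounding boxes, one plausible reading of the "distance between nodes".

```python
import itertools

class Node:
    def __init__(self, points, left=None, right=None):
        self.points, self.left, self.right = points, left, right

def build(points, depth=0, leaf_size=2):
    # Same median-split construction as the kd-tree described earlier.
    if len(points) <= leaf_size:
        return Node(points)
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(pts, build(pts[:mid], depth + 1, leaf_size),
                     build(pts[mid:], depth + 1, leaf_size))

def sq_dist(a, b):
    # "Simplified" Euclidean distance: squared norm, no square root.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def node_sq_dist(na, nb):
    # Minimum squared distance between the bounding boxes of two nodes
    # (an assumption: the patent does not pin down the node distance).
    d = 0.0
    for dim in range(len(na.points[0])):
        lo1 = min(p[dim] for p in na.points); hi1 = max(p[dim] for p in na.points)
        lo2 = min(p[dim] for p in nb.points); hi2 = max(p[dim] for p in nb.points)
        gap = max(lo2 - hi1, lo1 - hi2, 0.0)
        d += gap * gap
    return d

def dual_tree_traversal(nq, nr, eps, heap_k):
    if node_sq_dist(nq, nr) >= eps:
        return                                  # case (1): prune, "infinitely far"
    if nq.left is None and nr.left is None:     # case (2): leaf-leaf
        for q, r in itertools.product(nq.points, nr.points):
            heap_k[(tuple(q), tuple(r))] = sq_dist(q, r)
    elif nq.left is None:                       # case (3): recurse on N_r
        dual_tree_traversal(nq, nr.left, eps, heap_k)
        dual_tree_traversal(nq, nr.right, eps, heap_k)
    elif nr.left is None:                       # case (4): recurse on N_q
        dual_tree_traversal(nq.left, nr, eps, heap_k)
        dual_tree_traversal(nq.right, nr, eps, heap_k)
    else:                                       # both internal: recurse on both
        for a in (nq.left, nq.right):
            for b in (nr.left, nr.right):
                dual_tree_traversal(a, b, eps, heap_k)
```

With ε set to infinity no pair is ever pruned and heap_k holds all |Q|·|R| squared distances; a smaller ε drops far-apart node pairs, whose kernel values are later filled with the "infinite distance" constant θ.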
When all nodes have been visited and each leaf node has been computed at least once, the kernel function matrix is calculated from the obtained heapK. The covariance value corresponding to each index can be obtained from the definition of the squared exponential kernel function and the simplified Euclidean distance. The following is the algorithmic pseudocode to compute the kernel function matrix:

covK ← θ
for m in range(X)
    for i in range(k)
        covK[i][index] ← kernel[m][i]
In the above algorithm, the kernel function matrix covK is first initialized to θ. The value of θ equals the value of the covariance function when the distance between two data points is infinite. For pruned nodes, the distance between the relevant data point pairs can be considered infinite, i.e. the covariance function value between those data points is θ. Next, for each simplified Euclidean distance value in heapK, the corresponding kernel function value is computed and written to the corresponding position of covK according to its index. kernel stands for the operation of the kernel function:
kernel ← exp(-0.5 × dist_from_heapK / l^2)
where l is a bandwidth parameter and the default value is 1.0.
According to the method above, two kernel function matrices are calculated: the kernel function matrix K* between the training data set and the test data set, and the kernel function matrix K of the training data set itself. Since the kd-tree of the training data set is used in both calculations, in actual practice each kd-tree needs to be constructed only once.
Consider the case of no noise, i.e. σ_n^2 = 0, in which the prediction formula reduces to f* = K*^T K^-1 y.
For the kernel function matrix K of the training data set itself, its inverse matrix needs to be solved. The invention applies the Cholesky algorithm to solve the inverse matrix, accelerating the computation. The Cholesky decomposition factors a positive definite matrix into the product of a lower triangular matrix and its conjugate transpose, and its computational efficiency is about 2 times that of LU decomposition.
From the kernel function matrix K* between the training data set and the test data set and the kernel function matrix K of the training data set itself, the prediction result vector is obtained. The following is the algorithmic pseudocode of the model fitting and prediction sections.
L ← cholesky(K)
α ← L^T \ (L \ y)
f* ← K*^T α

In the pseudocode, L is the lower triangular matrix obtained by Cholesky decomposition of the matrix K, α is the product of the inverse of matrix K and the target y, and f* = K*^T α is the final prediction result. If the variance of the prediction is also required, it can be calculated from the following algorithm pseudocode:
υ ← L \ K*
V[f*] ← K** - υ^T υ

where K** denotes the kernel function matrix of the test data set with itself.
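Putting the fitting and prediction pseudocode together, here is a NumPy sketch of the standard Cholesky-based GP prediction (following Rasmussen and Williams' Algorithm 2.1; the function name and the noise handling are illustrative assumptions, not the patent's code):

```python
import numpy as np

def gp_predict(K, K_star, K_starstar, y, sigma_n=0.0):
    # K:          n x n kernel matrix of the training set itself
    # K_star:     n x m kernel matrix between training and test sets
    # K_starstar: m x m kernel matrix of the test set itself
    n = K.shape[0]
    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(n))   # K + sigma^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = (K + sigma^2 I)^{-1} y
    mean = K_star.T @ alpha                              # predictive mean f*
    v = np.linalg.solve(L, K_star)                       # v = L \ K*
    var = np.diag(K_starstar) - np.sum(v * v, axis=0)    # predictive variance V[f*]
    return mean, var
```

At the training inputs themselves, with negligible noise, the predictive mean should reproduce the training targets and the predictive variance should be near zero, which makes a convenient sanity check.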
FIG. 7 is a flowchart of a Gaussian process regression calculation method based on generalized N-body problem, i.e. a flowchart of the whole calculation process of the present invention.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such modifications are intended to be included in the scope of the present invention.

Claims (10)

1. A Gaussian process regression calculation method based on a generalized N-body problem is characterized in that the method utilizes a data set partition method based on a double-kd tree, a traversal pruning method of the double-kd tree, a kernel function matrix solution based on high-order partition and a Cholesky decomposition algorithm to cooperatively realize the Gaussian process regression calculation;
the formula for the Gaussian process regression prediction is:

f* = K*^T (K + σ_n^2 I)^(-1) y

where K*^T is the transpose of the kernel function matrix K* between the training data set and the test data set, K is the kernel function matrix of the training data set itself, σ_n^2 is the noise variance, I represents the identity matrix, y is the input training objective function value vector, and f* is the predicted target value vector;
the method specifically comprises the following steps:
performing spatial division on the input data set with the dual kd-tree data set partitioning method, then obtaining and storing the simplified Euclidean distances with the dual kd-tree traversal pruning method; from the simplified Euclidean distances and related indices, using the squared exponential kernel function to obtain the kernel function matrix K* between the training data set and the test data set, and the kernel function matrix K of the training data set itself; quickly obtaining the inverse matrix K^-1 of the kernel function matrix K with the Cholesky decomposition algorithm; finally, obtaining the prediction result vector from the inverse matrix K^-1, the kernel function matrix K*, and the input objective function values;
the data set partitioning method based on the dual kd-tree is used for spatially partitioning the data; in the calculation of Gaussian process regression, for a training data set R and a test data set Q, the kd-trees T_R and T_Q of R and Q are respectively constructed according to the same rule; by traversing, computing and pruning the two kd-trees simultaneously, the kernel function matrix is finally obtained;
the traversal pruning method of the dual kd-tree is used for traversing, computing, and pruning the two kd-trees; starting from the respective root nodes, the distance between corresponding nodes of the two kd-trees is calculated; a threshold ε is set, and when the calculated distance between two nodes is larger than ε, the distance between them is treated as infinite by the computational properties of the kernel function, the two nodes are pruned, and neither these two nodes nor any of their child nodes are recursed into; for node pairs whose distance is smaller than ε, the recursive computation continues until all nodes have been visited and each leaf node has been computed at least once;
the kernel function matrix solving method based on high-order divide-and-conquer is used for solving the kernel function matrix; the values of the kernel function matrix are obtained from the calculation results of the dual kd-trees; the kernel function is a function defining the similarity or distance between data points, so the corresponding kernel function value is calculated from the distance between two data points x and x';
the Cholesky decomposition algorithm is used for inverting the kernel function matrix; Cholesky decomposition is applied to invert the kernel function matrix quickly, accelerating both the matrix inversion and the overall prediction efficiency of Gaussian process regression.
2. The Gaussian process regression calculation method based on the generalized N-body problem according to claim 1, characterized in that a general solution method of the generalized N-body problem, i.e. the high-order divide-and-conquer algorithm based on a space-partitioning data structure, is applied to respectively solve the kernel function matrix K of the training data set itself and the kernel function matrix K* between the training data set and the test data set.
3. The method for gaussian process regression calculation based on generalized N-body problem as claimed in claim 1, wherein the method is based on dual kd-tree dataset partitioning, the training dataset and the testing dataset are respectively spatially partitioned by applying the same partitioning rule, that is, the whole dataset is used as the root node of each kd-tree, each time the data sets are sorted according to the value of a certain dimension of each data point, the partitioned object is equally divided into two parts which are used as the left and right child nodes of the kd-tree, and then the child nodes are respectively partitioned, and the above partitioning process is recursed until the leaf nodes can not be further partitioned.
4. The Gaussian process regression calculation method based on the generalized N-body problem according to claim 1, characterized in that in the traversal pruning method of the dual kd-tree, the simplified Euclidean distance between two nodes is calculated during traversal, and whether the two nodes should be pruned is judged from the distance value; the simplified Euclidean distance is defined as the Euclidean distance without the square-root operation:

dist(X, X*) = ||X - X*||^2
5. the Gaussian process regression calculation method based on the generalized N-body problem is characterized in that a double-kd tree traversal pruning method of simplified Euclidean distance is used, if the simplified Euclidean distance between two nodes obtained through calculation is smaller than a set threshold epsilon, downward recursive search is continued until the two nodes are leaf nodes, and at the moment, the simplified Euclidean distance of all data points between the two leaf nodes is calculated pair by pair and stored; and if the simplified Euclidean distance between the two nodes obtained by calculation is larger than or equal to a set threshold value epsilon, pruning the two nodes, and not recursively calculating all the following child nodes.
6. The method of claim 4, wherein in the traversal pruning process of the dual kd-tree, each node of each tree is visited starting from the root node using a depth-first strategy.
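The pruning rule of claims 4 to 6 can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: it assumes dict-based nodes that carry their points, and uses an axis-aligned bounding-box lower bound as the "simplified Euclidean distance between two nodes":

```python
import numpy as np

def min_sq_dist(a, b):
    """Lower bound on the squared Euclidean distance between any point
    of node a and any point of node b, via their bounding boxes."""
    lo_a, hi_a = a["points"].min(0), a["points"].max(0)
    lo_b, hi_b = b["points"].min(0), b["points"].max(0)
    gap = np.maximum(0.0, np.maximum(lo_a - hi_b, lo_b - hi_a))
    return float(np.sum(gap ** 2))

def dual_traverse(q, r, eps, out):
    """Depth-first dual-tree traversal: prune a node pair whose minimum
    squared distance is >= eps; at a leaf pair, compute and store all
    pairwise squared distances (no square root is ever taken)."""
    if min_sq_dist(q, r) >= eps:
        return                            # prune: skip all descendants
    if q["leaf"] and r["leaf"]:
        diff = q["points"][:, None, :] - r["points"][None, :, :]
        out.append((diff ** 2).sum(-1))   # simplified Euclidean distances
        return
    qs = [q] if q["leaf"] else [q["left"], q["right"]]
    rs = [r] if r["leaf"] else [r["left"], r["right"]]
    for qc in qs:                         # recurse into child pairs
        for rc in rs:
            dual_traverse(qc, rc, eps, out)
```

With a large ε the traversal degenerates to the full pairwise computation; a tight ε discards far-apart subtrees in one comparison, which is where the speed-up over the naive O(N²) distance computation comes from.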
7. The method of Gaussian process regression calculation based on the generalized N-body problem as claimed in claim 1, wherein in the kernel function matrix solving method based on higher-order divide and conquer, the squared exponential kernel function

k(x, x') = exp(−(x − x')² / (2l²))

is selected as the kernel function of the Gaussian process regression, where l denotes the bandwidth parameter.
8. The method of claim 7, wherein the (x − x')² values are obtained from the foregoing simplified Euclidean distances; when computing the kernel function matrix, the corresponding simplified Euclidean distance value is read, and the kernel function matrix is obtained through an exponential operation.
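A minimal sketch of claims 7 and 8: the squared exponential kernel evaluated directly from precomputed simplified (squared) Euclidean distances, so no distance is ever recomputed. The function name and default bandwidth are assumptions, not from the patent:

```python
import numpy as np

def se_kernel_from_sqdist(sq_dist, l=1.0):
    """Squared exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2)),
    applied elementwise to a scalar or array of precomputed squared
    Euclidean distances; l is the bandwidth parameter."""
    return np.exp(-np.asarray(sq_dist) / (2.0 * l ** 2))
```

Because the stored distances already omit the square root, the kernel needs only one division and one exponential per entry.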
9. The method for Gaussian process regression calculation based on the generalized N-body problem as claimed in claim 1, wherein fast inversion of the kernel function matrix via Cholesky decomposition is applied: for the matrix inversion part, the inverse of the matrix is obtained using the Cholesky decomposition algorithm, which accelerates the whole procedure; under the same conditions, inversion by Cholesky decomposition is about twice as fast as inversion by LU decomposition.
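The Cholesky step of claim 9 can be sketched as two triangular solves in place of an explicit inverse. This is a hypothetical helper (the function name and the small noise jitter added for numerical stability are assumptions, not from the patent):

```python
import numpy as np

def gp_predict_mean(K, K_star, y, noise=1e-6):
    """Predictive mean of GP regression, K_star (K + noise I)^{-1} y,
    computed via a Cholesky factorization K = L L^T instead of an
    explicit matrix inverse."""
    n = K.shape[0]
    L = np.linalg.cholesky(K + noise * np.eye(n))
    # two triangular solves replace the matrix inversion
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return K_star @ alpha
```

Factoring once and reusing `L` for every solve is also what makes repeated predictions cheap: the O(n³) cost is paid once, and each subsequent right-hand side costs only O(n²).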
10. The method for Gaussian process regression calculation based on the generalized N-body problem according to any one of claims 1 to 9, wherein for the kernel function matrix K of the training dataset itself and the kernel function matrix K* between the training dataset and the test dataset, the processes of dual kd-tree construction, traversal pruning, and computation of the simplified Euclidean distances and kernel function values are carried out twice, once for each of the two matrices.
CN201710966946.7A 2017-10-17 2017-10-17 A kind of Gaussian process based on broad sense N body problems returns computational methods Pending CN107844461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710966946.7A CN107844461A (en) 2017-10-17 2017-10-17 A kind of Gaussian process based on broad sense N body problems returns computational methods

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710966946.7A CN107844461A (en) 2017-10-17 2017-10-17 A kind of Gaussian process based on broad sense N body problems returns computational methods

Publications (1)

Publication Number Publication Date
CN107844461A true CN107844461A (en) 2018-03-27

Family

ID=61662266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710966946.7A Pending CN107844461A (en) 2017-10-17 2017-10-17 A kind of Gaussian process based on broad sense N body problems returns computational methods

Country Status (1)

Country Link
CN (1) CN107844461A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308122A (en) * 2020-10-20 2021-02-02 中国刑事警察学院 High-dimensional vector space sample fast searching method and device based on double trees
CN112308122B (en) * 2020-10-20 2024-03-01 中国刑事警察学院 High-dimensional vector space sample rapid searching method and device based on double trees
CN112711013A (en) * 2020-12-14 2021-04-27 中国船舶重工集团公司第七一五研究所 Rapid self-adaptive beam forming algorithm based on block matrix
CN112711013B (en) * 2020-12-14 2022-10-21 中国船舶重工集团公司第七一五研究所 Rapid self-adaptive beam forming method based on block matrix


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180327