CN113221065A - Data density estimation and regression method, corresponding device, electronic device, and medium

Info

Publication number
CN113221065A
Authority
CN
China
Prior art keywords
division
data
regression
density estimation
adaptive
Prior art date
Legal status
Pending
Application number
CN202010525621.7A
Other languages
Chinese (zh)
Inventor
杭汉源
林宙辰
Current Assignee
Beijing Samsung Telecom R&D Center
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Original Assignee
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Samsung Telecommunications Technology Research Co Ltd, Samsung Electronics Co Ltd filed Critical Beijing Samsung Telecommunications Technology Research Co Ltd
Publication of CN113221065A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]


Abstract

The application provides a big data density estimation method, a large-scale regression method, corresponding devices, an electronic device, and a medium. The data density estimation method includes the steps of: generating a plurality of adaptive random divisions of the data set; in each division, constructing a local density estimation model from the samples in each division grid; splicing the local density estimation models together to obtain an overall density estimation model under that random division; and integrating the overall density estimation models under the plurality of divisions.

Description

Data density estimation and regression method, corresponding device, electronic device, and medium
Technical Field
The invention relates to big data analysis in the field of artificial intelligence, and in particular to a big data density estimation method, a large-scale regression method, a big data density estimation device, and a large-scale regression device, each based on adaptive random partitioning and model integration.
Background
With the diversification of lifestyles and the development of the digital information era, the scale and complexity of the big data being generated are growing rapidly; big data arises from the convergence of three main technical trends: massive transaction data, massive interaction data, and massive data processing. Big data is characterized by huge volume, diverse data types, high velocity, and low value density. The analysis of large amounts of big data is therefore complicated and places demands on the speed and efficiency of data analysis.

Big data analysis is the process of extracting hidden, previously unknown, but potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random real-world data; it discovers this knowledge by analyzing the data without imposing explicit assumptions.
Big data analytics typically involve two aspects: density estimation of big data and regression analysis of big data.
Disclosure of Invention
Density estimation and regression analysis are important research directions for unsupervised learning in statistical machine learning; as basic learning tasks they play a key role within many more advanced learning tasks. However, classical density estimation and regression analysis algorithms cannot effectively process data that is both high-dimensional and large in volume. The invention therefore establishes unsupervised machine learning algorithms for density estimation and regression analysis; the algorithms exploit the adaptivity of the partitioning, have higher stability, can be combined with parallel computation to accelerate the running speed, and show good prediction accuracy and high training speed on real large-scale data sets.
According to an aspect of the present invention, there is provided a data density estimation method, including the steps of: generating a plurality of times of self-adaptive random division for the data set; in each division, constructing a local density estimation model by using the samples in each division grid respectively; splicing the local density estimation models together to obtain an overall density estimation model under random division; and integrating the overall density estimation model under the multiple divisions.
Wherein the adaptive random partitioning comprises one of adaptive pure random partitioning and adaptive histogram transformation partitioning.
The local density estimation model takes, for each grid, the ratio of the proportion of samples falling in that grid (its sample count divided by the total number of samples) to the size of the grid.
Model integration takes the combined result of the overall density estimation models under a plurality of random divisions as the final output of the model.

The overall density estimation models under the plurality of random divisions are integrated by averaging.
In the adaptive purely random division, t sample points are randomly drawn before each split, the grid to be divided is chosen as the grid containing the most of these sample points, and the division dimension and cut point are chosen at random.

In the adaptive histogram transform division, before each split the grids containing more than m sample points are selected for division, the dimension to be divided is chosen as the dimension with the largest sample variance, and the cut point is chosen as the median of the data in that dimension; division stops once the number of sample points in every grid is smaller than m.
The data density estimation method further comprises the following steps: rotation, stretch, and translation transformations are randomly performed on the data of the data set prior to partitioning.
The data density estimation method further comprises the following steps: prior to the adaptive stochastic partitioning, the accuracy of the data in the data set is determined.
In the data density estimation method, before the adaptive random division, the extreme values of the data in the data set are judged and accepted or rejected.
In the data density estimation method, the method further includes: before the self-adaptive random division, whether the data in the data set belong to abnormal samples is judged, and when the data in the data set belong to the abnormal samples, the abnormal samples of the data in the data set are screened out.
According to an aspect of the present invention, there is also provided a data density estimation apparatus including: an adaptive division module, which generates a plurality of adaptive random divisions of the data set; a density estimation module, which, in each division, constructs a local density estimation model from the samples in each division grid; an overall density estimation module, which splices the local density estimation models together to obtain an overall density estimation model under one random division; and an overall density estimation model integration module, which integrates the overall density estimation models of the multiple divisions.
According to an aspect of the present invention, there is also provided a data regression method, including the steps of: generating a plurality of times of self-adaptive random division for the data set; obtaining a local regression model on each division grid, and splicing to obtain an overall regression model; and integrating all the integral regression models to obtain an integrated model, and obtaining regression analysis of data.
The self-adaptive random division comprises one of self-adaptive pure random division, self-adaptive histogram transformation division and random self-adaptive polygon division.
The local regression model adopts a support vector machine regression (SVR) or a local average method.
The model integration expression takes the comprehensive result of the integral regression model under a plurality of random divisions as the final result of the model.
Wherein, the integration of the integral regression model under a plurality of random divisions adopts a simple average method or a weighted average method.
In the adaptive purely random division, t sample points are randomly drawn before each split, the grid to be divided is chosen as the grid containing the most of these sample points, and the division dimension and cut point are chosen at random.

In the adaptive histogram transform division, before each split the grids containing more than m sample points are selected for division, the dimension to be divided is chosen as the dimension with the largest sample variance, and the cut point is chosen as the median of the data in that dimension; division stops once the number of sample points in every grid is smaller than m.
The data regression method further comprises the following steps: randomly generated rotational, stretching and translation transformations are performed on the data of the data set prior to partitioning.
The data regression method further comprises: prior to the adaptive stochastic partitioning, the accuracy of the data in the data set is determined.
In the data regression method, before the self-adaptive random division, the extreme values of the data in the data set are judged and accepted or rejected.
According to an aspect of the present invention, there is also provided a data regression apparatus, including: the self-adaptive division module generates a plurality of times of self-adaptive random division on the data set; the local regression module is used for obtaining a local regression model on each division grid; the integral regression module is used for splicing the local regression models to obtain an integral regression model; and the integral regression module integration module is used for integrating all integral regression models to obtain an integrated model and obtain regression analysis of data.
According to an aspect of the present invention, there is also provided an electronic apparatus, including: one or more processors; a memory; one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to: any of the above-described data density estimation methods and data regression methods are performed.
According to an aspect of the present invention, there is also provided a computer-readable storage medium storing at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement any one of the above-mentioned data density estimation method and data regression method.
Compared with the prior art, the method has the characteristics of robustness and suitability for large-scale data.
Density estimation for big data:
When the prior art uses a kernel function method for density estimation, every test point is affected by all sample points, so abnormal points present in the data may reduce the estimation accuracy at the test point. The present method differs in two respects. First, because the data is divided in space, when the data contains abnormal points, the region affected by an abnormal point is essentially the grid region into which that point falls. Second, the subsequent model ensemble learning averages the influence of an abnormal point against the nearby normal points, further reducing its effect. The random forest density estimation model, as the preferred choice, therefore has relatively strong robustness.
In the prior art, the density estimation cannot be effectively carried out on large-scale data due to the large calculation amount of the histogram estimation and the kernel function method. In contrast, the model of the application can achieve the purpose of processing large-scale data by fully utilizing parallel computing resources.
In the present application, the following two-step method is used for space division: firstly, dividing a sample space into a plurality of small blocks, secondly, constructing a random density sub-tree on each small block, and finally, splicing all sub-trees into a density tree on the whole space. Since the adaptive random division is adopted in the method, the two-step division mode enables the subtrees in a single tree to have parallelism without changing the random division structure. In addition, the random forest algorithm can perform parallel computation among trees, so that the problem of density estimation of large-scale data can be solved through simultaneous parallelism in trees and among trees.
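To illustrate the intended in-tree and between-tree parallelism, the sketch below (an assumed arrangement, not the patented implementation) dispatches the per-tree work of the ensemble to separate processes using Python's standard library; build_density_tree is a hypothetical placeholder for "one adaptive random division plus local density models".

from concurrent.futures import ProcessPoolExecutor

import numpy as np


def build_density_tree(data, seed):
    # Hypothetical per-tree routine: generate one adaptive random division of
    # `data` and fit a local density model in every grid; details omitted here.
    return {"seed": seed, "n_samples": len(data)}  # placeholder for a fitted tree


def build_forest(data, n_trees=50, n_workers=4):
    # Each tree is independent, so the trees can be built on separate processes.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(build_density_tree, data, s) for s in range(n_trees)]
        return [f.result() for f in futures]


if __name__ == "__main__":
    forest = build_forest(np.random.default_rng(0).normal(size=(10000, 3)))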
Regression analysis for big data:
in the big data era, with the development of data generation, collection and storage technology, the data scale shows explosive growth, and has important significance for processing and analyzing large-scale data, exploring and disclosing social operation mode and objective rule and promoting scientific and technological development. Many problems in real life can be abstracted into large-scale regression problems, such as voice recognition, audio information retrieval and the like, and the method can also be applied to other large-scale regression tasks, such as an age prediction task in image recognition, position prediction of a 5G terminal, 5G wireless network flow prediction, 5G mobile communication network planning and the like.
In order to solve the problems of low calculation efficiency and insufficient prediction precision of large-scale samples and high-dimensional data in the prior art, a sample set is randomly divided into a plurality of subsets according to a new mode, so that regression models such as mean value regression and support vector machine regression can be applied to each sample subset, the regression models can be well combined with parallel calculation, each subtask in the regression calculation is distributed to a plurality of cores of a computer according to a division grid, the operation time is saved, and the algorithm efficiency is improved. Meanwhile, the invention generates multiple random partitions and integrates regression models under different partitions, thereby solving the problem that the regression models are discontinuous at partition boundaries and improving the accuracy of regression prediction. As shown in fig. 13, the more times of randomly generating histogram transformation division, the stronger the continuity of the obtained integrated model, and the higher the fitting accuracy to the data; and in addition, parallel computing can be combined, subtasks in the parallel computing are respectively distributed to a plurality of cores of the computer, and the prediction accuracy is improved while the running speed is still kept high.
Drawings
The foregoing and/or other aspects will become apparent and more readily appreciated from the following description of the exemplary embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 shows the histogram density estimation in the one-dimensional case.
Fig. 2 shows the kernel density estimation in the one-dimensional case.
Fig. 3(a) shows common kernel function forms and fig. 3(b) shows Gaussian kernel functions at different bandwidths.
Fig. 4 shows a specific partitioning diagram of purely random partitioning.
FIG. 5(a) is the distribution of samples in the original space; fig. 5(b) and 5(c) are schematic diagrams of possible results of adaptive histogram conversion partitioning.
Fig. 6 shows a box plot of extreme value determination.
Fig. 7 shows a schematic of the general data (left) and its regression model (right).
Fig. 8 shows a schematic of the regression of data with a straight line (left) and its mean square error (right).
Fig. 9 shows a polygon division diagram (left), with the water cube being an example (right).
Fig. 10 shows a schematic of the stitching Gaussian process spatial interpolation (left) and Gaussian process regression (right).
FIG. 11 shows a schematic diagram of support vector machine regression.
Fig. 12 shows a flowchart of a density estimation method based on adaptive random partitioning and simple model integration according to a first embodiment of the present invention.
Fig. 13 shows a schematic diagram comparing a conventional pure random partitioning with an adaptive pure random partitioning according to the present invention.
FIG. 14 shows a flow chart of the adaptive pure stochastic partition method employed in step 1207 of the density estimation method based on adaptive stochastic partition and simple model integration according to the first embodiment of the present invention.
FIG. 15 shows a flow chart of the adaptive histogram transform partitioning method employed in step 1207 of the density estimation method based on adaptive stochastic partitioning and simple model integration according to the first embodiment of the present invention.
Fig. 16 shows a flowchart of a density estimation method based on a histogram transformation division and boosting algorithm according to a second embodiment of the present invention.
Fig. 17 shows a block diagram of a density estimation apparatus based on adaptive random partitioning and simple model integration according to a third embodiment of the present invention.
Fig. 18 is a block diagram showing a density estimation apparatus based on a histogram transform division and boosting algorithm according to a fourth embodiment of the present invention.
Fig. 1a shows a block diagram of a probability density clustering method based on a purely random histogram transformation partitioning and integration algorithm according to another embodiment of the present invention.
Fig. 1b shows a block diagram of a probability density clustering method based on a purely random histogram transformation partitioning and boosting algorithm according to another embodiment of the present invention.
Fig. 1c shows a block diagram of a probability density anomaly detection method based on pure random histogram transformation partitioning and random forest according to another embodiment of the present invention.
Fig. 1d shows a block diagram of a probability density clustering method of the probability density anomaly detection method based on K-nearest neighbor and histogram transformation partitioning Bagging algorithm according to another embodiment of the present invention.
Fig. 1e shows a random forest anomaly detection model based on a self-supervised method according to yet another embodiment of the present invention.
Fig. 19 shows a flowchart of a large-scale regression method based on adaptive random partitioning and model integration according to a fifth embodiment of the present invention.
Fig. 19'(a), 19'(b), and 19'(c) illustrate specific examples of a large-scale regression method based on adaptive random partitioning and model integration according to a fifth embodiment of the present invention.
Fig. 20 shows a flowchart of the adaptive polygon partition method adopted in step 1907 of the large-scale regression method based on adaptive random partition and model integration according to the fifth embodiment of the present invention.
Fig. 21 shows a support vector machine employed in step 1911 of the large scale regression method based on adaptive random partitioning and model integration according to the fifth embodiment of the present invention.
FIG. 22 shows the support vector machine regression employed in step 1911 of the large scale regression method based on adaptive random partitioning and model integration according to the fifth embodiment of the present invention.
Fig. 23 is a flowchart illustrating a large-scale regression method based on a histogram transformation partitioning and boosting algorithm according to a sixth embodiment of the present invention.
Fig. 24 is a block diagram illustrating a large-scale regression apparatus based on adaptive random partitioning and model integration according to a seventh embodiment of the present invention.
Fig. 25 shows a block diagram of a large-scale regression device based on a histogram transformation partitioning and boosting algorithm according to an eighth embodiment of the present invention.
Fig. 26 shows a simulation experiment using a stitched gaussian process space interpolation regression and a polygon-divided support vector machine regression on the simulation data.
Fig. 27 shows a simulation experiment in which the continuity gradually increases as the number of random partitions T increases for the support vector machine regression based on random histogram transform partitioning.
Detailed Description
Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.
Density estimation for big data:
the density estimation of big data, namely, the estimation of the distribution density function of random variables based on given samples of big data, the probability density function is one of the core concepts in probability theory and is used for describing the probability distribution obeyed by continuous random variables.
Conventional big data density estimation methods typically employ big data density estimation models, including parametric and non-parametric density estimation models.
In a parametric density estimation model, one assumes that the data distribution follows a certain form, such as linear, linearizable, or exponential, and then searches for a specific solution within that family of objective functions, i.e., determines the unknown parameters of the density function. In parametric discriminant analysis, the randomly valued data samples used as the basis for discrimination are assumed to follow a specific distribution within each possible category. The parametric approach is, however, somewhat paradoxical: the density function to be estimated is completely unknown, yet the method assumes in advance that the data obeys a certain model, which can only be justified by observing the data. Experience and theory show that a large gap often exists between the basic assumptions of a parametric model and the actual physical model, so the method does not always achieve satisfactory results.
Among the nonparametric Density estimation models, there are a Histogram Density estimation (Histogram Density Estimator) model and a Kernel Density estimation (Kernel Density Estimator) model.
The histogram density estimation model is the simplest non-parametric density estimation model. Fig. 1 shows the histogram density estimation in the one-dimensional case. As shown in fig. 1, histogram density estimation first divides the samples into several equal-sized, non-overlapping intervals (or several grids in the higher-dimensional case), and takes, for each interval (or grid), the ratio of the proportion of samples it contains (relative to the total number of samples) to the size of the grid as the estimate of the local density; the abscissa represents the range of x and the ordinate represents the density value at x.
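For concreteness, the following sketch (illustrative bin count and data, not values from the patent) computes a one-dimensional histogram density estimate exactly as described: the fraction of samples in each bin divided by the bin width.

import numpy as np

def histogram_density(samples, n_bins=20):
    counts, edges = np.histogram(samples, bins=n_bins,
                                 range=(samples.min(), samples.max()))
    widths = np.diff(edges)
    density = counts / (len(samples) * widths)   # (n_j / n) / |bin j|
    return density, edges

samples = np.random.default_rng(0).normal(size=1000)
density, edges = histogram_density(samples)
print(density @ np.diff(edges))   # = 1: the estimate integrates to one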
Fig. 2 shows the kernel density estimation in the one-dimensional case. The kernel density estimation model does not rely on prior knowledge about the data distribution and adds no assumptions about it; it studies the distribution characteristics directly from the data sample, and is therefore highly valued in both statistical theory and applications. As shown in fig. 2, the kernel density estimation model places a kernel function at each sample point and sums all kernel functions to obtain the final density estimate; the abscissa represents the range of x and the ordinate represents the density value at x. The kernel function can take various forms: fig. 3(a) shows common kernel function forms and fig. 3(b) shows Gaussian kernel functions (Gaussian Kernel) at different bandwidths. The most commonly used kernel is the Gaussian kernel function, and the bandwidth is an important parameter in kernel density estimation; the quality of the density estimate is closely related to the bandwidth selection. Here, Box refers to a constant density function, Epanechnikov refers to the Epanechnikov kernel function, and Gaussian refers to the Gaussian density function.
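Similarly, a minimal sketch of one-dimensional Gaussian kernel density estimation follows; the bandwidth value is illustrative and not prescribed by the patent.

import numpy as np

def gaussian_kde(samples, query_points, bandwidth=0.3):
    # (n_query, n_sample) matrix of scaled distances between queries and samples
    u = (query_points[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    # average the Gaussian bumps and rescale by the bandwidth
    return kernel.mean(axis=1) / bandwidth

samples = np.random.default_rng(0).normal(size=500)
grid = np.linspace(-4.0, 4.0, 201)
estimate = gaussian_kde(samples, grid)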
Since the amount of large data is enormous, the data is usually divided before data estimation. The traditional large data density estimation method basically adopts feature space division.
Feature space division means dividing the data into a plurality of subsets according to the attributes of the features and a certain rule; each subset may contain several sample points or none at all. Existing feature space partitioning techniques include random partitioning and adaptive partitioning. Random partitioning makes full use of randomness, so that the partition results are diverse and the characteristics of the data can be learned from multiple aspects, but it does not use sample information; existing examples of random partitioning include purely random partitioning, histogram transform partitioning, and the like. Adaptive partitioning takes sample information into account in the partitioning criterion and can improve the accuracy and efficiency of the model on real data, but it lacks the diversity of random partitioning; the most widely applied adaptive partitions include polygon partitioning and the like.
Fig. 4 shows a specific partitioning diagram of purely random partitioning. The key points of purely random partitioning are that, at every split of the samples, the node to be cut, the dimension to be cut, and the cut point are all selected at random. In each split, the node to be cut indicates which grid is divided further, the dimension to be cut indicates along which dimension of the data in that node a hyperplane is constructed to divide the samples, and the cut point indicates the specific position of the dividing hyperplane along the dimension to be cut within the node, quantified by the ratio of the side length of the divided grid to the side length of the grid before division along that dimension.
Fig. 5(a), 5(b), 5(c) are schematic diagrams of possible results of adaptive histogram transform partitioning.
The histogram transformation projects sample points in an original space into a transformation space through rotation, stretching and translation transformation, and the rotation angle, the stretching degree, the translation direction and the translation size are random when the transformation is carried out, then the sample points are divided in the transformation space according to integer points of all dimensions, and then the divided grids are projected back into the original space, so that one division of the samples in the original space is obtained. Fig. 5(a) shows the distribution of samples in the original space, fig. 5(b) shows the distribution of samples after rotational transformation, and fig. 5(c) shows the distribution of samples after stretching and translation transformation, and the transformed samples are divided according to integer points on the basis of the distribution.
In the conventional large data density estimation method, the extreme values in the data need to be determined and rejected. Extreme values are abnormally large or small values in one or more numerical variables contained in a data set, or labeled abnormal data (outliers) in categorical data; they are unreasonable data produced by improper handling of real data during recording, measurement, experiment, or data processing. The existence of abnormal values can have a number of adverse effects on the statistical analysis of the data, such as reducing the persuasiveness and credibility of the data statistics or models. Therefore, determining and removing outliers in the data is an important part of building the model.
In numerical data, extreme value determination is usually based on the absolute magnitude of the values. The sample data is arranged according to the absolute size of the values, and three quartiles are taken from the ordered series, labeled Q1, Q2, and Q3 from small to large, i.e., the values at the 25%, 50%, and 75% positions of the data. Values greater than Q3 + α·IQR or smaller than Q1 - α·IQR are generally considered extreme values, where α is a parameter empirically chosen as 1.5 and IQR is the difference between Q3 and Q1, known as the interquartile range. Fig. 6 shows a box plot of extreme value determination, illustrating the quartiles of the data and the determination boundaries for extreme values. In fig. 6, the horizontal line in the middle of the box is the median, the upper and lower edges of the box are the upper quartile Q3 and the lower quartile Q1, respectively, and the upper and lower "T"-shaped extensions of the box are the extreme value determination boundaries, representing Q3 + 1.5·IQR and Q1 - 1.5·IQR, respectively. The "o" points represent extreme points.
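The quartile rule above can be written compactly; the sketch below uses alpha = 1.5 and illustrative data, both assumptions rather than values taken from the patent.

import numpy as np

def extreme_value_mask(values, alpha=1.5):
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1                                  # interquartile range
    lower, upper = q1 - alpha * iqr, q3 + alpha * iqr
    return (values < lower) | (values > upper)     # True marks an extreme value

data = np.array([1.2, 0.9, 1.1, 1.0, 8.5, 1.3])
print(extreme_value_mask(data))   # flags 8.5 as the only extreme value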
In the conventional large data density estimation method, the accuracy of the data also needs to be judged. Accuracy judgment applies to prediction models: after the model is trained on a portion of the samples, the target values of all samples are predicted, the sample target values are compared with the model predictions via a chosen accuracy evaluation function, and an accuracy assessment is produced for each sample. An accuracy threshold is selected according to the model requirements or by an expert, and sample points whose accuracy falls below the threshold are regarded as extreme points.
The above-described prior art has the following problems or points to be improved:
1) In the prior art, traditional histogram density estimation is a non-continuous, non-smooth density estimation method without a derivative; it loses the spatial relationship between samples, which is unfavorable for analysis. Second, the density function estimated by a histogram is easily affected by how the subinterval boundaries and widths are chosen: for a fixed data set, different boundary choices can yield results with large differences. Finally, the effect of histogram density estimation is also affected by the distribution characteristics of the original data; for example, for heavy-tailed data, histogram density estimation cannot achieve high accuracy in the tail region. For kernel density estimation, the subinterval boundary width is tied to the kernel function and does not take the characteristics of the data into account; in this case, extreme values in the data can strongly disturb the result of the kernel density estimate. Moreover, although kernel density estimation solves the continuity problem, in sparse samples it also suffers from the problem of "density dips".
2) Among existing feature space partitioning techniques, random partitions such as purely random partitioning and histogram transform partitioning do not consider sample information, so they lack adaptivity and the partition efficiency is low: low-density regions of the sample are divided more often than necessary, while high-density regions are divided fewer times than necessary, so the estimation accuracy there is poor. Adaptive partitioning techniques such as polygon partitioning, on the other hand, lose the possibility of generating multiple partitions because they lack randomness. The prior art does not yet offer a partitioning model that has both random diversity and adaptivity.
3) In the existing density estimation model for processing high-dimensional data, neither histogram density estimation nor kernel density estimation can solve the problem that the estimation density is generally small due to the sparsity of high-dimensional space samples. In many practical problems, the support of high-dimensional data can be reduced to low dimension, but both the histogram density estimation method and the kernel density estimation method can only predict in a high-dimension feature vector space, and not only the training speed is slow, but also the estimation effect is poor.
Regression analysis on big data:
regression analysis is the most important basic idea in data analysis and one of the most statistically important theories, and most data analysis problems can be modeled as a regression analysis problem. Regression analysis is analysis for studying the correlation between independent variables and dependent variables, wherein three keywords are correlation, independent variables and dependent variables.
Dependent variables are variables that change as the independent variable changes. In practical applications, the dependent variable characterizes the core appeal of a task and is a key object of scientific research, for example, in the problem of predicting the song release years, people regard the song release years as the dependent variable.
The independent variable is a related variable for explaining the dependent variable, and may be one or more, and may also be generally referred to as an explanatory variable. The task of regression analysis is to try to explain the forming mechanism of the dependent variable by researching the correlation between the independent variable and the dependent variable, thereby achieving the purpose of predicting the dependent variable through the independent variable. For example, in the problem of predicting the release year of a song, 90 independent variables respectively represent the mean value, covariance and the like of timbre, and the regression aims to find the relationship between the timbre of the song and the release year of the song based on various timbre characteristics of the existing song and predict the release year of a new song through an established regression model.
Fig. 7 shows a schematic of the general data (left) and its regression model (right).
As shown in fig. 7, the scatter points represent known data, the horizontal axis represents the independent variable, and the vertical axis represents the dependent variable (for convenience of illustration both are assumed one-dimensional; in practice they may be multi-dimensional). Both lines can serve as regression models of the data, but how is it determined which one works better? To measure the prediction effect of a regression model, the data is divided into two parts: one part, used to discover the rules, is called the training set; the regression prediction model is then tested on the other part, the test set. The regression effect is usually measured by the mean square error (MSE), i.e., how much the predicted results differ from the actual results on the test set, as shown in fig. 8, which shows the regression of data with a straight line (left) and its mean square error (right). The scatter points represent known data, the horizontal axis represents the independent variable, the vertical axis represents the dependent variable, and the straight line represents the regression model; the mean square error can be regarded as the average area of the small squares in the right graph, and the smaller the mean square error, the better the regression prediction effect.
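The evaluation just described amounts to the following sketch, which uses a straight-line fit and an illustrative 70/30 split (both assumptions, not choices made in the patent): fit on the training set, predict on the test set, and report the mean square error.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

n_train = 140                                   # 70% training set, 30% test set
x_tr, y_tr = x[:n_train], y[:n_train]
x_te, y_te = x[n_train:], y[n_train:]

coeffs = np.polyfit(x_tr, y_tr, deg=1)          # straight-line regression model
y_pred = np.polyval(coeffs, x_te)

mse = np.mean((y_te - y_pred) ** 2)             # mean square error on the test set
print(mse)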
Common statistical regression methods can be classified into linear regression, logarithmic linear regression, polynomial regression and the like according to different regression models, but the traditional regression methods usually have strong assumptions on the models, actual data are likely to be far away from the assumptions, and the fixed regression models are difficult to solve the complex regression problem, so that the prediction accuracy is low. In addition, these methods are not suitable for solving the large-scale regression problem, i.e. the data size is very large, and the traditional regression method often needs a long running time, and even the regression result may not be obtained due to insufficient computing resources.
Since the amount of large data is enormous, data is generally divided and extreme values in the data are determined and discarded before data regression is performed. The conventional big data large scale regression method basically employs feature space division, such as the division method described above with reference to fig. 4-5. The extreme value in the data is conventionally determined and discarded, as in the extreme value determination method described above with reference to fig. 6.
In addition, in data regression, a polygon division method is also employed. Fig. 9 shows a polygon division diagram (left), with the water cube being an example (right).
As shown in fig. 9 (left), the polygon partition is based on the nearest neighbor rule: each cell is a polygon bounded by the perpendicular bisectors of the segments connecting neighboring control points. First, a subset of the sample data is selected as a group of control points by simple random sampling, i.e., samples are drawn one by one with equal probability at each draw; any point within a polygon is then closer to that polygon's control point than to the control points of the other polygons. The Water Cube of the Beijing Olympic Games was designed based on this division principle (fig. 9 (right)).
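A minimal sketch of the nearest-neighbour rule behind polygon (Voronoi) partitioning follows: control points are drawn by simple random sampling and every sample is assigned to the cell of its closest control point; the number of cells and the data are illustrative.

import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=(1000, 2))

n_cells = 10
control_points = samples[rng.choice(len(samples), size=n_cells, replace=False)]

# distance from every sample to every control point, then pick the nearest one
dists = np.linalg.norm(samples[:, None, :] - control_points[None, :, :], axis=2)
cell_of_sample = dists.argmin(axis=1)            # Voronoi cell index per sample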
The traditional big data regression method generally adopts a regression model based on a fixed division. When performing regression analysis on large-scale data, the common approach is to divide the feature space, call a local regression model in each division grid, and finally splice the models in the grids together. The splicing process, however, creates the problem that the regression model is discontinuous at the division boundaries; the latest techniques, such as the stitching Gaussian process spatial interpolation method and the polygon partition support vector machine method, provide partial solutions to this problem. Fig. 10 shows a schematic of the stitching Gaussian process spatial interpolation (left) and Gaussian process regression (right).
Stitching Gaussian process spatial interpolation (Patchwork Kriging): this method was proposed by Park and Apley in 2018. It first divides the sample space according to the features (i.e., the independent variables of the regression problem), performs Gaussian process regression within each division grid (i.e., the sample data points are divided into several groups and only the data in each group is used to construct a local regression model), and finally splices the regression models of the groups into an overall regression model. This method can suffer from the regression model being discontinuous at the division boundaries, as shown in fig. 10 (left). Gaussian process regression is a non-parametric statistical regression method whose result gives not only the regression model but also the interval in which the prediction may fall, as shown in fig. 10 (right). To solve the discontinuity of the overall regression model at the division boundaries, the method manufactures artificial observations around the division boundary so as to force the local regression models on the two sides of the boundary to take equal values.
FIG. 11 shows a schematic diagram of support vector machine regression: linear regression in a high-dimensional space is equivalent to a non-linear regression in the original feature space. Polygon Partition Support Vector Machine method (Voronoi Partition Support Vector Machine): this method divides the sample space by polygon partitioning, then performs regression with a support vector machine in each division grid (as shown in fig. 11, the support vector machine is a machine learning model suited to regression on high-dimensional data, i.e., regression problems with many independent variables; its prediction accuracy is high, but its running speed on large-scale data is low), and finally combines the regression models of the grids into a regression model on the whole sample space. In the figure, φ denotes the transformation function from the low-dimensional space to the high-dimensional space.
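For the local regression step, a sketch of fitting support vector machine regression to the samples of a single division grid is given below, using scikit-learn's SVR; the hyperparameters and data are illustrative, not values from the patent.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x_cell = rng.uniform(0.0, 1.0, size=(300, 2))    # samples falling in one grid
y_cell = np.sin(3.0 * x_cell[:, 0]) + 0.1 * rng.normal(size=300)

local_model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(x_cell, y_cell)
y_hat = local_model.predict(x_cell)              # local predictions in this grid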
The above-described prior art has the following problems or points to be improved:
1) Conventional regression methods usually make relatively strong assumptions about the model, such as assuming a specific form of the regression model (for example that the regression equation follows a linear, polynomial, or exponential model), or making assumptions about the data structure, such as assuming that the residuals follow a certain known distribution or that the data is sparse. The actual data is likely to be far from these assumptions, and a fixed simple regression model has difficulty solving complex regression problems, resulting in low prediction accuracy. In addition, these methods are not suitable for large-scale regression: when the data volume is very large, the traditional regression methods often require a long running time, and the regression result may even be unobtainable due to insufficient computing resources.
2) Among existing feature space partitioning techniques, random partitions such as purely random partitioning and histogram transform partitioning do not consider sample information, so they lack adaptivity and the partition efficiency is low: low-density regions of the sample are divided more often than necessary, while high-density regions are divided fewer times than necessary, so the estimation accuracy there is poor. Adaptive partitioning techniques such as polygon partitioning, on the other hand, lose the possibility of generating multiple partitions because they lack randomness. The prior art does not yet offer a partitioning model that has both random diversity and adaptivity.
3) Among existing models for handling large-scale regression, the stitching Gaussian process spatial interpolation method and the polygon partition support vector machine method both use a divide-then-combine approach, but the selection of the division boundaries involves subjective factors, and the regression model cannot be made fully continuous and smooth at the division boundaries, which affects the prediction accuracy. In addition, the stitching Gaussian process spatial interpolation method cannot be combined with parallel computation, so its running speed on large-scale data is low.
Therefore, a large data density estimation method and apparatus, and a large scale regression method and apparatus, which adopt adaptive random data division and can perform model integration, are urgently needed.
The big data density estimation of the invention mainly comprises two parts: performing multiple self-adaptive random division and establishing local and overall density estimation models; and integration of the overall density estimation model under different partitions. The method mainly comprises the following steps: firstly, generating a plurality of times of self-adaptive random division; in each division, constructing a local density estimation model by using the samples in each division grid respectively; splicing the local density estimation models together to obtain an overall density estimation model under a certain random division; and finally integrating the overall density estimation model under the multiple divisions. The large data density estimation of the present invention may be embodied as a method or apparatus.
The adaptive random division may be an adaptive purely random division, an adaptive histogram transform division, or the like. The local density estimation model takes the ratio of the proportion of samples in each grid (relative to the total number of samples) to the size of the grid. Model integration takes the combined result of the overall density estimation models under a plurality of random divisions as the final output of the model; the integration may use a simple average, a weighted average, or the like.
Fig. 12 shows a flowchart of a density estimation method based on adaptive random partitioning and simple model integration according to a first embodiment of the present invention.
In step 1201, a large data set D to be trained is input. In this embodiment, the number of adaptive random divisions of the space is T, so T space division operations are required, and the randomly generated adaptive space division operations are completed in a loop from 1 to T. That is, in step 1203, the division counter t is initialized to 1; in step 1205, it is determined whether t is smaller than T; a "tree" is generated for each divided space. If the determination in step 1205 is yes, then in step 1207 an adaptive division of the sample space is randomly generated for the tree, a local density estimation model is obtained in each division grid (for example through simple averaging or weighted averaging), and the local density estimation models of the grids are pieced together to obtain the t-th overall density estimation model. Then, in step 1209, t is incremented by 1, and the process returns to step 1205.
If the determination at step 1205 is negative, then at step 1211, the T global density estimation models are integrated, for example, by averaging. At step 1213, the integrated density estimation model is output.
The local density estimation model in step 1207 takes, for each grid, the ratio of the proportion of samples falling in the grid (relative to the total number of samples) to the size of the grid, that is, for a grid A_j containing n_j of the n samples,

f̂(x) = (n_j / n) / |A_j|  for x ∈ A_j,

where |A_j| denotes the size (volume) of the grid A_j.
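A minimal sketch of this per-grid density value for an axis-aligned grid follows; the grid bounds and data are illustrative.

import numpy as np

def cell_density(samples, lower, upper):
    # samples: (n, d); lower/upper: (d,) bounds of one axis-aligned grid A_j
    inside = np.all((samples >= lower) & (samples < upper), axis=1)
    volume = np.prod(upper - lower)
    return inside.sum() / (len(samples) * volume)   # (n_j / n) / |A_j|

samples = np.random.default_rng(0).uniform(0.0, 1.0, size=(500, 2))
print(cell_density(samples, np.array([0.0, 0.0]), np.array([0.5, 0.5])))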
in the model integration of step 1211, each adaptive random partition is performed to obtain a corresponding overall density estimation model, and a plurality of overall density estimation models may be integrated by using a plurality of integration methods. The most common method is to take the average value of all the whole density estimation models as the integrated model, and other possible integration methods include a weighted average method and the like, namely, the weight of the density estimation models based on different partitions in the integrated model can be changed.
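Once the T overall models have been evaluated on the query points, the integration step is a single averaging operation; the sketch below covers both the simple average and a weighted average.

import numpy as np

def integrate_models(model_values, weights=None):
    # model_values: (T, n_query) array, row t holds the t-th overall density
    # model evaluated at the query points
    if weights is None:
        return model_values.mean(axis=0)              # simple average
    weights = np.asarray(weights, dtype=float)
    return (weights / weights.sum()) @ model_values   # weighted average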
The adaptive partition of the sample space generated randomly for each tree in step 1207 above may be implemented by two methods, namely, an adaptive pure random partition and an adaptive histogram transform partition.
Fig. 13 shows a schematic diagram comparing a conventional pure random partitioning with an adaptive pure random partitioning according to the present invention.
In the conventional pure random partition of fig. 13 (left side), from the perspective of experimental effect, since no sample information is used in the pure random partition, the partition efficiency may be low, that is, the number of times of dividing the sample low-density area is more than the required number of times; in addition, the estimation accuracy of the sample high-density area may be poor, that is, the sample high-density area is divided less times than required.
As an improvement, in the adaptive purely random partitioning method of the present invention shown in fig. 13 (right side), an adaptive purely random partitioning criterion is used: before each division, a portion of the samples is first drawn at random from the whole sample set, the node into which most of these samples fall is selected as the node to be cut, and once the node to be cut is determined, the dimension to be cut and the position of the cut point are selected at random.
The pure random division does not consider the information of the data, the division efficiency is low, the self-adaptive pure random division can adjust the division result according to the sample distribution, more grids are divided in places with high data density, the division is sparse in places with low data density, and the model effect is greatly improved.
FIG. 14 shows a flow chart of the adaptive pure stochastic partition method employed in step 1207 of the density estimation method based on adaptive stochastic partition and simple model integration according to the first embodiment of the present invention.
Referring to fig. 14, in step 1401, a data set D to be trained is input, and the number of splits (which may also be referred to as the number of cuts) constituting the adaptive random division is set to p, so that p division operations are required, and the randomly generated adaptive division operations are completed in a loop from 1 to p. That is, in step 1403, the split counter i is initialized to 1; in step 1405, it is determined whether i is less than p, and if the determination in step 1405 is yes, then in step 1407 the i-th adaptive division operation is performed. Then, in step 1409, i is incremented by 1, and the process returns to step 1405. Before each division, t sample points are randomly selected in advance, the grid currently containing the most of these sample points is selected as the grid to be divided (also called the grid to be cut), and the division dimension (also called the cutting dimension) and the cut point are selected at random.
If the determination at step 1405 is negative, then at step 1411, p times the adaptive purely random partitioning result is output.
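A sketch of the adaptive purely random division of fig. 14 under the reading above (an assumed implementation with axis-aligned grids) follows: draw t probe points, split the grid containing the most probes along a random dimension at a random cut point, and repeat p times.

import numpy as np

def adaptive_pure_random_partition(samples, p_splits=8, t_probes=32, seed=0):
    rng = np.random.default_rng(seed)
    d = samples.shape[1]
    cells = [(samples.min(axis=0), samples.max(axis=0))]  # start from bounding box
    for _ in range(p_splits):
        probes = samples[rng.choice(len(samples), size=t_probes)]
        # grid to be cut: the grid containing the most probe points
        counts = [np.sum(np.all((probes >= lo) & (probes <= hi), axis=1))
                  for lo, hi in cells]
        lo, hi = cells.pop(int(np.argmax(counts)))
        dim = rng.integers(d)                              # random cutting dimension
        cut = rng.uniform(lo[dim], hi[dim])                # random cut point
        hi_left, lo_right = hi.copy(), lo.copy()
        hi_left[dim], lo_right[dim] = cut, cut
        cells += [(lo, hi_left), (lo_right, hi)]
    return cells                                           # list of (lower, upper) grid bounds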
FIG. 15 is a flowchart of the adaptive histogram transform partitioning method employed in step 1407 of the density estimation method based on adaptive stochastic partitioning and simple model integration according to the first embodiment of the present invention.
In the adaptive histogram transform partitioning method according to the present invention, more sample information is used in the histogram transform partitioning process, yielding an adaptive histogram transform partition. Starting from the transformed sample data, each division considers the number of samples in the existing grids and only further divides the grids whose sample count is greater than a certain value (say m). When dividing, the dimension to be divided is selected so that the sample variance becomes as small as possible, and therefore the dimension with the largest ratio of the sample range to the variance is chosen; in that dimension, if the sample mean falls to the left of the 0.6 quantile, the cut point is chosen as the 0.618 quantile of the data, otherwise the 0.382 quantile is chosen. Division stops once the number of samples in all grids is less than m.
Referring to fig. 15, in step 1501, a data set D to be trained is input, and the limit m on the number of sample points in a grid after division is set. In step 1503, a rotation angle, a stretching degree, and a translation vector are randomly generated; the original training data is transformed into training data in the new space through the randomly generated rotation, stretching, and translation transformations (i.e., rotation, stretching, and translation are performed randomly). In step 1505, it is determined whether there is a grid with more than m sample points. If the determination in step 1505 is yes, then in step 1507, the dimension with the largest sample variance is selected as the dimension to be divided, the median of the data in that dimension is selected as the cut point, and the grid is divided. The process then returns to step 1505 and loops until the number of sample points in all grids is less than m.
If the determination at step 1505 is no, then at step 1511 the partitioned grid is projected back into the original sample space based on the inverse transform corresponding to the generated transform.
At step 1513, the result of the adaptive histogram conversion partition is output.
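A sketch of the adaptive histogram transform division of fig. 15 under the reading above (an assumed implementation) follows: apply a random rotation, stretch, and translation, then repeatedly split any grid holding more than m points along its largest-variance dimension at the median; the grids are represented simply as sets of sample indices.

import numpy as np

def adaptive_histogram_transform_partition(samples, m=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = samples.shape
    # randomly generated rotation (orthogonal matrix), stretch, and translation
    rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
    stretch = rng.uniform(0.5, 2.0, size=d)
    shift = rng.uniform(-1.0, 1.0, size=d)
    z = (samples @ rotation) * stretch + shift             # transformed samples

    pending, done = [np.arange(n)], []
    while pending:
        idx = pending.pop()
        if len(idx) <= m:
            done.append(idx)
            continue
        dim = int(np.argmax(z[idx].var(axis=0)))           # largest-variance dimension
        cut = np.median(z[idx, dim])                       # median as the cut point
        left, right = idx[z[idx, dim] <= cut], idx[z[idx, dim] > cut]
        if len(left) == 0 or len(right) == 0:              # guard against degenerate splits
            done.append(idx)
            continue
        pending += [left, right]
    return done                                            # list of index arrays, one per grid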
Fig. 16 shows a flowchart of a density estimation method based on a histogram transformation division and boosting algorithm according to a second embodiment of the present invention.
Unlike the density estimation model based on the adaptive random partitioning and integration algorithm according to the first embodiment of the present invention shown in fig. 12, the density estimation model based on the histogram transform partitioning and boosting algorithm according to the second embodiment of the present invention moves the deep extraction of sample information from the spatial partitioning step to the density estimation step. In each iteration, the model first partitions the data feature space according to the adaptive histogram transform partitioning method illustrated in fig. 15 to obtain non-overlapping small regions, and builds a local density estimation model on each small region. It should be noted that the model built in each small region is obtained by weighting an element of a certain density function space against the overall density estimation model obtained in the previous iteration. Since density estimation requires the integral of the density function over its domain to be 1, the previously estimated density function and the newly estimated density function are combined with normalized weights, and the weight between the two parts is adjusted to optimize the loss function in this unsupervised setting. Specifically, let the density function space be H, let the loss function be L(F) = -E[log F(X)] (the negative expected log-density), and let the overall density function obtained in the previous iteration be F^(t-1)(x). On a given region, the objective optimization function is

(f_t, α*) = argmin_{f ∈ H, α ∈ [0, 1]} -E[ log( (1 - α) F^(t-1)(X) + α f(X) ) ],

where f ranges over the density function space H and α ∈ [0, 1] is the mixing weight. The resulting local density estimation model then translates the expectation into a sample-weighted empirical form:

(f_t, α*) = argmin_{f ∈ H, α ∈ [0, 1]} -(1/n) Σ_{i=1}^{n} log( (1 - α) F^(t-1)(x_i) + α f(x_i) ),

F^(t)(x) = (1 - α*) F^(t-1)(x) + α* f_t(x).

After the local density estimation model is obtained, the overall density estimation model of this iteration is finally determined by normalized addition.
Referring to fig. 16, in step 1601, a large data set D to be trained is input, the total number of iterations of the lifting algorithm is set to T, and the weak density function space used in the lifting algorithm is H. In this embodiment, the total number of iterations of the lifting algorithm is T, and the operation of the lifting algorithm is completed in a loop from 1 to T. I.e. in step 1603, the number of lifting algorithm iterations t is initialized to 1.
In step 1605, it is determined whether the number of iterations T of the lifting algorithm is less than T.
If the determination at step 1605 is yes, then at step 1607, the following operations are performed:
randomly generating adaptive histogram transformation division for a sample space;
in each division grid, inheriting the density estimation model F(t-1)(x) of the previous iteration, and calculating the weight of each sample point contained in the division as the reciprocal of the density function value obtained in the previous iteration, i.e. wi = 1/F(t-1)(xi);

selecting the optimal density estimator of this iteration from the function space;

combining the density estimation function obtained in the previous iteration with the density estimator selected this time by weighting, and calculating the optimal weight proportion from the empirical distribution, for example by selecting an element ft of the function space H and a weight α in each division grid that minimize the local density estimation loss function L, thereby constructing the new local density estimation model F(t)(x) = (1 - α*)F(t-1)(x) + α* ft(x) and obtaining the local density estimation model;

the local density estimation models in the grids are spliced together to obtain the overall density estimation model F(t)(x).
Then, in step 1609, t is incremented by 1, and the process returns to step 1605.
If the determination at step 1605 is negative, at step 1611, the boosted density estimation model after T boosts is output.
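As a hedged illustration of the loop in steps 1601 to 1611, the sketch below boosts a one-dimensional density estimate by repeatedly mixing in a weak histogram estimator with a weight chosen to minimize the empirical negative log-likelihood; the equal-width histogram, the flat initial density and the grid search over α are simplifying assumptions, not the adaptive histogram transform partition of this embodiment.

import numpy as np

def histogram_density(x_train, bins=16):
    # Weak density estimator: an equal-width histogram over the training range.
    counts, edges = np.histogram(x_train, bins=bins, density=True)
    def f(x):
        i = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
        return np.maximum(counts[i], 1e-12)
    return f

def boost_density(x_train, T=10, alphas=np.linspace(0.05, 0.95, 19), seed=0):
    rng = np.random.default_rng(seed)
    F = lambda x: np.ones_like(np.asarray(x, dtype=float))   # crude flat start on a unit-length support
    for _ in range(T):
        # stand-in for the per-iteration random partition: a histogram on a bootstrap resample
        f_t = histogram_density(rng.choice(x_train, size=x_train.size, replace=True))
        nll = lambda a, F=F, f_t=f_t: -np.mean(np.log((1 - a) * F(x_train) + a * f_t(x_train)))
        a_star = min(alphas, key=nll)                         # empirical optimal mixing weight alpha*
        F = (lambda F, f_t, a: lambda x: (1 - a) * F(x) + a * f_t(x))(F, f_t, a_star)
    return F

# usage sketch: F_hat = boost_density(np.random.randn(1000)); F_hat(np.array([0.0, 1.0]))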
Fig. 17 shows a block diagram of a density estimation apparatus based on adaptive random partitioning and simple model integration according to a third embodiment of the present invention.
The density estimation apparatus according to the third embodiment implements the density estimation method according to the first embodiment described with reference to fig. 12. Referring to fig. 17, the density estimation apparatus according to the third embodiment includes a data input module 1701, a sample adaptive partitioning and density estimation model calculation module 1703, and a density estimation model integration module 1705.
In the above method, we assume that the input domain X ⊂ Rd is compact and non-empty. For a given R > 0, we write BR for the cube in Rd of side length 2R, i.e. BR := [-R, R]d := {x = (x1, ..., xd) ∈ Rd : xi ∈ [-R, R], i = 1, ..., d}, and we fix r ∈ (0, R/2). For 1 ≤ p < ∞, the p-norm of x = (x1, ..., xd) is defined as ||x||p := (|x1|^p + ... + |xd|^p)^(1/p), and the ∞-norm is defined as ||x||∞ := max over i = 1, ..., d of |xi|. For any x ∈ R, ⌈x⌉ denotes the smallest integer greater than or equal to x. For multi-index vectors we use the usual componentwise notation.
The negative log-loss function is explained below:

Let f be the density function of the unknown probability measure P on X. Based on the independent and identically distributed data set D drawn from the distribution P, we aim to construct a measurable function with integral value 1 as the density estimate f̂. We use the negative log-loss, defined as L(f̂, x) := -log f̂(x), to measure how good the density estimate is.
The histogram transformation is explained below:

To clarify the construction of the histogram transformation, we introduce a random triple (R, S, b), whose elements R, S and b represent a rotation matrix, a stretching matrix and a translation vector, respectively. Specifically, R denotes a rotation matrix, a real-valued d × d orthogonal matrix with unit determinant, i.e. the transpose of R equals the inverse of R and det(R) = 1. S denotes a stretching matrix, a positive real-valued d × d diagonal scaling matrix whose diagonal elements si, i = 1, ..., d, are random variables. We collect the diagonal elements into the vector s = (s1, ..., sd) and define the bin-width vector on the input space as h := s^(-1). In addition, b ∈ [0, 1]d is a d-dimensional vector which we call the translation vector. The histogram transform H is then defined as

H(x) := R · S · x + b.
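The short sketch below (an illustrative assumption, not the claimed construction) shows how a concrete (R, S, b) triple can be drawn and how H(x) = R·S·x + b induces histogram cells by taking the integer part of the transformed coordinates; the stretching range is arbitrary and the QR-based rotation ignores the determinant correction for brevity.

import numpy as np

def histogram_transform_bins(X, rng=None):
    # Assign each sample the integer cell index of H(x) = R·S·x + b.
    rng = rng or np.random.default_rng()
    d = X.shape[1]
    R = np.linalg.qr(rng.normal(size=(d, d)))[0]   # orthogonal matrix (rotation up to a reflection)
    s = rng.uniform(1.0, 3.0, size=d)              # stretching factors; the bin widths are h = 1/s
    b = rng.uniform(size=d)                        # translation vector in [0, 1]^d
    H = X @ (R @ np.diag(s)).T + b
    return np.floor(H).astype(int)                 # samples with identical rows share one cell

# usage sketch: cells = histogram_transform_bins(np.random.randn(1000, 2))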
Histogram transform boosting density estimation is explained below:
in this section we focus mainly on boosting algorithms equipped with histogram transform density estimators. We use histogram transformation as a basis learner, which is a weak predictor and has high computational efficiency.
We first introduce the histogram transformation function space. Suppose that {Ht, t = 1, 2, ...} is an independent and identically distributed sequence of histogram transformations, where each Ht is drawn from some probability measure PH. As described above, the lifting algorithm may be viewed as an iterative method for convex optimization of the empirical loss function.

Based on the above description, we propose a gradient boosting algorithm to solve the optimization problem of the empirical loss function, and the randomness of the histogram transformation provides an effective step for boosting. The algorithm proceeds iteratively: for t = 1, ..., T,

Ft(x) = (1 - αt) Ft-1(x) + αt ft(x),

where Ft denotes the density estimate obtained after the t-th iteration, ft denotes the t-th base learner, and the iteration step size satisfies αt ∈ (0, 1). A simple calculation shows that Ft can be written as a weighted combination of the base learners f1, ..., ft (and of the initial estimate), with weights

wt,j = (1 - αt) ··· (1 - αj+1) · αj, j = 1, ..., t.
Our aim is to search for the base learner ft and the iteration step size αt used at each step so that the loss corresponding to Ft becomes smaller after each iteration. In the t-th iteration, for an arbitrary αt ∈ (0, 1), the minimization of the empirical loss is equivalent to minimizing

-(1/n) Σ_{i=1}^{n} log( 1 + εt f(xi)/Ft-1(xi) ),

where εt = αt/(1 - αt). Applying a Taylor expansion to it, we obtain

-(εt/n) Σ_{i=1}^{n} ωt,i f(xi) + O(εt²),

where ωt,i = 1/Ft-1(xi). For a sufficiently small εt (or, equivalently, αt), we can ignore the higher-order terms and find the optimal maximum-gradient direction as

ft = argmax over f in the function space of (1/n) Σ_{i=1}^{n} ωt,i f(xi).

We then find the step size αt by a line search to ensure that the updated learner Ft is still a probability density. Here O(·) denotes terms of the same order.
In summary, the density estimation performed by the histogram transform boosting method follows the iterative procedure described above.
the data input module 1701 inputs a large data set D to be trained and the number of adaptive random division spaces T. Therefore, T times of space division operations are required, and randomly generated adaptive space division operations are completed in a loop from 1 to T.
In the sample adaptive partitioning and density estimation model calculation module 1703, a "tree" is generated for each of the T spatial partitions of the large data set D: an adaptive partition of the sample space is randomly generated for each tree, a local density estimation model is obtained in each partition grid by, for example, simple averaging or weighted averaging, and the local density estimation models in the grids are spliced together to obtain the overall density estimation model of that partition.
In the density estimation model integration module 1705, the T whole density estimation models corresponding to the T divisions, which are obtained in the sample adaptive division and density estimation model calculation module 1703, are integrated by, for example, averaging, and the integrated density estimation model is output.
The local density estimation model takes the ratio of the proportion of samples falling in each grid (relative to the total number of samples) to the size of the grid, i.e. in a grid A the estimated density is #{xi ∈ A} / (n · |A|), where |A| denotes the volume of the grid.
In the model integration performed by the density estimation model integration module 1705, an overall density estimation model is obtained for each adaptive random division, and the multiple overall density estimation models can be integrated by a variety of integration methods. The most common method is to take the average of all the overall density estimation models as the integrated model; other possible integration methods include weighted averaging, i.e. the weights of the density estimation models based on different partitions within the integrated model can be varied.
The adaptive partitioning of the sample space is randomly generated for each tree in the sample adaptive partitioning and density estimation model calculation module 1703, and two methods of adaptive pure random partitioning and adaptive histogram transformation partitioning can be adopted.
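To make the two modules above concrete, here is a hedged sketch of one possible implementation: each random histogram-transform partition yields a piecewise-constant density (sample fraction in the grid divided by the grid volume), and the T overall models are integrated by simple averaging; the transform parameters and the dictionary-based bookkeeping are illustrative assumptions.

import numpy as np

def fit_partition_density(X, rng):
    # One overall density model: a random histogram transform plus per-grid frequency / volume.
    n, d = X.shape
    R = np.linalg.qr(rng.normal(size=(d, d)))[0]
    s = rng.uniform(1.0, 3.0, size=d)
    b = rng.uniform(size=d)
    A = R @ np.diag(s)
    vol = 1.0 / np.prod(s)                          # grid volume in the original space (|det R| = 1)
    counts = {}
    for key in (tuple(k) for k in np.floor(X @ A.T + b).astype(int)):
        counts[key] = counts.get(key, 0) + 1
    def density(x):
        key = tuple(np.floor(A @ x + b).astype(int))
        return counts.get(key, 0) / (n * vol)       # (#samples in the grid / n) / grid size
    return density

def ensemble_density(X, T=50, seed=0):
    rng = np.random.default_rng(seed)
    models = [fit_partition_density(X, rng) for _ in range(T)]
    return lambda x: float(np.mean([m(x) for m in models]))   # simple-average integration

# usage sketch: f_hat = ensemble_density(np.random.randn(2000, 2)); f_hat(np.zeros(2))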
Fig. 18 is a block diagram showing a density estimation apparatus based on a histogram transform division and lifting algorithm according to a fourth embodiment of the present invention.
The density estimation apparatus according to the fourth embodiment implements the density estimation method according to the second embodiment described with reference to fig. 16.
Referring to fig. 18, the density estimation apparatus according to the fourth embodiment includes a data input module 1801, a sample adaptive partitioning and density estimation model calculation module 1803, and a density estimation model output module 1805.
Referring to fig. 18, in the data input module 1801, a large data set D to be trained is input, the total number of iterations of the lifting algorithm is set to T, and the weak density function space used in the lifting algorithm is H.
In the sample adaptive partitioning and density estimation model calculation module 1803, the T iterations of the lifting algorithm are carried out. In each iteration, an adaptive histogram transform partition of the sample space is randomly generated; in each division grid, the density estimation model F(t-1)(x) of the previous iteration is inherited, and the weight of each sample point contained in the division is calculated as the reciprocal of the density function value obtained in the previous iteration, i.e. wi = 1/F(t-1)(xi); the optimal density estimator of this iteration is selected from the function space; the density estimation function obtained in the previous iteration and the density estimator selected this time are combined by weighting, and the optimal weight proportion is calculated from the empirical distribution, for example by selecting an element ft of the function space H and a weight α in each division grid that minimize the local density estimation loss function L, constructing the new local density estimation model F(t)(x) = (1 - α*)F(t-1)(x) + α* ft(x); and the local density estimation models in the grids are spliced together to obtain the overall density estimation model F(t)(x).
If T iterative computations of the lifting algorithm are completed, the density estimation model output module 1805 outputs the lifted density estimation model after T lifts.
Fig. 1a shows a block diagram of a probability density clustering method based on a purely random histogram transformation partitioning and integration algorithm according to another embodiment of the present invention.
The probability density clustering method based on pure random histogram transformation partitioning combines the idea of pure random histogram transformation partitioning with the probability-density-based clustering method, and is a specific application of the density estimation method based on pure random histogram transformation partitioning. In general, for an unlabeled sample data set D = {x1, ..., xn}, an estimate f̂ of the unknown distribution of the sample data source is first obtained with the density estimation method based on histogram transformation partitioning; then, using the estimated density function f̂, the sample points with higher probability density, {xi : f̂(xi) ≥ λ}, are screened out by a level-set method; and the final clustering result is obtained with the help of a cluster tree.
Referring to FIG. 1a, in step 1a01, the training data D = {x1, ..., xn}, the cluster-tree distance parameter h, and the set of level-set parameter values λ ∈ {λ1, ..., λL} are input. In step 1a03, the density estimate f̂ of the training data is calculated using the density estimation model based on pure random histogram transformation partitioning and the integration algorithm.
In this embodiment, the total number of cycles of the probability density clustering method based on the pure random histogram transformation partitioning and integration algorithm is L, and the operation of the algorithm is completed in the cycle from 1 to L. That is, in step 1a03, the loop number variable i is initialized to 1.
In step 1a05, it is determined whether the loop variable i is less than L, the number of level-set parameter values.
If the determination at step 1a05 is yes, then at step 1a07 the following operations are performed: the sample points whose probability density is greater than the level-set parameter are screened out; points at close distance (based on the cluster-tree parameter h) in the screened sample set are linked; and the connected components C of the resulting labelled graph are calculated based on the DBSCAN algorithm.
Specifically, the level-set parameter λi determines a set of nodes Vi := {xj : f̂(xj) ≥ λi} and a corresponding set of edges Ei; a graph Gi = (Vi, Ei) is then constructed from these two sets, and the connected components C(λi) of the graph Gi are calculated.
Then, in step 1a09, i is increased by 1, and the process returns to step 1a05.
If the determination in step 1a05 is no, then in step 1a11, the final clustering tree T is obtained under different level set parameters.
Next, in step 1a13, the integrated clustering tree model is output.
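A hedged sketch of steps 1a05 to 1a11 for a single level-set value is given below: the density values dens are assumed to be precomputed by any of the estimators above, the h-neighbourhood graph is built by brute force, and its connected components are found with a small union-find; all names are illustrative.

import numpy as np

def level_set_clusters(X, dens, lam, h):
    # Cluster the points whose estimated density is at least lam, linking points closer than h.
    core = np.flatnonzero(dens >= lam)
    parent = {int(i): int(i) for i in core}
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]           # path halving
            i = parent[i]
        return i
    for a in range(len(core)):
        for b in range(a + 1, len(core)):
            i, j = int(core[a]), int(core[b])
            if np.linalg.norm(X[i] - X[j]) <= h:    # edge of the h-neighbourhood graph
                parent[find(i)] = find(j)
    return {i: find(i) for i in parent}             # connected-component label per high-density point

# usage sketch: labels = level_set_clusters(np.random.randn(200, 2), np.random.rand(200), 0.5, 0.7)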
Fig. 1b shows a block diagram of a probability density clustering method based on a purely random histogram transformation partitioning and lifting algorithm according to another embodiment of the present invention.
The probability density clustering method based on pure random histogram transformation partitioning and the lifting algorithm combines the probability-density-based clustering method with the idea of pure random histogram transformation partitioning and with a lifting algorithm for improving accuracy; the accuracy of the algorithm is improved step by step during the iterative integration, and the method is a specific application of the density estimation method based on histogram transformation partitioning and the lifting algorithm. In general, for an unlabeled sample data set D = {x1, ..., xn}, an estimate f̂ of the unknown distribution of the sample data source is first obtained with the density estimation method based on histogram transformation partitioning and the lifting algorithm; then, using the estimated density function f̂, the sample points with higher probability density, {xi : f̂(xi) ≥ λ}, are screened out by a level-set method; and the final clustering result is obtained with the help of a cluster tree.
Referring to FIG. 1b, in step 1b01, the training data D = {x1, ..., xn}, the cluster-tree distance parameter h, and the set of level-set parameter values λ ∈ {λ1, ..., λL} are input. In step 1b03, the density estimate f̂ of the training data is calculated using the density estimation model based on pure random histogram transformation partitioning and the lifting algorithm.
In this embodiment, the total number of cycles of the probability density clustering method based on the pure random histogram transformation partitioning and lifting algorithm is L, and the operation of the algorithm is completed in a cycle from 1 to L. That is, in step 1b 03, the loop number variable i is initialized to 1.
In step 1b05, it is determined whether the loop variable i is less than L, the number of level-set parameter values.
If the determination at step 1b05 is yes, then at step 1b07 the following operations are performed: the sample points whose probability density is greater than the level-set parameter are screened out; points at close distance (based on the cluster-tree parameter h) in the screened sample set are linked; and the connected components C of the resulting labelled graph are calculated based on the DBSCAN algorithm.
Specifically, the level-set parameter λi determines a set of nodes Vi := {xj : f̂(xj) ≥ λi} and a corresponding set of edges Ei; a graph Gi = (Vi, Ei) is then constructed from these two sets, and the connected components C(λi) of the graph Gi are calculated.
Then, in step 1b 09, i is increased by 1, and the process returns to step 1b 05.
If the determination in step 1b 05 is negative, then in step 1b11, the final clustering tree T is obtained under different level set parameters.
Next, in step 1b13, the integrated clustering tree model is output.
Fig. 1c shows a block diagram of a probability density anomaly detection method based on pure random histogram transformation partitioning and random forest according to another embodiment of the present invention.
The probability density anomaly detection method based on pure random histogram transformation partitioning and random forest is a specific application of the density estimation method based on pure random histogram transformation partitioning and random forest. In general, for an unlabeled sample data set D = {x1, ..., xn}, an estimate f̂ of the unknown distribution of the sample data source is first obtained with this density estimation method; on the basis of the estimated density function f̂, the points whose probability density is smaller than the set probability density parameter ρ are determined to belong to the abnormal samples, i.e. the set {x : f̂(x) < ρ}.
Referring to FIG. 1c, in step 1c01, the training data D = {x1, ..., xn}, i.e. the training data set, and the density boundary parameter ρ are input. In step 1c03, the estimate f̂ of the density function of the unknown distribution is calculated using the density estimation based on pure random histogram transformation partitioning and random forest, obtaining an estimate of the density at each location. Next, in step 1c05, the set of sample points whose estimated probability density is smaller than the density boundary, i.e. the abnormal sample point set {xi : f̂(xi) < ρ}, is output.
Suppose that X ⊂ Rd is a non-empty subset, μ is the Lebesgue measure with μ(X) > 0, P is a probability measure supported on X, and P is absolutely continuous with respect to μ with density f. Let the training data D := (x1, ..., xn) be observations drawn independently and identically distributed from P. We write BR for the cube in Rd of side length 2R, i.e. BR := [-R, R]d := {x = (x1, ..., xd) ∈ Rd : xi ∈ [-R, R], i = 1, ..., d}, and we denote the complement of Br by Br^c.
For our tree-based algorithm, we first introduce the pure random tree (xTree), also called the extremely random tree, which is better suited to unsupervised learning than conventional random forests, whose impurity-based splitting criteria are designed for supervised learning problems.
Mathematically, let Z be the splitting criterion of a tree, taking values in some space, with its probability measure denoted PZ. Since the splitting of the tree takes place in the space Br, we denote the collection of nodes created after p splits (p-split) of Br by AZ,p, where Aj denotes the j-th node. This collection of all leaf nodes also represents the tree itself.
An extremely random tree (xTree) partitions the root node (the feature space) completely (extremely) at random, by randomly selecting the node to split, the dimension to split, and the split point. An xTree with p splits can be constructed iteratively, where the i-th step (i = 1, ..., p) is described by a random vector Qi := (Li, Ri, Si). The first element Li denotes the node to be split, selected with equal probability among the previously generated nodes. The second element Ri ~ Unif{1, ..., d} denotes the dimension along which Li is split, i.e. the dimension Ri is chosen uniformly over all dimensions. The third element Si ~ Unif[0, 1] describes the split point, expressed as the proportion of the length of the Ri-th dimension of the node at which the i-th split is made. Note that Qi, i = 1, ..., p, are independent and identically distributed.
In the following we assume that μ(A) > 0 for all A ∈ AZ,p, because if μ(A(x)) = 0 the density at x is estimated as 0. Let Dn denote the empirical measure of the sample. For x ∈ Aj ∈ AZ,p, the extremely random tree (xTree) density estimator may be written as

f̂Z,p(x) := ( Σi 1{xi ∈ A(x)} ) / ( n · μ(A(x)) ),

where Aj is also written as A(x); the sum in the numerator counts the observations falling in Aj. Let Z1, ..., ZT denote the splitting criteria that generate T random density trees. The density estimator of xForest may then be expressed as the average

f̂(x) := (1/T) Σ_{t=1}^{T} f̂Zt,p(x).
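For illustration, the following sketch builds one xTree density estimator with p purely random splits on the bounding box of the data and averages T of them into an xForest estimate; the brute-force cell lookup and the use of the bounding box in place of Br are simplifying assumptions.

import numpy as np

def xtree_density(X, p, low, high, rng):
    # One purely random density tree on the box [low, high] with p splits.
    n, d = X.shape
    cells = [(low.copy(), high.copy())]
    for _ in range(p):                                        # the i-th split Q_i = (L_i, R_i, S_i)
        j = int(rng.integers(len(cells)))                     # node to split, chosen uniformly
        lo, hi = cells.pop(j)
        dim = int(rng.integers(d))                            # dimension to split, chosen uniformly
        cut = lo[dim] + rng.uniform() * (hi[dim] - lo[dim])   # split point as a uniform proportion
        left_hi, right_lo = hi.copy(), lo.copy()
        left_hi[dim], right_lo[dim] = cut, cut
        cells.extend([(lo, left_hi), (right_lo, hi)])
    def density(x):
        for lo, hi in cells:
            if np.all(x >= lo) and np.all(x <= hi):
                inside = np.all((X >= lo) & (X <= hi), axis=1).sum()
                return inside / (n * np.prod(hi - lo))        # empirical mass / cell volume
        return 0.0
    return density

def xforest_density(X, p=20, T=10, seed=0):
    rng = np.random.default_rng(seed)
    low, high = X.min(axis=0), X.max(axis=0)
    trees = [xtree_density(X, p, low, high, rng) for _ in range(T)]
    return lambda x: float(np.mean([t(x) for t in trees]))

# usage sketch: f_hat = xforest_density(np.random.randn(1000, 2)); f_hat(np.zeros(2))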
Core-sample in DBSCAN is explained as follows.
A sample x ∈ D is called a core-sample if #{N(x, ε) ∩ D} ≥ MinPts, where N(x, ε) := {x' : ||x - x'|| ≤ ε}, ε is the neighborhood radius of the sample x, and MinPts, the minimum number of samples required in the neighborhood N(x, ε), is also the core-sample threshold.
The core-sample depends on two hyper-parameters ε, MinPts. On the one hand, ε is a hyper-parameter, which acts like the bandwidth in kernel density estimation. The number of samples falling within the epsilon radius neighborhood of an x point describes the relative density value of the x point. On the other hand, MinPts is a hyper-parameter, which is a threshold to determine whether a sample is a core-sample.
Now, using the xForest density estimate f̂ trained on the sample set D, we can extend the concept of core-sample to the situation where the sample density is obtained explicitly.
The core-sample in xForest is explained below.
A sample x ∈ D is called a core-sample in xForest if its density estimate satisfies f̂(x) ≥ λ, where λ is the density threshold.
In xForest, the generalization of core-sample has only one hyper-parameter associated with the set of density levels. Compared to DBSCAN, it utilizes an explicit form of density estimation.
Now, based on the xForest density estimate, we can propose an xForest clustering algorithm. First, we generate T trees with p splits each on the training data D and construct the xForest density estimator, and all samples whose density is not less than the density threshold λ are designated core samples. An ε-radius neighbor graph G is then built over all the core samples, and m clusters are derived from the m connected components of the graph G. Finally, the remaining unlabeled samples are assigned to the cluster of the core sample closest to them.
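A hedged sketch of the clustering procedure just described follows, assuming a precomputed array dens of density estimates (for example from the xForest sketch above); the use of scipy's connected_components and the brute-force distance matrices are implementation assumptions.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def xforest_cluster(X, dens, lam, eps):
    # Core samples (dens >= lam), an eps-radius graph over them, then nearest-core assignment.
    core = np.flatnonzero(dens >= lam)
    D = np.linalg.norm(X[core, None, :] - X[None, core, :], axis=-1)
    _, comp = connected_components(csr_matrix(D <= eps), directed=False)
    labels = np.full(len(X), -1)
    labels[core] = comp
    rest = np.flatnonzero(labels == -1)
    if core.size and rest.size:
        nearest = np.argmin(np.linalg.norm(X[rest, None, :] - X[None, core, :], axis=-1), axis=1)
        labels[rest] = comp[nearest]                 # attach each remaining point to its closest core sample
    return labels

# usage sketch: labels = xforest_cluster(np.random.randn(300, 2), np.random.rand(300), lam=0.4, eps=0.5)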
The runtime complexity of xForest splits into two parts: the runtime of the density estimation and the runtime of the cluster generation. For the density estimation, the runtime of each point in each tree depends largely on the depth of the tree. The average runtime complexity of xForest is O(T·d·n·log n), where the number of trees T can be taken as a constant compared with n, and its effect can be further reduced by parallel computation. However, when the xTree happens to be very unbalanced, the worst-case runtime complexity becomes O(T·d·n²). For the cluster generation, xForest is similar to DBSCAN, except that the core samples are defined by the estimated densities; the corresponding runtime complexity (with Euclidean distance) is O(d·n²). Thus, the worst-case runtime complexity of xForest is O(T·d·n²) + O(d·n²). Recalling that T is small compared with n and d, the worst-case runtime complexity is effectively O(d·n²), the same as that of the original DBSCAN.
Nevertheless, we note that xForest achieves higher accuracy than DBSCAN at the same runtime complexity, because it produces a more accurate density estimate and therefore more accurate core points. For DBSCAN, although not stated explicitly, the core points are also defined by estimating the density, namely by counting the number of neighbors within a certain radius; this estimate is rather coarse, discontinuous, and can be very sensitive to the chosen ε. In addition, the parameter ε is also used to search for connected components, and the optimal ε for that task may not match the optimal ε for density estimation. In contrast, xForest employs a more accurate density estimation procedure, the estimated density function is asymptotically smooth, and the split boundaries are smoothed by the set of trees, so it performs better on clustering tasks where the underlying density function is smooth. Moreover, xForest can obtain good local adaptivity by using the minimum sample split (min_samples_split) found in common python packages, so the local properties of the sample can be taken into account to fit more complex data structures. In this way the parameter ε is only used when finding the connected components, so its optimal value can be obtained easily.
In addition, just as DBSCAN implementations can use an R*-tree under certain conditions, our xForest is also executed by means of a tree structure, which indicates that efficient acceleration techniques for DBSCAN, such as subsampling, can likewise be transplanted into xForest. Since random forests are naturally compatible with subsampling, we can accelerate both stages, identifying core-samples and finding connected components, by subsampling, so that xForest finally reaches an accelerated runtime complexity of O(d·n·n'), where n' is the subsampling size.
The manner in which an extreme Random tree (xForest) is used for outlier detection is described below.
We define anomalies as being characterized by the degree of aggregation, which can be described by the density f. For a fixed threshold ρ > 0, the ρ-level set {f > ρ} represents a region of high aggregation, while {f ≤ ρ} is regarded as the region of low aggregation in which abnormal values lie. Our goal is therefore to estimate the set {f ≤ ρ} in order to detect anomalies among all samples, or equivalently to estimate the ρ-level set {f > ρ}. Using the xForest density estimator f̂, we construct the level-set estimate Sρ := {f̂ > ρ}, and the xForest algorithm for density-based anomaly detection is presented in the following algorithm.
Density-based outlier detection algorithm: xForest algorithm
Input: training set D := {x1, ..., xn};
the number p of splits used by each random partition;
the number T of trees in the random forest;
the density threshold ρ used for the decision.
Loop for t from 1 to T:
construct a pure random partition Zt,p of the feature space;
construct the corresponding density estimate with the xTree approach.
End the loop.
Integrate the T density estimation trees with equal weights to obtain the xForest density estimate (the average of the T tree estimates).
Output:
abnormal values: {xi : f̂(xi) ≤ ρ}.
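The anomaly-detection step itself reduces to thresholding the estimated density; a minimal sketch, assuming a callable density estimator such as the xForest sketch above:

import numpy as np

def detect_anomalies(X, density_fn, rho):
    # Flag the samples whose estimated density is at or below the threshold rho.
    dens = np.array([density_fn(x) for x in X])
    return np.flatnonzero(dens <= rho)               # indices of the abnormal samples

# usage sketch: detect_anomalies(X, f_hat, rho=0.01), with f_hat from the xForest sketch above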
Fig. 1d shows a block diagram of the probability density anomaly detection method based on the K-nearest-neighbor and histogram transformation partitioning Bagging algorithm according to another embodiment of the present invention.
The probability density anomaly detection method based on the K-nearest-neighbor and histogram transformation partitioning Bagging algorithm applies the histogram transformation partitioning Bagging algorithm to K-nearest-neighbor density estimation, improving the accuracy of the probability-density-based anomaly detection model, and is a specific application of the corresponding density estimation method. In general, for an unlabeled sample data set D = {x1, ..., xn}, an estimate of the unknown distribution of the sample data source is first obtained with the density estimation algorithm based on the K-nearest-neighbor and histogram transformation partitioning Bagging algorithm; on the basis of the estimated density function f̂, the points whose probability density is smaller than the set probability density parameter ρ are determined to belong to the abnormal samples, i.e. the set {x : f̂(x) < ρ}.
Referring to FIG. 1d, in step 1d01, the training data D = {x1, ..., xn} and the density boundary parameter ρ are input. In step 1d03, the estimate f̂ of the density function of the unknown distribution is calculated using the density estimation of the K-nearest-neighbor and histogram transformation partitioning Bagging (bagged) algorithm, obtaining an estimate of the density at each location. Next, in step 1d05, the set of sample points whose estimated probability density is smaller than the density boundary, i.e. the abnormal sample point set {xi : f̂(xi) < ρ}, is output.
Let P be a probability distribution on Rd with density f. For any x ∈ Rd and r > 0, we write Br(x) := B(x, r) := {x' ∈ Rd : ||x' - x||2 ≤ r} for the closed ball of radius r centered at x. If xn ≤ c·yn holds for some constant c > 0 and all n ∈ N*, we write xn ≲ yn.
K-nearest-neighbor (k-NN) method for density estimation

For any x ∈ Rd and a given set of independent samples Dn := {X1, ..., Xn} generated from the probability distribution P, we reorder the samples according to their increasing distance to x as D(n) := {X(1), ..., X(n)}. We then have ||X(1)(x) - x|| ≤ ... ≤ ||X(n)(x) - x||. Let Rk(x; D) denote the distance between the point x and its k-th nearest neighbor in the data set D; in particular, when D = Dn we write Rk(x; Dn) =: Rk(x). Furthermore, letting μ be the Lebesgue measure, the Lebesgue differentiation theorem gives

f(x) = lim as r → 0 of P(Br(x)) / μ(Br(x)),                (1)

which holds for almost all x. Taking r = Rk(x) in (1) and estimating P(Br(x)) by the empirical frequency k/n, we obtain the k-NN density estimate

f̂k(x) := k / ( n · μ(B(x, Rk(x))) ),

where μ(B(x, Rk(x))) is the volume of the ball of radius Rk(x) centered at x.
Bagged nearest-neighbor (BNN) method for density estimation

To improve the efficiency and accuracy of the original k-NN density estimator, we use the bagging technique: from Dn we draw, without replacement, B subsampled data sets Db, b = 1, ..., B, each of size #(Db) = m. We then integrate the B density estimators f̂k(·; Db) to obtain the BNN density estimate

f̂B(x) := (1/B) Σ_{b=1}^{B} f̂k(x; Db).                      (3)
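Below is a hedged sketch of the k-NN estimate and its bagged version (3); the closed-form unit-ball volume and the default subsample size are standard conventions assumed here, not taken from the embodiment.

import numpy as np
from math import gamma, pi

def knn_density(x, data, k):
    # k-NN density estimate: k / (n * volume of the ball reaching the k-th neighbour).
    n, d = data.shape
    r_k = max(np.sort(np.linalg.norm(data - x, axis=1))[k - 1], 1e-12)
    unit_ball = pi ** (d / 2) / gamma(d / 2 + 1)     # Lebesgue volume of the unit ball in R^d
    return k / (n * unit_ball * r_k ** d)

def bnn_density(x, data, k=5, B=20, m=None, seed=0):
    # Bagged k-NN: average the k-NN estimate over B subsamples of size m drawn without replacement.
    rng = np.random.default_rng(seed)
    n = len(data)
    m = m or max(k + 1, n // 2)
    subs = (data[rng.choice(n, size=m, replace=False)] for _ in range(B))
    return float(np.mean([knn_density(x, Db, k) for Db in subs]))

# usage sketch: bnn_density(np.zeros(2), np.random.randn(500, 2))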
Bagged Near Neighbor (BNN) for outlier detection
We define anomalies as being characterized by the degree of aggregation, which can be described by the density f. For a fixed threshold ρ > 0, the ρ-level set {f > ρ} represents a region of high aggregation, while {f ≤ ρ} is regarded as the region of low aggregation in which abnormal values lie. Our goal is therefore to estimate the set {f ≤ ρ} in order to detect anomalies among all samples, or equivalently to estimate the ρ-level set {f > ρ}. Using the BNN density estimator in (3), we construct the level-set estimate Sρ := {f̂B > ρ}, and a BNN algorithm for density-based anomaly detection is given below.
Density-based outlier detection algorithm: BNN
Input: training set D := {x1, ..., xn}; the density threshold parameter ρ; the neighbor parameter k; the number B of subsample sets and the number m of samples in each subsample set.
For b = 1, ..., B: sample m points from D without replacement to form the subsample set Db.
Compute the BNN density estimate f̂B according to equation (3) above.
Abnormal values: {xi : f̂B(xi) ≤ ρ}.
fig. 1e shows a random forest anomaly detection model based on an auto-supervised method according to yet another embodiment of the present invention.
The random forest anomaly detection model based on the self-supervised method is a fast and accurate anomaly detection algorithm that combines a framework used to enhance the information acquisition capability in self-supervised learning tasks with a random forest classifier. Specifically, for the feature space in which the data lie, random rotation mappings are constructed to preprocess the data and improve the utilization of the data information, and the applied rotation is attached to the original data as a label, forming new (data, rotation) data pairs. Secondly, through this data construction the original unsupervised anomaly detection task is converted into a supervised classification task, and a random forest model based on classification trees is trained with the rotation label as the target. Finally, based on the theoretical observation that the lower the classification accuracy of the self-supervised classifier on a given sample, the more likely that sample is to be abnormal, the method evaluates the overall prediction accuracy of each sample over the rotation directions and gives the final anomaly detection result.
Referring to fig. 1e, at step 1e01, the inputs are:
the training data D = {x1, ..., xn}, in which all the data are normal samples;
the prediction data set T containing the samples to be detected, in which the abnormal state of the sample points is unknown;
the set of feature-space rotation mappings {R1, ..., RK}; and
the parameter N, the number of abnormal samples to report.
In step 1e03:
1. A rotation mapping is applied to the feature space of each data set, giving K new feature spaces and K corresponding data sets; the applied spatial rotation mapping is added to the corresponding data set as a label, yielding two groups of new (augmented, self-labelled) data sets for training and prediction.
2. A random forest model based on classification trees is used to learn the augmented training data set, obtaining a model M; that is, the training data are used to train a random forest model for classification.
3. The model M is used to predict on the augmented prediction data set TS, and the prediction accuracy of each sample Yi is calculated.
In step 1e05, the output is the N samples to be detected with the lowest prediction accuracy.
sForest is trained only on normal data D := (X1, ..., Xn), where each Xi is an observation drawn independently and identically distributed from P with the same distribution as X. The construction of sForest first generates random rotations of the input data, denoted Rm, m = 1, ..., M. The corresponding augmented data with self-attached labels is written as {(Rm(Xi), Rm)}, where R0 denotes the identity transformation, so that {(R0(Xi), R0)} represents the original data set and simplifies the notation. We then train a random forest classifier on the self-labelled data. In the testing phase, we first rotate the test samples by the pre-generated Rm, m = 1, ..., M, test the rotated samples with the pre-trained forest classifier, and finally identify as anomalies the samples with lower test accuracy.
For two-dimensional image data, self-supervised learning algorithms usually rotate the training data by a fixed set of angles such as {0, 90, 180, 270} and attach the corresponding labels. However, for structured data with higher-dimensional features, these 4 basic rotations gradually show their limitations. We therefore propose to self-label by random rotation, since random rotations provide many more potential rotations, each with the same chance of being selected.
The details are as follows.
Given Rd, a proper rotation matrix R is a real-valued d × d orthogonal matrix with unit determinant, i.e.

RT = R^(-1), |R| = 1.

The set of all such matrices forms the special orthogonal group, which we denote SOrth(d), a subgroup of the orthogonal group Orth(d); the latter also includes the so-called improper rotations involving reflection (determinant equal to -1). More specifically, a matrix in SOrth(d) has determinant |R| = 1, while a matrix in Orth(d) has determinant |R| ∈ {-1, 1}.
To perform random rotation, we must sample uniformly over all possible rotations in SOrth(d). It is worth emphasizing that randomly rotating each angle in spherical coordinates does not result in a uniform distribution over all rotations when d > 2, which means that some rotations would be more likely to occur than others. Instead of such simple rotations, we use "true" uniformly random rotations. Furthermore, since rotation is neither necessary nor clearly defined for categorical variables, random rotation is not applied to those features.
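A common construction for drawing such "true" uniform rotations is the QR decomposition of a Gaussian matrix with a sign correction; the sketch below is an assumed illustration of that construction, not text from the embodiment.

import numpy as np

def uniform_rotation(d, rng=None):
    # Draw a rotation uniformly from SOrth(d) via QR of a Gaussian matrix.
    rng = rng or np.random.default_rng()
    Q, R = np.linalg.qr(rng.normal(size=(d, d)))
    Q = Q * np.sign(np.diag(R))        # sign fix so that Q is Haar-distributed on Orth(d)
    if np.linalg.det(Q) < 0:           # flip one column to obtain determinant +1 (a proper rotation)
        Q[:, 0] = -Q[:, 0]
    return Q

# usage sketch: R = uniform_rotation(8); np.allclose(R @ R.T, np.eye(8))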
We write the augmented (self-labelled) normal data set as

DA := {(Rm(Xi), Rm)}, i ∈ {1, ..., N}, m ∈ {0, ..., M},

where R0 denotes the identity transformation, so that R0(Xi), i = 1, ..., N, represents the original training data.
Now we construct our sForest based on a random forest classifier, which starts by drawing B buckets of the augmented normal samples with the bootstrap method, the b-th sample bucket being denoted Db. A decision tree is then constructed on each bucket, dividing the prediction space into non-overlapping regions Aj, j = 1, ..., J, where J denotes the total number of terminal nodes of the decision tree. For the region Aj containing Nj observations and for (R(Xi), Yi) ∈ Db, we record the proportion of observations at node Aj whose class label equals Rm, where the random parameter R denotes the random rotation and 1(·) is the indicator function used in the counting; Yi denotes the rotation label corresponding to Xi.
We then take this vector of per-node class proportions as the output of the b-th self-supervised tree. Note that it is a probability vector, because its entries sum to 1. The self-supervised forest classifier can then be represented in terms of the outputs of all the trees, and intuitively the m-th component of the combined vector can be regarded as the probability that the sample is classified as class Rm.
In some previous studies it has been observed that outliers tend to obtain a lower probability of belonging to their predicted labels; the intuition is that class-specific features must be captured when the classifier is trained to distinguish the self-labelled data. We now construct the anomaly criterion of our sForest classifier accordingly.
First, we write the test data set as Dt, which includes both normal and abnormal samples. We use Rm, m = 1, ..., M, to generate the augmented test data DA,t, where R0 again denotes the identity transformation for notational simplicity. The well-trained sForest classifier is then applied to DA,t to produce the output vectors; for each test sample Xi these can simply be arranged as a matrix whose rows are indexed by the applied rotation and whose columns by the predicted class.
Note that the diagonal elements of this matrix are the probabilities that each self-labelled rotation R(Xi) is correctly classified. Therefore, in order to capture abnormal points with low classification accuracy, we define the normality score as the sum of these diagonal probabilities. The normality score describes how normal the tested sample is, and we regard the samples with the lowest normality scores as abnormal.
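A hedged end-to-end sketch of the sForest idea using scikit-learn's RandomForestClassifier as the classification-tree forest is given below; the number of rotations, the tree count, and the QR-based rotations are illustrative assumptions rather than the patented configuration.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_sforest(X_train, M=8, seed=0):
    # Train a classifier to predict which random rotation was applied to a (normal) sample.
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    rotations = [np.eye(d)] + [np.linalg.qr(rng.normal(size=(d, d)))[0] for _ in range(M)]
    Xa = np.vstack([X_train @ R.T for R in rotations])            # augmented, self-labelled data
    ya = np.concatenate([np.full(len(X_train), m) for m in range(len(rotations))])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(Xa, ya)
    return clf, rotations

def normality_score(X_test, clf, rotations):
    # Sum over rotations of the probability that the rotated sample is classified correctly.
    score = np.zeros(len(X_test))
    for m, R in enumerate(rotations):
        proba = clf.predict_proba(X_test @ R.T)                   # columns follow clf.classes_
        score += proba[:, list(clf.classes_).index(m)]
    return score                                                  # the lowest scores flag anomalies

# usage sketch: clf, rots = fit_sforest(np.random.randn(500, 6)); s = normality_score(np.random.randn(50, 6), clf, rots)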
Application example of large data density estimation of the invention
The density estimation problem is one of basic problems of probability statistics and is an important research direction of unsupervised learning in statistical machine learning, and plays a key role in intermediate links of a plurality of statistical machine learning tasks. First, density estimation has direct application value in that it can exploit the obtained data density to reveal the essential features and internal structure of the data. For example, in the traffic field, uncertainty in vehicle trajectory prediction is studied by density estimation. Secondly, the density estimation is used as a basic statistical machine learning task, and higher learning tasks such as clustering and anomaly detection can be better solved. For example, density-based clustering is widely used for marketing research, image segmentation, indoor positioning based on WLAN data, etc. because it can find clusters of arbitrary shape and size; density-based anomaly detection methods are often used to address the increasingly severe network intrusion problems caused by globalization of information. Therefore, the algorithm for density estimation and the feasibility theory research thereof have important scientific value not only in the field of statistical machine learning, but also in other fields such as market economy, industrial engineering and the like.
As an example, the invention can be applied to the classification problem concerning the quality of radar echoes. The method uses the Ionosphere radar data set (http://archive.ics.uci.edu/ml/datasets/Ionosphere), and the main task is to estimate, from the radar pulse features in the data set, whether radar echoes show certain structural features in the ionosphere. The data set contains 351 observations, each of which has 34 attributes. The radar data were collected by a radar system in Goose Bay, Labrador, which consists of a phased array of 16 high-frequency antennas with a total transmit power of about 6.4 kW. The examples in this database are function values generated by processing 34 complex electromagnetic signals; specifically, the Goose Bay system receives 17 pulses per experiment, each pulse being described by 2 attributes, for a total of 34 features.
The implementation of the algorithm is illustrated with the integrated density estimation based on adaptive pure random partitioning and on adaptive histogram transformation partitioning. In the experiment, the PCA (principal component analysis) algorithm is first used to reduce the dimensionality of the sample space spanned by the 34 attributes of the test data; then multiple adaptive random divisions are generated on the main attributes after dimensionality reduction to obtain several overall density estimation models, which are averaged to obtain the integrated density estimation model. In the specific experimental setting, for the integrated density estimation model based on adaptive histogram transformation partitioning, 100 histogram transformation divisions are generated at random and integrated by taking the average, the parameter "minimum number of sample points m in each division" is taken from {1, 3, 10, 20, 40}, 30% of the training data is randomly selected as validation data, and the optimal parameter that minimizes the average negative log-likelihood (ANLL) is selected. For the integrated density estimation model based on adaptive pure random partitioning, the number of divisions in the integration and the minimum number of sample points in each division are set in the same way as for the adaptive histogram transformation partitioning.
In a comparison with the prior art on classifying structural features in the ionosphere using the Ionosphere radar data set, the integrated density estimation model based on adaptive pure random partitioning achieves higher prediction accuracy: the negative log-likelihood reaches 0.06, far lower than the 24.36 of the Gaussian kernel density estimation method and the 26.20 of simple histogram density estimation. For dimensions [3, 10, 16, 22], the negative log-likelihood of the integrated density prediction model based on adaptive histogram transform partitioning is [-1.78, -11.28, -18.95, -25.35]; relative to the simple histogram density estimates of [-0.63, -4.9, -8.69, -10.64], this is a reduction of [-1.15, -6.38, -10.26, -14.71] in absolute value, or [54.78%, 76.80%, 84.69%, 72.33%] in relative terms. Relative to the Gaussian kernel density estimates of [-1.50, -7.87, -13.22, -18.78], the reduction is [-0.28, -3.41, -5.73, -6.57] in absolute value, or [18.67%, 43.32%, 43.34%, 34.98%] in relative terms.
The method fully utilizes the randomness of self-adaptive division and the advantages of integrated learning, solves the problem of discontinuous histogram density estimation, and improves the precision of density estimation; the invention can be well combined with parallel computation, not only can adopt a CPU (Central processing Unit) processor, but also can be combined with a GPU (graphics processing Unit) processor, thereby greatly saving the running time, improving the algorithm efficiency, and even processing data with huge data volume and ultrahigh dimensionality.
Besides the processing and analysis of radar data, the invention can also be applied to other density estimation tasks, such as Chinese character recognition in image recognition, dynamic video segmentation, extreme value recognition in an intelligent traffic system, density estimation of streaming data, high-density communication network protocol optimization and the like.
On the other hand, the big data regression analysis of the present invention is mainly composed of two parts: performing multiple adaptive random divisions, and building and integrating the local and overall regression models under the different divisions. The method first generates multiple adaptive random divisions; in each division, a local regression model is constructed using the samples in each division grid, and the local regression models are spliced together to obtain the overall regression model under that random division; finally, the overall regression models under the multiple divisions are integrated. The adaptive random division can adopt adaptive pure random division, adaptive histogram transformation division, random adaptive polygon division, and the like; the local regression model can adopt support vector machine regression (SVR) or a local average method; the integration method of the models can adopt a simple average method, a weighted average method, or the like. The big data regression analysis of the present invention may be embodied as a method or an apparatus.
The big data regression analysis comprises two steps, firstly, carrying out multiple self-adaptive random division on a characteristic space, obtaining a regression model on each division grid during each division, and splicing to obtain an integral regression model; and secondly, integrating all the integral regression models to obtain an integrated model.
Fig. 19 shows a flowchart of a large-scale regression method based on adaptive random partitioning and model integration according to a fifth embodiment of the present invention.
At step 1901, the large data set D requiring training is input. In this embodiment, the number of adaptive random division spaces is T, so T space division operations are required, and the randomly generated adaptive space divisions are completed in a loop from 1 to T. That is, in step 1903, the division counter t is initialized to 1; in step 1905, it is determined whether t is smaller than T. Each division space generates a "tree": if the determination in step 1905 is yes, then in step 1907 an adaptive division of the sample space is randomly generated for the tree, a local regression model is obtained in each division grid, and the local regression models in the grids are pieced together to obtain the t-th overall regression model. Then, in step 1909, t is incremented by 1, and the process returns to step 1905.
If the determination at step 1905 is NO, then at step 1911, the T whole regression models are integrated. At step 1913, the integrated regression model is output.
The adaptive random partitioning in step 1907 may be adaptive pure random partitioning, adaptive histogram transform partitioning, random adaptive polygon partitioning, or the like. The adaptive pure random division and the adaptive histogram transformation division are described in detail with reference to fig. 14 and 15. Wherein the random adaptive polygon partitioning is described in detail with reference to fig. 20.
After the adaptive random divisions are generated, a common regression algorithm is called in each division grid to obtain a local regression model, and the local regression models are spliced into an overall regression model. The following two common regression algorithms are typically used: the local average method and support vector machine regression.
Local averaging method: the local average method is that the average value of the dependent variables of the samples in each divided grid is used as a regression result, and the method is the most intuitive and simple regression model and is mainly suitable for regression of samples with lower data dimensionality and more discrete numerical values. Fig. 21 shows a support vector machine employed in step 1911 of the large scale regression method based on adaptive random partitioning and model integration according to the fifth embodiment of the present invention. FIG. 22 shows the support vector machine regression employed in step 1911 of the large scale regression method based on adaptive random partitioning and model integration according to the fifth embodiment of the present invention.
Support vector machine regression (SVR): a Support Vector Machine (SVM) is a machine learning algorithm suitable for classification tasks, which finds a classification surface in a linear separable case (i.e., where there is a high-dimensional plane separating two types of sample points) such that the minimum distance from the two types of sample points to the classification surface is maximized. The sample points closest to the optimal classification surface are called "support vectors" (see fig. 21 (left)), which can determine the optimal classification surface without requiring all samples. In the linear inseparable case, the sample data may be mapped into a higher dimensional space such that the data is linearly separable in this high dimensional space. By the 'kernel method', the model can be calculated and the result can be obtained without actually finding the concrete expression of mapping.
Support vector machine regression (SVR) is a generalization of the support vector machine, and solves the regression task using similar ideas, and obtains the regression result by measuring the distance from the sample to the hyperplane (as shown in fig. 22). The regression of the support vector machine is mainly suitable for regression of samples with low data dimensionality and good numerical continuity.
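As a hedged illustration of how a local SVR can be fitted inside each division grid and spliced into an overall model, the sketch below assumes that the grid labels come from any of the adaptive partition sketches above and uses scikit-learn's SVR; the min_pts fallback to a local average is an assumed safeguard for very small grids.

import numpy as np
from sklearn.svm import SVR

def fit_local_svr(X, y, labels, min_pts=10):
    # Fit one SVR per division grid; very small grids fall back to the local average.
    models = {}
    for cell in np.unique(labels):
        idx = labels == cell
        if idx.sum() >= min_pts:
            models[cell] = SVR(kernel="rbf").fit(X[idx], y[idx])
        else:
            models[cell] = float(y[idx].mean())
    return models

def predict_local(x, cell, models):
    # Splice the local models: delegate to the grid's SVR or return its stored average.
    m = models[cell]
    return float(m.predict(x.reshape(1, -1))[0]) if hasattr(m, "predict") else m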
Fig. 19 ' (a), 19 ' (b), and 19 ' (c) illustrate specific examples of a large-scale regression method based on adaptive random partitioning and model integration according to a fifth embodiment of the present invention.
Fig. 19 ' (a), 19 ' (b), and 19 ' (c) show the cases of the integrated regression model based on 1, 2, and 3 random divisions, respectively, in order to compare the continuity and prediction accuracy of the model. FIG. 19' (a) shows a regression model based on a single stochastic partition, with poor prediction accuracy and model continuity;
FIG. 19' (b) shows that two random partitions are generated, and the discontinuous regression models obtained from the two partitions are subjected to average integration, so that the continuity of the integrated model is improved compared with that of the regression model based on a single partition; fig. 19' (c) shows an ensemble learning model based on cubic stochastic partitioning, and the prediction accuracy and model continuity are further improved. If the number of times of random division is further increased, the integrated model gradually tends to be continuous until the problem that the regression model is discontinuous on the division boundary is solved, and satisfactory regression prediction precision is achieved.
Fig. 20 shows a flowchart of the adaptive polygon partition method adopted in step 1907 of the large-scale regression method based on adaptive random partition and model integration according to the fifth embodiment of the present invention.
Referring to fig. 20, in step 2001, the data set D to be trained is input, the number of control points is set to m, and the adaptive polygon division operation is completed in a loop from 1 to m. That is, in step 2003, the counter i of extracted sample points is initialized to 1; in step 2005, it is determined whether i is smaller than m, and if the determination in step 2005 is yes, then in step 2007 one sample point is extracted with equal probability from the training data not yet extracted and used as a control point. Then, in step 2009, i is increased by 1, and the process returns to step 2005.
If the determination at step 2005 is negative, then at step 2011, the adaptive random polygon partitioning result is output.
With regard to model integration, each time the adaptive random partitioning is performed, a corresponding integral regression model is obtained, and a plurality of integral regression models can be integrated by adopting a plurality of integration methods. The most common is to take the average of all the whole regression models as the integrated model. Other methods, such as weighted average, are to learn the weights of the regression models of different partitions in the integrated model and then integrate the models.
In addition to the integrated model obtained by Parallel integration (Parallel Ensemble), a Sequential Ensemble (Sequential Ensemble), that is, a Boosting Algorithm (Boosting Algorithm), may be used to perform multiple iterations under fixed division to generate an integrated model. For example, the histogram transformation partition may be used as a spatial partition method, the rotation, stretch, and translation transformations of the data input space may be randomly generated, and the histogram partition method may be used in the transformed data space. In each divided grid, local regression estimation is performed by using an averaging method, and then all local regression models are combined into an overall regression model. Therefore, a lifting algorithm is introduced, and the residual value of each sample point under the last regression model is firstly calculated as a new target value. After obtaining a new sample, randomly generating rotation, stretching and translation transformation of the space again, dividing the transformed space again by using a histogram division method, and estimating each grid by using simple average estimation. After all local regression models are combined, a second global regression model is obtained. And taking the residual value of each sample point under the second regression model as the training target value again, and circularly executing the steps. Since our regression model always estimates the residual error of the last model, starting from the second regression model, the overall regression model of the whole algorithm is the sum of the regression models of each time.
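A hedged sketch of the sequential (boosting) variant just described follows: in each round a fresh random histogram-transform partition is generated, the grid means of the current residuals form the local models, and the overall model is the sum of all rounds; the transform parameters and the zero fallback for unseen grids are assumptions.

import numpy as np

def fit_boosted_regressor(X, y, T=20, seed=0):
    # Each round: random histogram-transform grids, then grid means fitted to the current residuals.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    residual, stages = y.astype(float), []
    for _ in range(T):
        R = np.linalg.qr(rng.normal(size=(d, d)))[0]
        s, b = rng.uniform(1.0, 3.0, size=d), rng.uniform(size=d)
        A = R @ np.diag(s)
        keys = [tuple(k) for k in np.floor(X @ A.T + b).astype(int)]
        groups = {}
        for key, r in zip(keys, residual):
            groups.setdefault(key, []).append(r)
        means = {key: float(np.mean(v)) for key, v in groups.items()}
        stages.append((A, b, means))
        residual = residual - np.array([means[key] for key in keys])   # the next round fits what is left
    def predict(x):
        # The overall regression model is the sum of the per-round local models.
        return sum(means.get(tuple(np.floor(A @ x + b).astype(int)), 0.0) for A, b, means in stages)
    return predict

# usage sketch: g = fit_boosted_regressor(np.random.randn(800, 2), np.random.randn(800)); g(np.zeros(2))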
Fig. 23 is a flowchart illustrating a large-scale regression method based on histogram transform partitioning and a boosting algorithm according to a sixth embodiment of the present invention.
At step 2301, a large data set D that needs to be trained is input. In this embodiment, the total number of iterations of the boosting algorithm is set to m, so that m iterations are required, and the operation of adding up the regression models of all iterations to obtain the overall regression model is completed in a loop from 1 to m. That is, in step 2303, the iteration counter i of the boosting algorithm is initialized to 1; in step 2305, it is determined whether i is smaller than m, and if the determination at step 2305 is yes, then in step 2307, the space is partitioned using the random histogram transform partition method, a local estimation model is obtained in each partition cell using the average estimate, the local models are combined into an overall regression model, and the residual of each sample point is calculated and used as the new target value to form a new data pair. Then, in step 2309, i is incremented by 1, and the process returns to step 2305.
If the determination at step 2305 is negative, then at step 2311, the regression models of all iterations are summed to obtain the overall regression model. At step 2313, the regression model based on the boosting algorithm is output.
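The loop of fig. 23 can be sketched as follows. Here `partition_fn` stands in for the random histogram transform partition described above, and starting the residuals at y itself (i.e., an initial model of zero) is an assumption made for illustration.

```python
import numpy as np

def boosted_histogram_regression(X, y, m, partition_fn, rng=None):
    """Sketch of steps 2301-2313: m boosting iterations, each fitting a
    piecewise-constant model to the current residuals on a fresh random
    histogram-transform partition, then summing the per-iteration models."""
    rng = np.random.default_rng(rng)
    residual = y.copy()            # assumed initial model f_0 = 0
    stages = []                    # per-iteration (cell labels, cell means)
    for _ in range(m):             # steps 2305-2309
        cells = partition_fn(X, rng)                       # step 2307: random partition
        means = {c: residual[cells == c].mean() for c in np.unique(cells)}
        fitted = np.array([means[c] for c in cells])
        stages.append((cells, means))
        residual = residual - fitted                        # new target values
    return stages                  # step 2311: overall model = sum of the stages
```

Only the training-time flow is shown; predicting at new points would additionally require storing each iteration's partition so that the per-cell means can be looked up and summed.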
Fig. 24 is a block diagram illustrating a large-scale regression apparatus based on adaptive random partitioning and model integration according to a seventh embodiment of the present invention.
The large-scale regression device according to the seventh embodiment of the present invention implements the large-scale regression method according to the fifth embodiment described with reference to fig. 19.
Referring to fig. 24, the large-scale regression apparatus according to the seventh embodiment of the present invention includes a data input module 2401, a sample adaptive partitioning and regression model calculation module 2403, and a regression model integration module 2405.
At data input module 2401, a large data set D that needs to be trained is input. In this embodiment, the number of adaptive random space partitions is set to T, so T space partition operations are required, and the randomly generated adaptive space partition operation is completed in a loop from 1 to T.
In the sample adaptive partitioning and regression model calculation module 2403, an adaptive partition of the sample space is randomly generated each time, a local regression model is obtained in each partition cell, and the local regression models of all cells are spliced together to obtain the overall regression model of that partition.
In the regression model integration module 2405, the T integral regression models are integrated, and the integrated regression model is output.
The adaptive random partition may adopt adaptive pure random partitioning, adaptive histogram transform partitioning, random adaptive polygon partitioning, and the like. The adaptive pure random partitioning and the adaptive histogram transform partitioning are described in detail with reference to figs. 14 and 15, and the random adaptive polygon partitioning is described in detail with reference to fig. 20.
After the adaptive random partition is generated, a common regression algorithm is called in each partition cell to obtain a local regression model, and the local regression models are spliced into an overall regression model. Two common choices are the local mean and support vector machine regression, as sketched below.
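A sketch of how local regression models (here, scikit-learn's SVR, or a simple local mean) might be fitted inside each partition cell and spliced into one overall model. The `cell_of` lookup is an assumed abstraction standing in for whichever adaptive random partition was generated; it maps a single sample to an integer cell index.

```python
import numpy as np
from sklearn.svm import SVR

class PiecewiseRegressor:
    """Fits one local model per partition cell and splices them together."""

    def __init__(self, cell_of, use_svr=True, C=1.0, gamma="scale"):
        self.cell_of = cell_of          # maps a sample vector to its integer cell index
        self.use_svr = use_svr
        self.C, self.gamma = C, gamma
        self.local_models = {}

    def fit(self, X, y):
        cells = np.array([self.cell_of(x) for x in X])
        for c in np.unique(cells):
            Xc, yc = X[cells == c], y[cells == c]
            if self.use_svr and len(yc) > 1:
                self.local_models[c] = SVR(C=self.C, gamma=self.gamma).fit(Xc, yc)
            else:                        # local mean fallback for tiny cells
                self.local_models[c] = float(yc.mean())
        return self

    def predict(self, X):
        # Assumes every test point falls in a cell seen during training.
        out = np.empty(len(X))
        for i, x in enumerate(X):
            m = self.local_models[self.cell_of(x)]
            out[i] = m if isinstance(m, float) else m.predict(x[None, :])[0]
        return out
```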
Fig. 25 shows a block diagram of a large-scale regression device based on a histogram transformation partitioning and boosting algorithm according to an eighth embodiment of the present invention.
The large-scale regression device according to the eighth embodiment of the present invention implements the large-scale regression method according to the sixth embodiment described with reference to fig. 23.
Referring to fig. 25, the large-scale regression apparatus according to the eighth embodiment of the present invention includes a data input module 2501, a sample adaptive partitioning and regression model calculation module 2503, and a regression model integration module 2505.
The regression task is to predict the value of an unobserved output variable Y based on the observed input variable X. More precisely, we need to train a predictor f that maps observed input values of X to the unobserved output variable Y in the form f(X). Throughout, we assume that $\mathcal{X} \subset \mathbb{R}^d$ is non-empty, that $\mathcal{Y} := [-M, M]$ for some $M > 0$, and that $P_X$ is the marginal distribution of X. For an arbitrary fixed $R > 0$, we denote by $B_R$ the hypercube of size 2R centered at the origin, that is,

$$B_R := [-R, R]^d.$$
the extreme random tree is explained below
Mathematically speaking, let the random variable Z be in space
Figure BDA0002533678600000397
The probability measure of the partition criterion of a tree of middle value is PzAnd (4) showing. Since the tree is cut through the space
Figure BDA0002533678600000392
Therefore we will be at BRThe node created by the p cuts above is represented as
Figure BDA0002533678600000393
Wherein A isjRepresenting the jth node. We further represent the tree as
Figure BDA0002533678600000394
I.e. the set of all leaf nodes.
[2]The proposed extreme random tree segments root nodes (feature spaces) by randomly selecting nodes to be segmented, dimensions to be segmented and segmentation points. The tree partition of p cuts mayBy an iterative algorithm, where the i-th step (i ═ 1.. p.) can use a random vector Qi:=(Li,Ri,Si) A description will be given. First item LiIndicating the node to be switched, which is randomly selected with equal probability from the previously generated nodes. Second term RiUnif { 1. -, d } represents LiI.e. uniformly selecting R from all dimensions with equal probabilityi. Third item Si~Unif[0,1]Describing the point to be cut, using the newly generated node L after the ith cutiR of (A) to (B)iLength in dimension and LiIs expressed by the ratio of the lengths of (a) to (b). It is to be noted that,
Figure BDA00025336786000004013
and
Figure BDA00025336786000004014
independently and equally distributed.
We introduce a regularized empirical risk minimization (RERM) framework for algorithm design. Let $D := \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ be independent and identically distributed observations with the same distribution as the generic random pair (X, Y), drawn from an unknown probability measure P on $\mathcal{X} \times \mathcal{Y}$. Let

$$L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$$

be a loss function. For a measurable function $f : \mathcal{X} \to \mathbb{R}$, the risk is defined by

$$\mathcal{R}_{L,P}(f) := \int_{\mathcal{X} \times \mathcal{Y}} L(x, y, f(x)) \, dP(x, y).$$

Furthermore, the Bayes risk is given by

$$\mathcal{R}_{L,P}^{*} := \inf \{ \mathcal{R}_{L,P}(f) : f : \mathcal{X} \to \mathbb{R} \ \text{measurable} \},$$

and the corresponding minimizer $f_{L,P}^{*}$ is referred to as the Bayes decision function. The empirical risk is defined by

$$\mathcal{R}_{L,D}(f) := \frac{1}{n} \sum_{i=1}^{n} L(X_i, Y_i, f(X_i)),$$

where $D_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{(X_i, Y_i)}$ is the empirical measure associated with the data and $\delta_{(X_i, Y_i)}$ is the Dirac measure at $(X_i, Y_i)$.

Let $\Omega$ be a regularization term. The regression problem can then be formulated as

$$f_D := \operatorname*{arg\,min}_{f \in \mathcal{F}} \; \Omega(f) + \mathcal{R}_{L,D}(f),$$

where $\mathcal{F}$ denotes the underlying function space.
Loss function L. As a general framework, our approach is applicable to a variety of supervised tasks with different loss functions. Here, we use the least squares loss $L(y, f(x)) := (y - f(x))^2$ to solve the least squares regression problem.
Function space $\mathcal{F}$. The appropriate base function space $\mathcal{F}$ should be selected according to the characteristics of the underlying data set.

For low-dimensional data sets, we use our algorithm together with the base function space consisting of step functions on the extremely randomized trees, where the cutting criteria $Z_t \sim P_Z$ are drawn independently for $t = 1, \ldots, T$.

In contrast, for high-dimensional data sets, constant functions may not have sufficient representational capacity. Therefore, we introduce the reproducing kernel Hilbert space (RKHS) of a Gaussian radial basis kernel and use the joint RKHS built on the tree cells as the base function space. See section B.3.1 of the supplementary material for more details.
The regularization term Ω. The regularization term depends on the predictor of each tree.

For a forest of constant functions, we choose to penalize the number of splits p, which allows us to bound the complexity of the function space and hence obtain a finite VC dimension, making the algorithm PAC-learnable. In addition, it prevents overfitting by keeping the nodes from becoming too small. For t = 1, ..., T, the regularization term of the t-th tree penalizes its number of splits with a regularization parameter $\lambda_t$.

For a forest of kernel functions, on the one hand, since the joint expansion depends on the tree-cutting rule $Z_t$, it is likewise wise to penalize p, for reasons similar to the forest of constant functions. On the other hand, to avoid overfitting, we also require that the form of the regression function f not be too complex. Therefore, we additionally penalize the RKHS norm of f, and the regularization term of the t-th tree, t = 1, ..., T, involves two regularization parameters $\lambda_{1,t}$ and $\lambda_{2,t}$.
Embedded random forest (mForest)

As an ensemble of T mTree estimators, mForest is obtained through the RERM framework. The generation process is given by an algorithm that appears as an image in the original and is not reproduced here.

cmTree & cmForest. The t-th cmTree regression predictor $f_{D,t}$ is obtained by regularized empirical risk minimization over the step functions on the t-th tree, where $\lambda_t$ represents the regularization parameter and $p_t$ is the number of cuts of the t-th tree under cutting criterion $Z_t$. By averaging the cmTree estimators, we obtain the cmForest regression predictor

$$f_D^{\mathrm{cmForest}} := \frac{1}{T} \sum_{t=1}^{T} f_{D,t}.$$

kmTree & kmForest. The kmTree estimator is obtained analogously by regularized empirical risk minimization over the joint RKHS induced by the t-th tree, and the kmForest regressor is the average of the T kmTree estimators.
At data input module 2501, a large data set D that needs to be trained is input. In this embodiment, the total number of iterations of the boosting algorithm is set to m, so that m iterations are required, and the operation of adding up the regression models of all iterations to obtain the overall regression model is completed in a loop from 1 to m.

The iterative operation is performed m times in the sample adaptive partitioning and regression model calculation module 2503. In each iteration, a random histogram transform partition method is used to partition the space, a local estimation model is obtained in each partition cell using the average estimate, the local models are combined into an overall regression model, and the residual of each sample point is calculated and used as the new target value to form a new data pair.

In the regression model integration module 2505, the regression models of all iterations are summed to obtain the overall regression model, and the regression model based on the boosting algorithm is output.
Regression predicts the value of the unobserved output variable Y based on a data set $D := \{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn from an unknown probability measure P on $\mathcal{X} \times \mathcal{Y}$. In this context, we assume that $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} \subset \mathbb{R}$ are non-empty compact sets.

For any fixed R > 0, we denote by $B_R$ the hypercube of size 2R centered at the origin of $\mathbb{R}^d$, i.e.,

$$B_R := [-R, R]^d := \{ x = (x_1, \ldots, x_d) \in \mathbb{R}^d : x_i \in [-R, R], \ i = 1, \ldots, d \},$$

and for any $r \in (0, R)$ we write a corresponding inner hypercube (its definition appears as an equation image in the original). Recall that, for $1 \le p < \infty$ and $x = (x_1, \ldots, x_d)$, the $L_p$ norm is defined by $\|x\|_p := (|x_1|^p + \cdots + |x_d|^p)^{1/p}$, and the $L_\infty$ norm by $\|x\|_\infty := \max_{i = 1, \ldots, d} |x_i|$.

We use the notations $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ to indicate that there exist positive constants c and c' such that $a_n \le c\, b_n$ and $a_n \ge c'\, b_n$, respectively, for all $n \in \mathbb{N}$. Furthermore, for $x \in \mathbb{R}$, $\lfloor x \rfloor$ denotes the largest integer less than or equal to x. In the following, standard multi-index notation for elements of $\mathbb{R}^d$ is used frequently; the corresponding definitions appear as equation images in the original.
least squares regression
Taking into account least squares losses
Figure BDA0002533678600000432
L(x,y,f(x)):=(y-f(x))2. For measurable decision function
Figure BDA0002533678600000433
Risks are caused by
Figure BDA00025336786000004310
Definition, empirical risk is
Figure BDA0002533678600000434
To be defined. Bayesian risk is the minimum risk with respect to P and L, consisting of
Figure BDA0002533678600000435
It is given.
In the following, the values considered are in the interval [ -M, M]The predictive variable of (2) is sufficient. To this end, we introduce the concept of clipping (clipping) for the decision function, let us
Figure BDA00025336786000004311
Is the clipping value of t in R, if t < -M, the value is-M, if t is in [ -M, M]The value is t, and if t is larger than M, the value is M. The minimum square loss L is tailorable at M. After cutting, the risk is reduced, i.e.
Figure BDA0002533678600000436
Hence, in the following, we only consider the clipping of the decision function and the corresponding risk.
Histogram transform in the regression problem

To clearly describe one possible construction of the histogram transform, we introduce a random vector (R, S, b), whose components represent a rotation matrix, a stretching matrix, and a translation vector, respectively. Specifically:

R denotes the rotation matrix, a real d × d orthogonal matrix with determinant 1, i.e.,

$$R^{\top} = R^{-1}, \qquad \det(R) = 1.$$

S denotes the stretching matrix, a positive real-valued d × d diagonal scaling matrix whose diagonal elements $(s_i)_{i=1}^{d}$ are random variables. Furthermore, the bin width vector defined in the input space is given by $h := s^{-1}$, i.e., $h_i = 1/s_i$ for $i = 1, \ldots, d$.

$b \in [0, 1]^d$ is a d-dimensional translation vector.

Based on the above representation, we define the histogram transform $H : \mathbb{R}^d \to \mathbb{R}^d$ by

$$H(x) := R \cdot S \cdot x + b.$$
It is worth mentioning that we need not consider the case where the bin width $h_0$ in the transformed space differs from 1, since the same effect can be achieved by rescaling the transform matrix. Thus, writing $\lfloor H(x) \rfloor$ for the index of the transformed cell, the transformed cell is the unit bin

$$A'_{H(x)} := \{ H(x') : \lfloor H(x') \rfloor = \lfloor H(x) \rfloor \},$$

and the corresponding histogram cell in the input space containing x is

$$A_{H}(x) := \{ x' : H(x') \in A'_{H(x)} \}.$$

We further denote the set of all cells induced by H as $\pi_H := \{A_j\}_{j \in \mathcal{I}_H}$, where repeated cells are counted only once and $\mathcal{I}_H$ is an index set for H restricted to $B_R$. The collection $\{A_j\}_{j \in \mathcal{I}_H}$ thus obtained forms a partition of $B_R$. For convenience, we write $A_0$ for the remaining region $\mathbb{R}^d \setminus B_R$; then $\{A_j\}_{j \in \mathcal{I}_H \cup \{0\}}$ forms a partition of $\mathbb{R}^d$.
We now present a practical way to construct a histogram transform. First, a d × d matrix M consisting of d² independent univariate standard normal random variables is generated, and a Householder QR decomposition is then applied to obtain a factorization of the form M = R · W, where R is an orthogonal matrix and W is an upper triangular matrix with positive diagonal elements. The matrix R constructed in this way is orthogonal and follows the uniform distribution. If R does not have a positive determinant, it is not a proper rotation matrix; in that case, we can change the sign of the first column of R to construct a new rotation matrix $R^{+}$ that satisfies the condition.

We build the diagonal scaling matrix S with diagonal elements $s_i$ drawn from the Jeffreys prior, i.e., $\log(s_i)$ follows the uniform distribution on the interval $[\log \underline{s}_0, \log \overline{s}_0]$, where $\underline{s}_0$ and $\overline{s}_0$ are fixed constants. To simplify the notation, we also write the corresponding bin-width bounds $\overline{h}_0 := \underline{s}_0^{-1}$ and $\underline{h}_0 := \overline{s}_0^{-1}$. Furthermore, the translation vector b is drawn from the uniform distribution on the hypercube $[0, 1]^d$.
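A sketch of the construction just described, under the stated assumptions: R from the QR factorization of a standard normal matrix (sign-corrected to a proper rotation), diagonal scales with $\log s_i$ uniform on $[\log \underline{s}_0, \log \overline{s}_0]$, and b uniform on $[0, 1]^d$. The bin index as the component-wise floor of H(x) follows the unit-width convention above.

```python
import numpy as np

def sample_histogram_transform(d, s_min, s_max, rng=None):
    """Draw one histogram transform H(x) = R @ S @ x + b."""
    rng = np.random.default_rng(rng)
    M = rng.standard_normal((d, d))
    R, W = np.linalg.qr(M)                     # M = R @ W, W upper triangular
    R = R @ np.diag(np.sign(np.diag(W)))       # make the diagonal of W positive
    if np.linalg.det(R) < 0:                   # force a proper rotation matrix
        R[:, 0] = -R[:, 0]
    s = np.exp(rng.uniform(np.log(s_min), np.log(s_max), size=d))  # log-uniform scales
    b = rng.uniform(0.0, 1.0, size=d)          # translation vector

    def H(x):
        return R @ (s * np.asarray(x, float)) + b

    def bin_index(x):
        return tuple(np.floor(H(x)).astype(int))   # index of the transformed cell

    return H, bin_index
```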
Given a histogram transform H, the set $\pi_H = \{A_j\}_{j \in \mathcal{I}_H}$ forms a partition of $B_R$. We consider the set of functions that are piecewise constant on the cells of $\pi_H$ (its formal definition appears as an equation image in the original). To limit the complexity of this function set, we impose a penalty on the bin width $h$ of the partition $\pi_H$. Then we can obtain histogram transform regression (HTR) by regularized empirical risk minimization (RERM) over this function set; the resulting optimization problem and its regularization term appear as equation images in the original. It is worth noting that, to simplify the computation, we apply the same penalty to all dimensions instead of penalizing each element $h_1, \ldots, h_d$ separately.
$L_2$-regularized boosted histogram transform

Boosting is the task of converting multiple inaccurate weak learners into a single accurate predictor. Specifically, we define a set of functions $\mathcal{F}$ as the set of base learners; a general boosting algorithm combines functions $f_t \in \mathcal{F}$ so as to minimize the empirical loss. The final predictor can be expressed as

$$f := \sum_{t=1}^{T} \omega_t f_t,$$

where $f_t \in \mathcal{F}$ and $\omega_t \ge 0$, $t = 1, \ldots, T$, are weights. From the viewpoint of functional gradient descent in statistics, boosting can be reformulated as a stagewise optimization problem with different loss functions. In this case, gradient boosting requires computing the negative functional gradient of the loss at the current predictor and selecting a particular model from the allowed function class to update the predictor at each boosting iteration.
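For the least squares loss used here, the negative functional gradient at the current predictor is (up to a factor of 2) simply the residual, which is why each boosting step fits the residuals:

```latex
-\left.\frac{\partial L(y, s)}{\partial s}\right|_{s = f^{(t-1)}(x)}
  = -\left.\frac{\partial (y - s)^2}{\partial s}\right|_{s = f^{(t-1)}(x)}
  = 2\bigl(y - f^{(t-1)}(x)\bigr).
```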
In this work, we focus mainly on boosting algorithms with histogram transform regressors as the base learners, because they are weak predictors and computationally efficient. Before continuing, we introduce the function space of most interest to us for building our learning theory. Suppose $\{H_t\}_{t \ge 1}$ is a sequence of independent histogram transforms drawn from a certain probability measure $P_H$, with the associated piecewise-constant function sets as defined in (7). We can then define the space E of countable combinations of such functions, together with a norm $\|\cdot\|_E$; their formal definitions appear as equation images in the original. For any $f \in E$, the Cauchy–Schwarz inequality immediately yields a uniform bound on f in terms of $\|f\|_E$ (the bound also appears as an equation image in the original). In fact, $(E, \|\cdot\|_E)$ is a function space consisting of measurable bounded functions. M is a constant.
As mentioned above, the boosting method can be viewed as an iterative method for optimizing a convex empirical loss function.
Definition 1. Let E be the function space in (9) and L be the least squares loss. Given $\lambda_1 > 0$ and $\lambda_2 > 0$, we call a learning method a boosted histogram transform regression (BHTR) algorithm with respect to E and L if, for every data set D, the method assigns a function $f_D \in E$ such that

$$f_D := \operatorname*{arg\,min}_{f \in E} \; \Omega_\lambda(f) + \mathcal{R}_{L,D}(f),$$

where the regularization term $\Omega_\lambda(f)$ consists of two parts (its explicit form appears as an equation image in the original). The motivation for the first term is the fact that early boosting methods, such as AdaBoost, may overfit in the presence of label noise; an $L_2$ norm on the estimator weights controls the degree of overfitting so as to achieve consistency and convergence rates. The second term is added to control the bin widths of the histogram transforms, which in fact corresponds to penalizing an $L_p$ norm of the base learners $f_t$, since their capacity does not exceed that of piecewise constant functions on the cells of the corresponding partitions.
For the theoretical analysis we also need an infinite-sample version of Definition 1. To this end, we fix a distribution P on $\mathcal{X} \times \mathcal{Y}$ and let the function space E be as in (9). Every $f \in E$ satisfying the corresponding infinite-sample (population) minimization problem, whose form appears as an equation image in the original, is referred to as an infinite-sample version of BHTR with respect to E and L. Furthermore, the approximation error function $A(\lambda)$ is defined accordingly (the definition also appears as an equation image in the original).
With all of these preparations, we now present a general form of the BHTR algorithm. In fact, the randomness of the histogram transform provides an efficient procedure for performing boosting. With the help of HTR, we repeatedly fit the residuals by least squares. Furthermore, we introduce a learning rate ρ to shrink the gradient descent updates, which is related to regularization by shrinkage.
Algorithm: boosted histogram transform for regression (BHTR)

Input: training set $D := \{(x_1, y_1), \ldots, (x_n, y_n)\}$; bandwidth parameters; learning rate ρ.
Initialization: set the initial predictor (the initialization formula appears as an equation image in the original).
For t = 1 to T:
  generate a random affine transformation matrix $H_t$ (rotation, stretching, translation) and apply the induced data-independent partition to the transformed sample space;
  apply a constant function on each cell, i.e., fit the function $f_t$ to the current residuals so that $f_t$ belongs to the piecewise-constant function set of $H_t$ as defined in (7);
  update the predictor: $f^{(t)} := f^{(t-1)} + \rho f_t$;
  compute the residuals $y_i - f^{(t)}(x_i)$.
End for.
Output: the boosted histogram transform regression estimator $f^{(T)}$.
Application example of big data regression of the invention
The present invention is applied to the song release year prediction problem as an example. The data set adopted is the "year prediction song data set" (YearPredictionMSD, http://archive.ics.uci.edu/ml/datasets/YearPredictionMSD), and the main task is to predict the release year of a song from the audio features in the data set. This data set is a subset of the Million Song Dataset (http://millionsongdataset.com/) containing 463,715 training samples and 51,630 test samples, each observation having 90 attributes characterizing the timbre of songs released from 1922 to 2011. These features are obtained by computing mel-frequency cepstral coefficients (MFCCs) for the discretized audio sequence. Mel-frequency cepstral coefficients are often used in tasks such as speech recognition and audio information retrieval to extract audio features; the audio features can also be extracted by other methods, such as PNCC.
The implementation of the algorithm is illustrated with a support vector machine regression ensemble based on adaptive pure random partitioning and a support vector machine regression ensemble based on adaptive histogram transform partitioning. In the experiment, multiple adaptive random partitions are generated over the 90 attributes, yielding several overall regression models, whose average is taken as the integrated regression model. In the specific experimental setting, the support vector machine regression ensemble based on adaptive pure random partitioning combines 10 adaptive random partitions, with the sample space divided into 200 cells in each partition; the regularization parameter of the support vector machine regression model is selected from {0.01, 1, 100} and the bandwidth parameter of the Gaussian kernel from {0.001, 0.1, 10}, where for both parameters 30% of the training data is randomly selected as a validation set and the optimal parameters are selected automatically. For the support vector machine regression ensemble based on adaptive histogram transform partitioning, 5 histogram transform partitions are generated at random and integrated using the average; the corresponding parameter m is set to 2000, i.e., partitioning stops once every cell contains fewer than 2000 sample points, and the grids for the regularization parameter and the Gaussian kernel bandwidth in the support vector machine regression model are set to the same values as for adaptive pure random partitioning.
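The parameter selection described above might be expressed as follows; the helper name `select_svr_params`, the use of scikit-learn's SVR, and the plain grid search on a random 30% validation split are illustrative assumptions rather than the patent's prescribed implementation.

```python
import numpy as np
from sklearn.svm import SVR

# Settings from the experiment: candidate SVR parameters and a 30% validation split.
C_GRID = [0.01, 1, 100]
GAMMA_GRID = [0.001, 0.1, 10]

def select_svr_params(X, y, rng=None):
    """Pick (C, gamma) by mean squared error on a random 30% validation split."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))
    n_val = int(0.3 * len(X))
    val, tr = idx[:n_val], idx[n_val:]
    best, best_err = None, np.inf
    for C in C_GRID:
        for gamma in GAMMA_GRID:
            model = SVR(C=C, gamma=gamma).fit(X[tr], y[tr])
            err = np.mean((model.predict(X[val]) - y[val]) ** 2)
            if err < best_err:
                best, best_err = (C, gamma), err
    return best
```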
In a comparison with the prior art on the song release year prediction experiment using the "year prediction song data set", the support vector machine regression model based on adaptive pure random partitioning achieves higher prediction accuracy and faster running speed: its mean square error reaches 81.11, which relative to the polygon-partition support vector machine method (85.10) is a reduction of 3.99 in absolute terms and 4.7% in relative terms; its running time is 327 seconds, which relative to the polygon-partition support vector machine method (419 seconds) is an improvement of 92 seconds in absolute terms and 22% in relative terms. The prediction mean square error of the adaptive histogram transform ensemble model is 83.82 with a running time of 386 seconds; compared with the polygon-partition support vector machine method (85.10), its mean square error is reduced by 1.28 in absolute terms and 1.5% in relative terms. The stitched Gaussian process spatial interpolation method cannot exploit parallel computation, so its running time is excessive (more than 36 hours) and it cannot produce predictions for the "year prediction song data set".
In addition, taking the local average regression ensemble based on adaptive pure random partitioning as an example, the effect of the method on model continuity is examined through a simulation experiment. Let the data follow y = sin(x) + e, where the independent variable x follows the uniform distribution U(0, 10) and the random perturbation term e follows the normal distribution N(0, 0.2). In the experiment, 50,000 samples are randomly generated, and the method is compared with the polygon-partition support vector machine method and the stitched Gaussian process spatial interpolation method with respect to the discontinuity problem at partition boundaries; a data-generation sketch follows below.
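The simulated data can be generated as below, assuming the stated N(0, 0.2) refers to the standard deviation of the noise term (the text does not say whether 0.2 is the variance or the standard deviation).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.uniform(0.0, 10.0, size=n)    # x ~ U(0, 10)
eps = rng.normal(0.0, 0.2, size=n)    # e ~ N(0, 0.2); 0.2 taken as the std here
y = np.sin(x) + eps                   # y = sin(x) + e
```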
Fig. 26 shows simulation experiments on the simulated data using stitched Gaussian process spatial interpolation regression (left) and polygon-partition support vector machine regression (right). As shown in fig. 26, clearly discontinuous locations can be found in the regression results of both the stitched Gaussian process spatial interpolation method (fig. 26, left) and the polygon-partition support vector machine method (fig. 26, right); three typical locations are selected and shown enlarged on the right side.
Fig. 27 shows a simulation experiment in which, for support vector machine regression based on random histogram transform partitioning, continuity gradually increases as the number of random partitions T increases. Observing the regression models obtained by the invention in fig. 27, as the number of generated random partitions T increases, the regression model gradually becomes continuous and smooth, thereby ensuring the accuracy of the regression prediction.
The invention makes full use of the randomness of adaptive partitioning and the advantages of ensemble learning, solves the problem of discontinuous boundaries, and improves the accuracy of regression prediction. The invention also combines well with parallel computing: it can run on a CPU alone or together with a GPU, greatly reducing running time and improving algorithm efficiency, and it can even handle data of enormous volume and extremely high dimensionality.
Besides processing and analyzing audio data, such as voice recognition, audio information retrieval and the like, the invention can also be applied to other large-scale regression tasks, such as an age prediction task in image recognition, position prediction of a 5G terminal, 5G wireless network flow prediction, 5G mobile communication network planning and the like.
To facilitate understanding, exemplary embodiments and applications of the big data density estimation and large-scale regression methods based on adaptive random partitioning and model integration according to the present invention have been described and illustrated in the accompanying drawings. It should be understood, however, that these exemplary embodiments are only intended to illustrate the invention and not to limit its scope, and that the invention is not limited to the exemplary embodiments shown and described. Various modifications to the exemplary embodiments will be readily apparent to those skilled in the art.

Claims (11)

1. A data density estimation method, comprising the steps of:
generating a plurality of times of self-adaptive random division for the data set;
in each division, constructing a local density estimation model by using the samples in each division grid respectively;
splicing the local density estimation models together to obtain an overall density estimation model under random division; and integrating the overall density estimation model under the multiple divisions.
2. The data density estimation method of claim 1, wherein the adaptive stochastic partition comprises one of an adaptive pure stochastic partition, an adaptive histogram transform partition.
3. The data density estimation method according to claim 2,
wherein, in the adaptive pure random division, t sample points are randomly selected in advance before each division, the grid to be divided is selected as the grid containing the most samples, and the division dimension and the cut point are selected at random.
4. The data density estimation method according to claim 2,
wherein, before each division, the adaptive histogram transform division selects grids whose number of sample points is greater than m for division, selects the dimension to be divided as the dimension with the largest sample variance, and selects the cut point as the median of the data in that dimension, until the number of sample points in every grid is smaller than m.
5. A data density estimation apparatus comprising:
the self-adaptive division module generates a plurality of times of self-adaptive random division on the data set;
a density estimation module, which constructs a local density estimation model by using the samples in each division grid in each division;
an overall density estimation model single-estimation module, which splices the local density estimation models together to obtain an overall density estimation model under one random division; and
an overall density estimation model integration module, which integrates the overall density estimation models of the multiple divisions to obtain the density estimation of the data.
6. A data regression method comprising the steps of:
generating a plurality of times of self-adaptive random division for the data set;
obtaining a local regression model on each division grid, and splicing to obtain an overall regression model;
and integrating all the overall regression models to obtain an integrated model.
7. The data regression method of claim 6, wherein the adaptive stochastic partition comprises one of an adaptive pure stochastic partition, an adaptive histogram transform partition, and a stochastic adaptive polygon partition.
8. The data regression method of claim 6,
the local regression model adopts a support vector machine regression (SVR) or a local average method.
9. A data regression device, comprising:
the self-adaptive division module generates a plurality of times of self-adaptive random division on the data set;
the local regression module is used for obtaining a local regression model on each division grid;
an overall regression module, which splices the local regression models together to obtain an overall regression model; and
an overall regression model integration module, which integrates all the overall regression models to obtain an integrated model and thereby the regression analysis of the data.
10. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method of any one of claims 1 to 4 and 6 to 8.
11. A computer readable storage medium, characterized in that it stores at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method according to any one of claims 1 to 4, 6 to 8.
CN202010525621.7A 2020-02-06 2020-06-10 Data density estimation and regression method, corresponding device, electronic device, and medium Pending CN113221065A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN2020100815328 2020-02-06
CN202010081532 2020-02-06
CN202010502262 2020-06-04
CN2020105022623 2020-06-04

Publications (1)

Publication Number Publication Date
CN113221065A true CN113221065A (en) 2021-08-06

Family

ID=77085713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010525621.7A Pending CN113221065A (en) 2020-02-06 2020-06-10 Data density estimation and regression method, corresponding device, electronic device, and medium

Country Status (1)

Country Link
CN (1) CN113221065A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792945A (en) * 2021-11-17 2021-12-14 西南交通大学 Dispatching method, device, equipment and readable storage medium of commercial vehicle
CN113792945B (en) * 2021-11-17 2022-02-08 西南交通大学 Dispatching method, device, equipment and readable storage medium of commercial vehicle
CN114861776A (en) * 2022-04-21 2022-08-05 武汉大学 Dynamic self-adaptive network anomaly detection method based on artificial immunity technology
CN114861776B (en) * 2022-04-21 2024-04-09 武汉大学 Dynamic self-adaptive network anomaly detection method based on artificial immunity technology
CN117148017A (en) * 2023-10-27 2023-12-01 南京中鑫智电科技有限公司 High-voltage casing oil gas remote monitoring method and system
CN117148017B (en) * 2023-10-27 2023-12-26 南京中鑫智电科技有限公司 High-voltage casing oil gas remote monitoring method and system
CN117909852A (en) * 2024-03-19 2024-04-19 山东省地矿工程勘察院(山东省地质矿产勘查开发局八〇一水文地质工程地质大队) Monitoring data state division method for hydraulic loop ecological data analysis
CN117909852B (en) * 2024-03-19 2024-05-24 山东省地矿工程勘察院(山东省地质矿产勘查开发局八〇一水文地质工程地质大队) Monitoring data state division method for hydraulic loop ecological data analysis

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210806