CN113221065A - Data density estimation and regression method, corresponding device, electronic device, and medium

Info

Publication number
CN113221065A
Authority
CN
China
Prior art keywords
division
data
regression
density estimation
adaptive
Prior art date
Legal status
Pending
Application number
CN202010525621.7A
Other languages
Chinese (zh)
Inventor
杭汉源
林宙辰
Current Assignee
Beijing Samsung Telecom R&D Center
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Original Assignee
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Samsung Telecommunications Technology Research Co Ltd, Samsung Electronics Co Ltd filed Critical Beijing Samsung Telecommunications Technology Research Co Ltd
Publication of CN113221065A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]


Abstract

The application provides a big data density estimation method, a large-scale regression method, corresponding devices, an electronic device, and a medium. The data density estimation method includes the steps of: generating a plurality of adaptive random divisions of the data set; in each division, constructing a local density estimation model from the samples in each division grid; splicing the local density estimation models together to obtain an overall density estimation model under that random division; and integrating the overall density estimation models under the plurality of divisions.

Description

Data density estimation and regression method, corresponding device, electronic device, and medium
Technical Field
The invention relates to big data analysis in the field of artificial intelligence, and in particular to a big data density estimation method, a large-scale regression method, a big data density estimation device, and a large-scale regression device, each based on adaptive random partitioning and model integration.
Background
With the diversification of lifestyles and the development of the digital information era, the scale and complexity of the big data being generated are growing rapidly; big data arises from the convergence of three main technical trends: massive transaction data, massive interaction data, and massive data processing. Big data is characterized by huge volume, diverse data types, high velocity, and low value density. The analysis of large amounts of big data is therefore complicated and places demands on the speed and efficiency of data analysis.

Big data analysis is the process of extracting hidden, previously unknown, but potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random real-world data; it discovers this knowledge by analyzing the data without imposing explicit assumptions.
Big data analytics typically involve two aspects: density estimation of big data and regression analysis of big data.
Disclosure of Invention
Density estimation and regression analysis are important research directions for unsupervised learning in statistical machine learning; as basic learning tasks they play a key role within many more advanced learning tasks. However, classical density estimation and regression analysis algorithms cannot effectively process data that is both high-dimensional and large in volume. The invention therefore establishes unsupervised machine learning algorithms for density estimation and regression analysis; the algorithms exploit the adaptivity of the partitioning, have higher stability, can be combined with parallel computation to accelerate the running speed, and show good prediction accuracy and high training speed on real large-scale data sets.
According to an aspect of the present invention, there is provided a data density estimation method, including the steps of: generating a plurality of times of self-adaptive random division for the data set; in each division, constructing a local density estimation model by using the samples in each division grid respectively; splicing the local density estimation models together to obtain an overall density estimation model under random division; and integrating the overall density estimation model under the multiple divisions.
Wherein the adaptive random partitioning comprises one of adaptive pure random partitioning and adaptive histogram transformation partitioning.
The local density estimation model takes, for each grid, the ratio of the proportion of samples falling in that grid (its sample count divided by the total number of samples) to the size of the grid.
Model integration takes the combined result of the overall density estimation models under a plurality of random divisions as the final output of the model.

The overall density estimation models under the plurality of random divisions are integrated by averaging.
In the adaptive purely random division, t sample points are randomly drawn before each split, the grid to be divided is chosen as the grid containing the most of these sample points, and the division dimension and cut point are chosen at random.

In the adaptive histogram transform division, before each split the grids containing more than m sample points are selected for division, the dimension to be divided is chosen as the dimension with the largest sample variance, and the cut point is chosen as the median of the data in that dimension; division stops once the number of sample points in every grid is smaller than m.
The data density estimation method further comprises the following steps: rotation, stretch, and translation transformations are randomly performed on the data of the data set prior to partitioning.
The data density estimation method further comprises the following steps: prior to the adaptive stochastic partitioning, the accuracy of the data in the data set is determined.
In the data density estimation method, before the adaptive random division, the extreme values of the data in the data set are judged and accepted or rejected.
In the data density estimation method, the method further includes: before the self-adaptive random division, whether the data in the data set belong to abnormal samples is judged, and when the data in the data set belong to the abnormal samples, the abnormal samples of the data in the data set are screened out.
According to an aspect of the present invention, there is also provided a data density estimation apparatus including: an adaptive division module, which generates a plurality of adaptive random divisions of the data set; a density estimation module, which, in each division, constructs a local density estimation model from the samples in each division grid; an overall density estimation module, which splices the local density estimation models together to obtain an overall density estimation model under one random division; and an overall density estimation model integration module, which integrates the overall density estimation models of the multiple divisions.
According to an aspect of the present invention, there is also provided a data regression method, including the steps of: generating a plurality of times of self-adaptive random division for the data set; obtaining a local regression model on each division grid, and splicing to obtain an overall regression model; and integrating all the integral regression models to obtain an integrated model, and obtaining regression analysis of data.
The self-adaptive random division comprises one of self-adaptive pure random division, self-adaptive histogram transformation division and random self-adaptive polygon division.
The local regression model adopts a support vector machine regression (SVR) or a local average method.
The model integration expression takes the comprehensive result of the integral regression model under a plurality of random divisions as the final result of the model.
Wherein, the integration of the integral regression model under a plurality of random divisions adopts a simple average method or a weighted average method.
In the adaptive purely random division, t sample points are randomly drawn before each split, the grid to be divided is chosen as the grid containing the most of these sample points, and the division dimension and cut point are chosen at random.

In the adaptive histogram transform division, before each split the grids containing more than m sample points are selected for division, the dimension to be divided is chosen as the dimension with the largest sample variance, and the cut point is chosen as the median of the data in that dimension; division stops once the number of sample points in every grid is smaller than m.
The data regression method further comprises the following steps: randomly generated rotational, stretching and translation transformations are performed on the data of the data set prior to partitioning.
The data regression method further comprises: prior to the adaptive stochastic partitioning, the accuracy of the data in the data set is determined.
In the data regression method, before the self-adaptive random division, the extreme values of the data in the data set are judged and accepted or rejected.
According to an aspect of the present invention, there is also provided a data regression apparatus, including: the self-adaptive division module generates a plurality of times of self-adaptive random division on the data set; the local regression module is used for obtaining a local regression model on each division grid; the integral regression module is used for splicing the local regression models to obtain an integral regression model; and the integral regression module integration module is used for integrating all integral regression models to obtain an integrated model and obtain regression analysis of data.
According to an aspect of the present invention, there is also provided an electronic apparatus, including: one or more processors; a memory; one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to: any of the above-described data density estimation methods and data regression methods are performed.
According to an aspect of the present invention, there is also provided a computer-readable storage medium storing at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement any one of the above-mentioned data density estimation method and data regression method.
Compared with the prior art, the method has the characteristics of robustness and suitability for large-scale data.
Density estimation for big data:
When the prior art uses a kernel function method for density estimation, every test point is affected by all sample points, so abnormal points present in the data may reduce the estimation accuracy at the test point. The present method differs in two respects. First, because the data is divided in space, when the data contains abnormal points, the region affected by an abnormal point is essentially the grid region into which that point falls. Second, the subsequent model ensemble learning averages the influence of an abnormal point against the nearby normal points, further reducing its effect. The random forest density estimation model, as the preferred choice, therefore has relatively strong robustness.
In the prior art, the density estimation cannot be effectively carried out on large-scale data due to the large calculation amount of the histogram estimation and the kernel function method. In contrast, the model of the application can achieve the purpose of processing large-scale data by fully utilizing parallel computing resources.
In the present application, the following two-step method is used for space division: firstly, dividing a sample space into a plurality of small blocks, secondly, constructing a random density sub-tree on each small block, and finally, splicing all sub-trees into a density tree on the whole space. Since the adaptive random division is adopted in the method, the two-step division mode enables the subtrees in a single tree to have parallelism without changing the random division structure. In addition, the random forest algorithm can perform parallel computation among trees, so that the problem of density estimation of large-scale data can be solved through simultaneous parallelism in trees and among trees.
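To illustrate the intended in-tree and between-tree parallelism, the sketch below (an assumed arrangement, not the patented implementation) dispatches the per-tree work of the ensemble to separate processes using Python's standard library; build_density_tree is a hypothetical placeholder for "one adaptive random division plus local density models".

from concurrent.futures import ProcessPoolExecutor

import numpy as np


def build_density_tree(data, seed):
    # Hypothetical per-tree routine: generate one adaptive random division of
    # `data` and fit a local density model in every grid; details omitted here.
    return {"seed": seed, "n_samples": len(data)}  # placeholder for a fitted tree


def build_forest(data, n_trees=50, n_workers=4):
    # Each tree is independent, so the trees can be built on separate processes.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(build_density_tree, data, s) for s in range(n_trees)]
        return [f.result() for f in futures]


if __name__ == "__main__":
    forest = build_forest(np.random.default_rng(0).normal(size=(10000, 3)))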
Regression analysis for big data:
in the big data era, with the development of data generation, collection and storage technology, the data scale shows explosive growth, and has important significance for processing and analyzing large-scale data, exploring and disclosing social operation mode and objective rule and promoting scientific and technological development. Many problems in real life can be abstracted into large-scale regression problems, such as voice recognition, audio information retrieval and the like, and the method can also be applied to other large-scale regression tasks, such as an age prediction task in image recognition, position prediction of a 5G terminal, 5G wireless network flow prediction, 5G mobile communication network planning and the like.
In order to solve the problems of low calculation efficiency and insufficient prediction precision of large-scale samples and high-dimensional data in the prior art, a sample set is randomly divided into a plurality of subsets according to a new mode, so that regression models such as mean value regression and support vector machine regression can be applied to each sample subset, the regression models can be well combined with parallel calculation, each subtask in the regression calculation is distributed to a plurality of cores of a computer according to a division grid, the operation time is saved, and the algorithm efficiency is improved. Meanwhile, the invention generates multiple random partitions and integrates regression models under different partitions, thereby solving the problem that the regression models are discontinuous at partition boundaries and improving the accuracy of regression prediction. As shown in fig. 13, the more times of randomly generating histogram transformation division, the stronger the continuity of the obtained integrated model, and the higher the fitting accuracy to the data; and in addition, parallel computing can be combined, subtasks in the parallel computing are respectively distributed to a plurality of cores of the computer, and the prediction accuracy is improved while the running speed is still kept high.
Drawings
The foregoing and/or other aspects will become apparent and more readily appreciated from the following description of the exemplary embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 shows the histogram density estimation in the one-dimensional case.
Fig. 2 shows the kernel density estimation in the one-dimensional case.
Fig. 3(a) shows common kernel function forms and fig. 3(b) shows Gaussian kernel functions at different bandwidths.
Fig. 4 shows a specific partitioning diagram of purely random partitioning.
FIG. 5(a) is the distribution of samples in the original space; fig. 5(b) and 5(c) are schematic diagrams of possible results of adaptive histogram conversion partitioning.
Fig. 6 shows a box plot of extreme value determination.
Fig. 7 shows a schematic of the general data (left) and its regression model (right).
Fig. 8 shows a schematic of the regression of data with a straight line (left) and its mean square error (right).
Fig. 9 shows a polygon division diagram (left), with the water cube being an example (right).
Fig. 10 shows a schematic of the stitching Gaussian process spatial interpolation (left) and Gaussian process regression (right).
FIG. 11 shows a schematic diagram of support vector machine regression.
Fig. 12 shows a flowchart of a density estimation method based on adaptive random partitioning and simple model integration according to a first embodiment of the present invention.
Fig. 13 shows a schematic diagram comparing a conventional pure random partitioning with an adaptive pure random partitioning according to the present invention.
FIG. 14 shows a flow chart of the adaptive pure stochastic partition method employed in step 1207 of the density estimation method based on adaptive stochastic partition and simple model integration according to the first embodiment of the present invention.
FIG. 15 shows a flow chart of the adaptive histogram transform partitioning method employed in step 1207 of the density estimation method based on adaptive stochastic partitioning and simple model integration according to the first embodiment of the present invention.
Fig. 16 shows a flowchart of a density estimation method based on a histogram transformation division and boosting algorithm according to a second embodiment of the present invention.
Fig. 17 shows a block diagram of a density estimation apparatus based on adaptive random partitioning and simple model integration according to a third embodiment of the present invention.
Fig. 18 is a block diagram showing a density estimation apparatus based on a histogram transform division and boosting algorithm according to a fourth embodiment of the present invention.
Fig. 1a shows a block diagram of a probability density clustering method based on a purely random histogram transformation partitioning and integration algorithm according to another embodiment of the present invention.
Fig. 1b shows a block diagram of a probability density clustering method based on a purely random histogram transformation partitioning and boosting algorithm according to another embodiment of the present invention.
Fig. 1c shows a block diagram of a probability density anomaly detection method based on pure random histogram transformation partitioning and random forest according to another embodiment of the present invention.
Fig. 1d shows a block diagram of a probability density clustering method of the probability density anomaly detection method based on K-nearest neighbor and histogram transformation partitioning Bagging algorithm according to another embodiment of the present invention.
Fig. 1e shows a random forest anomaly detection model based on a self-supervised method according to yet another embodiment of the present invention.
Fig. 19 shows a flowchart of a large-scale regression method based on adaptive random partitioning and model integration according to a fifth embodiment of the present invention.
Fig. 19'(a), 19'(b), and 19'(c) illustrate specific examples of a large-scale regression method based on adaptive random partitioning and model integration according to a fifth embodiment of the present invention.
Fig. 20 shows a flowchart of the adaptive polygon partition method adopted in step 1907 of the large-scale regression method based on adaptive random partition and model integration according to the fifth embodiment of the present invention.
Fig. 21 shows a support vector machine employed in step 1911 of the large scale regression method based on adaptive random partitioning and model integration according to the fifth embodiment of the present invention.
FIG. 22 shows the support vector machine regression employed in step 1911 of the large scale regression method based on adaptive random partitioning and model integration according to the fifth embodiment of the present invention.
Fig. 23 is a flowchart illustrating a large-scale regression method based on a histogram transformation partitioning and boosting algorithm according to a sixth embodiment of the present invention.
Fig. 24 is a block diagram illustrating a large-scale regression apparatus based on adaptive random partitioning and model integration according to a seventh embodiment of the present invention.
Fig. 25 shows a block diagram of a large-scale regression device based on a histogram transformation partitioning and boosting algorithm according to an eighth embodiment of the present invention.
Fig. 26 shows a simulation experiment using a stitched gaussian process space interpolation regression and a polygon-divided support vector machine regression on the simulation data.
Fig. 27 shows a simulation experiment in which the continuity gradually increases as the number of random partitions T increases for the support vector machine regression based on random histogram transform partitioning.
Detailed Description
Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.
Density estimation for big data:
the density estimation of big data, namely, the estimation of the distribution density function of random variables based on given samples of big data, the probability density function is one of the core concepts in probability theory and is used for describing the probability distribution obeyed by continuous random variables.
Conventional big data density estimation methods typically employ big data density estimation models, including parametric and non-parametric density estimation models.
In a parametric density estimation model, one assumes that the data distribution follows a certain form, such as linear, linearizable, or exponential, and then searches for a specific solution within that family of objective functions, i.e., determines the unknown parameters of the density function. In parametric discriminant analysis, the randomly valued data samples used as the basis for discrimination are assumed to follow a specific distribution within each possible category. The parametric approach is, however, somewhat paradoxical: the density function to be estimated is completely unknown, yet the method assumes in advance that the data obeys a certain model, which can only be justified by observing the data. Experience and theory show that a large gap often exists between the basic assumptions of a parametric model and the actual physical model, so the method does not always achieve satisfactory results.
Among the nonparametric Density estimation models, there are a Histogram Density estimation (Histogram Density Estimator) model and a Kernel Density estimation (Kernel Density Estimator) model.
The histogram density estimation model is the simplest non-parametric density estimation model. Fig. 1 shows the histogram density estimation in the one-dimensional case. As shown in fig. 1, histogram density estimation first divides the samples into several equal-sized, non-overlapping intervals (or several grids in the higher-dimensional case), and takes, for each interval (or grid), the ratio of the proportion of samples it contains (relative to the total number of samples) to the size of the grid as the estimate of the local density; the abscissa represents the range of x and the ordinate represents the density value at x.
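For concreteness, the following sketch (illustrative bin count and data, not values from the patent) computes a one-dimensional histogram density estimate exactly as described: the fraction of samples in each bin divided by the bin width.

import numpy as np

def histogram_density(samples, n_bins=20):
    counts, edges = np.histogram(samples, bins=n_bins,
                                 range=(samples.min(), samples.max()))
    widths = np.diff(edges)
    density = counts / (len(samples) * widths)   # (n_j / n) / |bin j|
    return density, edges

samples = np.random.default_rng(0).normal(size=1000)
density, edges = histogram_density(samples)
print(density @ np.diff(edges))   # = 1: the estimate integrates to one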
Fig. 2 shows the kernel density estimation in the one-dimensional case. The kernel density estimation model does not rely on prior knowledge about the data distribution and adds no assumptions about it; it studies the distribution characteristics directly from the data sample, and is therefore highly valued in both statistical theory and applications. As shown in fig. 2, the kernel density estimation model places a kernel function at each sample point and sums all kernel functions to obtain the final density estimate; the abscissa represents the range of x and the ordinate represents the density value at x. The kernel function can take various forms: fig. 3(a) shows common kernel function forms and fig. 3(b) shows Gaussian kernel functions (Gaussian Kernel) at different bandwidths. The most commonly used kernel is the Gaussian kernel function, and the bandwidth is an important parameter in kernel density estimation; the quality of the density estimate is closely related to the bandwidth selection. Here, Box refers to a constant density function, Epanechnikov refers to the Epanechnikov kernel function, and Gaussian refers to the Gaussian density function.
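Similarly, a minimal sketch of one-dimensional Gaussian kernel density estimation follows; the bandwidth value is illustrative and not prescribed by the patent.

import numpy as np

def gaussian_kde(samples, query_points, bandwidth=0.3):
    # (n_query, n_sample) matrix of scaled distances between queries and samples
    u = (query_points[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    # average the Gaussian bumps and rescale by the bandwidth
    return kernel.mean(axis=1) / bandwidth

samples = np.random.default_rng(0).normal(size=500)
grid = np.linspace(-4.0, 4.0, 201)
estimate = gaussian_kde(samples, grid)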
Since the amount of large data is enormous, the data is usually divided before data estimation. The traditional large data density estimation method basically adopts feature space division.
Feature space division means dividing the data into a plurality of subsets according to the attributes of the features and a certain rule; each subset may contain several sample points or none at all. Existing feature space partitioning techniques include random partitioning and adaptive partitioning. Random partitioning makes full use of randomness, so that the partition results are diverse and the characteristics of the data can be learned from multiple aspects, but it does not use sample information; existing examples of random partitioning include purely random partitioning, histogram transform partitioning, and the like. Adaptive partitioning takes sample information into account in the partitioning criterion and can improve the accuracy and efficiency of the model on real data, but it lacks the diversity of random partitioning; the most widely applied adaptive partitions include polygon partitioning and the like.
Fig. 4 shows a specific partitioning diagram of purely random partitioning. The key points of purely random partitioning are that, at every split of the samples, the node to be cut, the dimension to be cut, and the cut point are all selected at random. In each split, the node to be cut indicates which grid is divided further, the dimension to be cut indicates along which dimension of the data in that node a hyperplane is constructed to divide the samples, and the cut point indicates the specific position of the dividing hyperplane along the dimension to be cut within the node, quantified by the ratio of the side length of the divided grid to the side length of the grid before division along that dimension.
Fig. 5(a), 5(b), 5(c) are schematic diagrams of possible results of adaptive histogram transform partitioning.
The histogram transformation projects sample points in an original space into a transformation space through rotation, stretching and translation transformation, and the rotation angle, the stretching degree, the translation direction and the translation size are random when the transformation is carried out, then the sample points are divided in the transformation space according to integer points of all dimensions, and then the divided grids are projected back into the original space, so that one division of the samples in the original space is obtained. Fig. 5(a) shows the distribution of samples in the original space, fig. 5(b) shows the distribution of samples after rotational transformation, and fig. 5(c) shows the distribution of samples after stretching and translation transformation, and the transformed samples are divided according to integer points on the basis of the distribution.
In the conventional large data density estimation method, the extreme values in the data need to be determined and rejected. Extreme values are abnormally large or small values in one or more numerical variables contained in a data set, or labeled abnormal data (outliers) in categorical data; they are unreasonable data produced by improper handling of real data during recording, measurement, experiment, or data processing. The existence of abnormal values can have a number of adverse effects on the statistical analysis of the data, such as reducing the persuasiveness and credibility of the data statistics or models. Therefore, determining and removing outliers in the data is an important part of building the model.
In numerical data, extreme value determination is usually based on the absolute magnitude of the values. The sample data is arranged according to the absolute size of the values, and three quartiles are taken from the ordered series, labeled Q1, Q2, and Q3 from small to large, i.e., the values at the 25%, 50%, and 75% positions of the data. Values greater than Q3 + α·IQR or smaller than Q1 - α·IQR are generally considered extreme values, where α is a parameter empirically chosen as 1.5 and IQR is the difference between Q3 and Q1, known as the interquartile range. Fig. 6 shows a box plot of extreme value determination, illustrating the quartiles of the data and the determination boundaries for extreme values. In fig. 6, the horizontal line in the middle of the box is the median, the upper and lower edges of the box are the upper quartile Q3 and the lower quartile Q1, respectively, and the upper and lower "T"-shaped extensions of the box are the extreme value determination boundaries, representing Q3 + 1.5·IQR and Q1 - 1.5·IQR, respectively. The "o" points represent extreme points.
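The quartile rule above can be written compactly; the sketch below uses alpha = 1.5 and illustrative data, both assumptions rather than values taken from the patent.

import numpy as np

def extreme_value_mask(values, alpha=1.5):
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1                                  # interquartile range
    lower, upper = q1 - alpha * iqr, q3 + alpha * iqr
    return (values < lower) | (values > upper)     # True marks an extreme value

data = np.array([1.2, 0.9, 1.1, 1.0, 8.5, 1.3])
print(extreme_value_mask(data))   # flags 8.5 as the only extreme value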
In the conventional large data density estimation method, the accuracy of the data also needs to be judged. Accuracy judgment applies to prediction models: after the model is trained on a portion of the samples, the target values of all samples are predicted, the sample target values are compared with the model predictions via a chosen accuracy evaluation function, and an accuracy assessment is produced for each sample. An accuracy threshold is selected according to the model requirements or by an expert, and sample points whose accuracy falls below the threshold are regarded as extreme points.
The above-described prior art has the following problems or points to be improved:
1) In the prior art, traditional histogram density estimation is a non-continuous, non-smooth density estimation method without a derivative; it loses the spatial relationship between samples, which is unfavorable for analysis. Second, the density function estimated by a histogram is easily affected by how the subinterval boundaries and widths are chosen: for a fixed data set, different boundary choices can yield results with large differences. Finally, the effect of histogram density estimation is also affected by the distribution characteristics of the original data; for example, for heavy-tailed data, histogram density estimation cannot achieve high accuracy in the tail region. For kernel density estimation, the subinterval boundary width is tied to the kernel function and does not take the characteristics of the data into account; in this case, extreme values in the data can strongly disturb the result of the kernel density estimate. Moreover, although kernel density estimation solves the continuity problem, in sparse samples it also suffers from the problem of "density dips".
2) Among existing feature space partitioning techniques, random partitions such as purely random partitioning and histogram transform partitioning do not consider sample information, so they lack adaptivity and the partition efficiency is low: low-density regions of the sample are divided more often than necessary, while high-density regions are divided fewer times than necessary, so the estimation accuracy there is poor. Adaptive partitioning techniques such as polygon partitioning, on the other hand, lose the possibility of generating multiple partitions because they lack randomness. The prior art does not yet offer a partitioning model that has both random diversity and adaptivity.
3) In the existing density estimation model for processing high-dimensional data, neither histogram density estimation nor kernel density estimation can solve the problem that the estimation density is generally small due to the sparsity of high-dimensional space samples. In many practical problems, the support of high-dimensional data can be reduced to low dimension, but both the histogram density estimation method and the kernel density estimation method can only predict in a high-dimension feature vector space, and not only the training speed is slow, but also the estimation effect is poor.
Regression analysis on big data:
regression analysis is the most important basic idea in data analysis and one of the most statistically important theories, and most data analysis problems can be modeled as a regression analysis problem. Regression analysis is analysis for studying the correlation between independent variables and dependent variables, wherein three keywords are correlation, independent variables and dependent variables.
Dependent variables are variables that change as the independent variable changes. In practical applications, the dependent variable characterizes the core appeal of a task and is a key object of scientific research, for example, in the problem of predicting the song release years, people regard the song release years as the dependent variable.
The independent variable is a related variable for explaining the dependent variable, and may be one or more, and may also be generally referred to as an explanatory variable. The task of regression analysis is to try to explain the forming mechanism of the dependent variable by researching the correlation between the independent variable and the dependent variable, thereby achieving the purpose of predicting the dependent variable through the independent variable. For example, in the problem of predicting the release year of a song, 90 independent variables respectively represent the mean value, covariance and the like of timbre, and the regression aims to find the relationship between the timbre of the song and the release year of the song based on various timbre characteristics of the existing song and predict the release year of a new song through an established regression model.
Fig. 7 shows a schematic of the general data (left) and its regression model (right).
As shown in fig. 7, the scatter points represent known data, the horizontal axis represents the independent variable, and the vertical axis represents the dependent variable (for convenience of illustration both are assumed one-dimensional; in practice they may be multi-dimensional). Both lines can serve as regression models of the data, but how is it determined which one works better? To measure the prediction effect of a regression model, the data is divided into two parts: one part, used to discover the rules, is called the training set; the regression prediction model is then tested on the other part, the test set. The regression effect is usually measured by the mean square error (MSE), i.e., how much the predicted results differ from the actual results on the test set, as shown in fig. 8, which shows the regression of data with a straight line (left) and its mean square error (right). The scatter points represent known data, the horizontal axis represents the independent variable, the vertical axis represents the dependent variable, and the straight line represents the regression model; the mean square error can be regarded as the average area of the small squares in the right graph, and the smaller the mean square error, the better the regression prediction effect.
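The evaluation just described amounts to the following sketch, which uses a straight-line fit and an illustrative 70/30 split (both assumptions, not choices made in the patent): fit on the training set, predict on the test set, and report the mean square error.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

n_train = 140                                   # 70% training set, 30% test set
x_tr, y_tr = x[:n_train], y[:n_train]
x_te, y_te = x[n_train:], y[n_train:]

coeffs = np.polyfit(x_tr, y_tr, deg=1)          # straight-line regression model
y_pred = np.polyval(coeffs, x_te)

mse = np.mean((y_te - y_pred) ** 2)             # mean square error on the test set
print(mse)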
Common statistical regression methods can be classified into linear regression, logarithmic linear regression, polynomial regression and the like according to different regression models, but the traditional regression methods usually have strong assumptions on the models, actual data are likely to be far away from the assumptions, and the fixed regression models are difficult to solve the complex regression problem, so that the prediction accuracy is low. In addition, these methods are not suitable for solving the large-scale regression problem, i.e. the data size is very large, and the traditional regression method often needs a long running time, and even the regression result may not be obtained due to insufficient computing resources.
Since the amount of large data is enormous, data is generally divided and extreme values in the data are determined and discarded before data regression is performed. The conventional big data large scale regression method basically employs feature space division, such as the division method described above with reference to fig. 4-5. The extreme value in the data is conventionally determined and discarded, as in the extreme value determination method described above with reference to fig. 6.
In addition, in data regression, a polygon division method is also employed. Fig. 9 shows a polygon division diagram (left), with the water cube being an example (right).
As shown in fig. 9 (left), the polygon partition is based on the nearest neighbor rule: each cell is a polygon bounded by the perpendicular bisectors of the segments connecting neighboring control points. First, a subset of the sample data is selected as a group of control points by simple random sampling, i.e., samples are drawn one by one with equal probability at each draw; any point within a polygon is then closer to that polygon's control point than to the control points of the other polygons. The Water Cube of the Beijing Olympic Games was designed based on this division principle (fig. 9 (right)).
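A minimal sketch of the nearest-neighbour rule behind polygon (Voronoi) partitioning follows: control points are drawn by simple random sampling and every sample is assigned to the cell of its closest control point; the number of cells and the data are illustrative.

import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=(1000, 2))

n_cells = 10
control_points = samples[rng.choice(len(samples), size=n_cells, replace=False)]

# distance from every sample to every control point, then pick the nearest one
dists = np.linalg.norm(samples[:, None, :] - control_points[None, :, :], axis=2)
cell_of_sample = dists.argmin(axis=1)            # Voronoi cell index per sample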
The traditional big data regression method generally adopts a regression model based on a fixed division. When performing regression analysis on large-scale data, the common approach is to divide the feature space, call a local regression model in each division grid, and finally splice the models in the grids together. The splicing process, however, creates the problem that the regression model is discontinuous at the division boundaries; the latest techniques, such as the stitching Gaussian process spatial interpolation method and the polygon partition support vector machine method, provide partial solutions to this problem. Fig. 10 shows a schematic of the stitching Gaussian process spatial interpolation (left) and Gaussian process regression (right).
Stitching Gaussian process spatial interpolation (Patchwork Kriging): this method was proposed by Park and Apley in 2018. It first divides the sample space according to the features (i.e., the independent variables of the regression problem), performs Gaussian process regression within each division grid (i.e., the sample data points are divided into several groups and only the data in each group is used to construct a local regression model), and finally splices the regression models of the groups into an overall regression model. This method can suffer from the regression model being discontinuous at the division boundaries, as shown in fig. 10 (left). Gaussian process regression is a non-parametric statistical regression method whose result gives not only the regression model but also the interval in which the prediction may fall, as shown in fig. 10 (right). To solve the discontinuity of the overall regression model at the division boundaries, the method manufactures artificial observations around the division boundary so as to force the local regression models on the two sides of the boundary to take equal values.
FIG. 11 shows a schematic diagram of support vector machine regression: linear regression in a high-dimensional space is equivalent to a non-linear regression in the original feature space. Polygon Partition Support Vector Machine method (Voronoi Partition Support Vector Machine): this method divides the sample space by polygon partitioning, then performs regression with a support vector machine in each division grid (as shown in fig. 11, the support vector machine is a machine learning model suited to regression on high-dimensional data, i.e., regression problems with many independent variables; its prediction accuracy is high, but its running speed on large-scale data is low), and finally combines the regression models of the grids into a regression model on the whole sample space. In the figure, φ denotes the transformation function from the low-dimensional space to the high-dimensional space.
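For the local regression step, a sketch of fitting support vector machine regression to the samples of a single division grid is given below, using scikit-learn's SVR; the hyperparameters and data are illustrative, not values from the patent.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x_cell = rng.uniform(0.0, 1.0, size=(300, 2))    # samples falling in one grid
y_cell = np.sin(3.0 * x_cell[:, 0]) + 0.1 * rng.normal(size=300)

local_model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(x_cell, y_cell)
y_hat = local_model.predict(x_cell)              # local predictions in this grid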
The above-described prior art has the following problems or points to be improved:
1) Conventional regression methods usually make relatively strong assumptions about the model, such as assuming a specific form of the regression model (for example that the regression equation follows a linear, polynomial, or exponential model), or making assumptions about the data structure, such as assuming that the residuals follow a certain known distribution or that the data is sparse. The actual data is likely to be far from these assumptions, and a fixed simple regression model has difficulty solving complex regression problems, resulting in low prediction accuracy. In addition, these methods are not suitable for large-scale regression: when the data volume is very large, the traditional regression methods often require a long running time, and the regression result may even be unobtainable due to insufficient computing resources.
2) Among existing feature space partitioning techniques, random partitions such as purely random partitioning and histogram transform partitioning do not consider sample information, so they lack adaptivity and the partition efficiency is low: low-density regions of the sample are divided more often than necessary, while high-density regions are divided fewer times than necessary, so the estimation accuracy there is poor. Adaptive partitioning techniques such as polygon partitioning, on the other hand, lose the possibility of generating multiple partitions because they lack randomness. The prior art does not yet offer a partitioning model that has both random diversity and adaptivity.
3) Among existing models for handling large-scale regression, the stitching Gaussian process spatial interpolation method and the polygon partition support vector machine method both use a divide-then-combine approach, but the selection of the division boundaries involves subjective factors, and the regression model cannot be made fully continuous and smooth at the division boundaries, which affects the prediction accuracy. In addition, the stitching Gaussian process spatial interpolation method cannot be combined with parallel computation, so its running speed on large-scale data is low.
Therefore, a large data density estimation method and apparatus, and a large scale regression method and apparatus, which adopt adaptive random data division and can perform model integration, are urgently needed.
The big data density estimation of the invention mainly comprises two parts: performing multiple self-adaptive random division and establishing local and overall density estimation models; and integration of the overall density estimation model under different partitions. The method mainly comprises the following steps: firstly, generating a plurality of times of self-adaptive random division; in each division, constructing a local density estimation model by using the samples in each division grid respectively; splicing the local density estimation models together to obtain an overall density estimation model under a certain random division; and finally integrating the overall density estimation model under the multiple divisions. The large data density estimation of the present invention may be embodied as a method or apparatus.
The adaptive random division may be an adaptive purely random division, an adaptive histogram transform division, or the like. The local density estimation model takes the ratio of the proportion of samples in each grid (relative to the total number of samples) to the size of the grid. Model integration takes the combined result of the overall density estimation models under a plurality of random divisions as the final output of the model; the integration may use a simple average, a weighted average, or the like.
Fig. 12 shows a flowchart of a density estimation method based on adaptive random partitioning and simple model integration according to a first embodiment of the present invention.
In step 1201, a large data set D to be trained is input. In this embodiment, the number of adaptive random divisions of the space is T, so T space division operations are required, and the randomly generated adaptive space division operations are completed in a loop from 1 to T. That is, in step 1203, the division counter t is initialized to 1; in step 1205, it is determined whether t is smaller than T; a "tree" is generated for each divided space. If the determination in step 1205 is yes, then in step 1207 an adaptive division of the sample space is randomly generated for the tree, a local density estimation model is obtained in each division grid (for example through simple averaging or weighted averaging), and the local density estimation models of the grids are pieced together to obtain the t-th overall density estimation model. Then, in step 1209, t is incremented by 1, and the process returns to step 1205.
If the determination at step 1205 is negative, then at step 1211, the T global density estimation models are integrated, for example, by averaging. At step 1213, the integrated density estimation model is output.
The local density estimation model in step 1207 takes, for each grid, the ratio of the proportion of samples falling in the grid (relative to the total number of samples) to the size of the grid, that is, for a grid A_j containing n_j of the n samples,

f̂(x) = (n_j / n) / |A_j|  for x ∈ A_j,

where |A_j| denotes the size (volume) of the grid A_j.
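A minimal sketch of this per-grid density value for an axis-aligned grid follows; the grid bounds and data are illustrative.

import numpy as np

def cell_density(samples, lower, upper):
    # samples: (n, d); lower/upper: (d,) bounds of one axis-aligned grid A_j
    inside = np.all((samples >= lower) & (samples < upper), axis=1)
    volume = np.prod(upper - lower)
    return inside.sum() / (len(samples) * volume)   # (n_j / n) / |A_j|

samples = np.random.default_rng(0).uniform(0.0, 1.0, size=(500, 2))
print(cell_density(samples, np.array([0.0, 0.0]), np.array([0.5, 0.5])))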
in the model integration of step 1211, each adaptive random partition is performed to obtain a corresponding overall density estimation model, and a plurality of overall density estimation models may be integrated by using a plurality of integration methods. The most common method is to take the average value of all the whole density estimation models as the integrated model, and other possible integration methods include a weighted average method and the like, namely, the weight of the density estimation models based on different partitions in the integrated model can be changed.
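Once the T overall models have been evaluated on the query points, the integration step is a single averaging operation; the sketch below covers both the simple average and a weighted average.

import numpy as np

def integrate_models(model_values, weights=None):
    # model_values: (T, n_query) array, row t holds the t-th overall density
    # model evaluated at the query points
    if weights is None:
        return model_values.mean(axis=0)              # simple average
    weights = np.asarray(weights, dtype=float)
    return (weights / weights.sum()) @ model_values   # weighted average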
The adaptive partition of the sample space generated randomly for each tree in step 1207 above may be implemented by two methods, namely, an adaptive pure random partition and an adaptive histogram transform partition.
Fig. 13 shows a schematic diagram comparing a conventional pure random partitioning with an adaptive pure random partitioning according to the present invention.
In the conventional pure random partition of fig. 13 (left side), from the perspective of experimental effect, since no sample information is used in the pure random partition, the partition efficiency may be low, that is, the number of times of dividing the sample low-density area is more than the required number of times; in addition, the estimation accuracy of the sample high-density area may be poor, that is, the sample high-density area is divided less times than required.
As an improvement, in the adaptive purely random partitioning method of the present invention shown in fig. 13 (right side), an adaptive purely random partitioning criterion is used: before each division, a portion of the samples is first drawn at random from the whole sample set, the node into which most of these samples fall is selected as the node to be cut, and once the node to be cut is determined, the dimension to be cut and the position of the cut point are selected at random.
The pure random division does not consider the information of the data, the division efficiency is low, the self-adaptive pure random division can adjust the division result according to the sample distribution, more grids are divided in places with high data density, the division is sparse in places with low data density, and the model effect is greatly improved.
FIG. 14 shows a flow chart of the adaptive pure stochastic partition method employed in step 1207 of the density estimation method based on adaptive stochastic partition and simple model integration according to the first embodiment of the present invention.
Referring to fig. 14, in step 1401, a data set D to be trained is input, and the number of splits (which may also be referred to as the number of cuts) constituting the adaptive random division is set to p, so that p division operations are required, and the randomly generated adaptive division operations are completed in a loop from 1 to p. That is, in step 1403, the split counter i is initialized to 1; in step 1405, it is determined whether i is less than p, and if the determination in step 1405 is yes, then in step 1407 the i-th adaptive division operation is performed. Then, in step 1409, i is incremented by 1, and the process returns to step 1405. Before each division, t sample points are randomly selected in advance, the grid currently containing the most of these sample points is selected as the grid to be divided (also called the grid to be cut), and the division dimension (also called the cutting dimension) and the cut point are selected at random.
If the determination at step 1405 is negative, then at step 1411, p times the adaptive purely random partitioning result is output.
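A sketch of the adaptive purely random division of fig. 14 under the reading above (an assumed implementation with axis-aligned grids) follows: draw t probe points, split the grid containing the most probes along a random dimension at a random cut point, and repeat p times.

import numpy as np

def adaptive_pure_random_partition(samples, p_splits=8, t_probes=32, seed=0):
    rng = np.random.default_rng(seed)
    d = samples.shape[1]
    cells = [(samples.min(axis=0), samples.max(axis=0))]  # start from bounding box
    for _ in range(p_splits):
        probes = samples[rng.choice(len(samples), size=t_probes)]
        # grid to be cut: the grid containing the most probe points
        counts = [np.sum(np.all((probes >= lo) & (probes <= hi), axis=1))
                  for lo, hi in cells]
        lo, hi = cells.pop(int(np.argmax(counts)))
        dim = rng.integers(d)                              # random cutting dimension
        cut = rng.uniform(lo[dim], hi[dim])                # random cut point
        hi_left, lo_right = hi.copy(), lo.copy()
        hi_left[dim], lo_right[dim] = cut, cut
        cells += [(lo, hi_left), (lo_right, hi)]
    return cells                                           # list of (lower, upper) grid bounds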
FIG. 15 is a flowchart of the adaptive histogram transform partitioning method employed in step 1407 of the density estimation method based on adaptive stochastic partitioning and simple model integration according to the first embodiment of the present invention.
In the adaptive histogram transform partitioning method according to the present invention, more sample information is used in the histogram transform partitioning process, yielding an adaptive histogram transform partition. Starting from the transformed sample data, each division considers the number of samples in the existing grids and only further divides the grids whose sample count is greater than a certain value (say m). When dividing, the dimension to be divided is selected so that the sample variance becomes as small as possible, and therefore the dimension with the largest ratio of the sample range to the variance is chosen; in that dimension, if the sample mean falls to the left of the 0.6 quantile, the cut point is chosen as the 0.618 quantile of the data, otherwise the 0.382 quantile is chosen. Division stops once the number of samples in all grids is less than m.
Referring to fig. 15, in step 1501, a data set D to be trained is input, and the limit m on the number of sample points in a grid after division is set. In step 1503, a rotation angle, a stretching degree, and a translation vector are randomly generated; the original training data is transformed into training data in the new space through the randomly generated rotation, stretching, and translation transformations (i.e., rotation, stretching, and translation are performed randomly). In step 1505, it is determined whether there is a grid with more than m sample points. If the determination in step 1505 is yes, then in step 1507, the dimension with the largest sample variance is selected as the dimension to be divided, the median of the data in that dimension is selected as the cut point, and the grid is divided. The process then returns to step 1505 and loops until the number of sample points in all grids is less than m.
If the determination at step 1505 is no, then at step 1511 the partitioned grid is projected back into the original sample space based on the inverse transform corresponding to the generated transform.
At step 1513, the result of the adaptive histogram conversion partition is output.
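A sketch of the adaptive histogram transform division of fig. 15 under the reading above (an assumed implementation) follows: apply a random rotation, stretch, and translation, then repeatedly split any grid holding more than m points along its largest-variance dimension at the median; the grids are represented simply as sets of sample indices.

import numpy as np

def adaptive_histogram_transform_partition(samples, m=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = samples.shape
    # randomly generated rotation (orthogonal matrix), stretch, and translation
    rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
    stretch = rng.uniform(0.5, 2.0, size=d)
    shift = rng.uniform(-1.0, 1.0, size=d)
    z = (samples @ rotation) * stretch + shift             # transformed samples

    pending, done = [np.arange(n)], []
    while pending:
        idx = pending.pop()
        if len(idx) <= m:
            done.append(idx)
            continue
        dim = int(np.argmax(z[idx].var(axis=0)))           # largest-variance dimension
        cut = np.median(z[idx, dim])                       # median as the cut point
        left, right = idx[z[idx, dim] <= cut], idx[z[idx, dim] > cut]
        if len(left) == 0 or len(right) == 0:              # guard against degenerate splits
            done.append(idx)
            continue
        pending += [left, right]
    return done                                            # list of index arrays, one per grid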
Fig. 16 shows a flowchart of a density estimation method based on a histogram transformation division and boosting algorithm according to a second embodiment of the present invention.
Unlike the density estimation model based on the adaptive random partitioning and integration algorithm according to the first embodiment of the present invention shown in fig. 12, the density estimation model based on the histogram transform partitioning and boosting algorithm according to the second embodiment of the present invention moves the deep extraction of sample information from the spatial partitioning step to the density estimation step. In each iteration, the model first partitions the data feature space according to the adaptive histogram transform partitioning method illustrated in fig. 15 to obtain non-overlapping small regions, and builds a local density estimation model on each small region. It should be noted that the model built in each small region is obtained by weighting an element of a certain density function space against the overall density estimation model obtained in the previous iteration. Since density estimation requires the integral of the density function over its domain to be 1, the previously estimated density function and the newly estimated density function are combined with normalized weights, and the weight between the two parts is adjusted to optimize the loss function in this unsupervised setting. Specifically, let the density function space be H, let the loss function be L(F) = -E[log F(X)] (the negative expected log-density), and let the overall density function obtained in the previous iteration be F^(t-1)(x). On a given region, the objective optimization function is

(f_t, α*) = argmin_{f ∈ H, α ∈ [0, 1]} -E[ log( (1 - α) F^(t-1)(X) + α f(X) ) ],

where f ranges over the density function space H and α ∈ [0, 1] is the mixing weight. The resulting local density estimation model then translates the expectation into a sample-weighted empirical form:

(f_t, α*) = argmin_{f ∈ H, α ∈ [0, 1]} -(1/n) Σ_{i=1}^{n} log( (1 - α) F^(t-1)(x_i) + α f(x_i) ),

F^(t)(x) = (1 - α*) F^(t-1)(x) + α* f_t(x).

After the local density estimation model is obtained, the overall density estimation model of this iteration is finally determined by normalized addition.
Referring to fig. 16, in step 1601, a large data set D to be trained is input, the total number of iterations of the lifting algorithm is set to T, and the weak density function space used in the lifting algorithm is H. In this embodiment, the total number of iterations of the lifting algorithm is T, and the operation of the lifting algorithm is completed in a loop from 1 to T. I.e. in step 1603, the number of lifting algorithm iterations t is initialized to 1.
In step 1605, it is determined whether the number of iterations T of the lifting algorithm is less than T.
If the determination at step 1605 is yes, then at step 1607, the following operations are performed:
randomly generating adaptive histogram transformation division for a sample space;
in each division grid, inheriting the density estimation model F(t-1)(x) of the previous iteration, and calculating the weight of each sample point contained in the division as the reciprocal of the density function value obtained in the previous iteration, i.e. wi = 1/F(t-1)(xi);

selecting the optimal density estimator of this iteration from the function space;

combining the density estimation function obtained in the previous iteration with the density estimator selected this time by weighting, and calculating the optimal weight proportion from the empirical distribution, for example by selecting an element ft of the function space H and a weight α in each division grid that minimize the local density estimation loss function L, thereby constructing the new local density estimation model F(t)(x) = (1 - α*)F(t-1)(x) + α* ft(x) and obtaining the local density estimation model;

the local density estimation models in the grids are spliced together to obtain the overall density estimation model F(t)(x).
Then, in step 1609, t is incremented by 1, and the process returns to step 1605.
If the determination at step 1605 is negative, at step 1611, the boosted density estimation model after T boosts is output.
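As a hedged illustration of the loop in steps 1601 to 1611, the sketch below boosts a one-dimensional density estimate by repeatedly mixing in a weak histogram estimator with a weight chosen to minimize the empirical negative log-likelihood; the equal-width histogram, the flat initial density and the grid search over α are simplifying assumptions, not the adaptive histogram transform partition of this embodiment.

import numpy as np

def histogram_density(x_train, bins=16):
    # Weak density estimator: an equal-width histogram over the training range.
    counts, edges = np.histogram(x_train, bins=bins, density=True)
    def f(x):
        i = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
        return np.maximum(counts[i], 1e-12)
    return f

def boost_density(x_train, T=10, alphas=np.linspace(0.05, 0.95, 19), seed=0):
    rng = np.random.default_rng(seed)
    F = lambda x: np.ones_like(np.asarray(x, dtype=float))   # crude flat start on a unit-length support
    for _ in range(T):
        # stand-in for the per-iteration random partition: a histogram on a bootstrap resample
        f_t = histogram_density(rng.choice(x_train, size=x_train.size, replace=True))
        nll = lambda a, F=F, f_t=f_t: -np.mean(np.log((1 - a) * F(x_train) + a * f_t(x_train)))
        a_star = min(alphas, key=nll)                         # empirical optimal mixing weight alpha*
        F = (lambda F, f_t, a: lambda x: (1 - a) * F(x) + a * f_t(x))(F, f_t, a_star)
    return F

# usage sketch: F_hat = boost_density(np.random.randn(1000)); F_hat(np.array([0.0, 1.0]))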
Fig. 17 shows a block diagram of a density estimation apparatus based on adaptive random partitioning and simple model integration according to a third embodiment of the present invention.
The density estimation apparatus according to the third embodiment implements the density estimation method according to the first embodiment described with reference to fig. 12. Referring to fig. 17, the density estimation apparatus according to the third embodiment includes a data input module 1701, a sample adaptive partitioning and density estimation model calculation module 1703, and a density estimation model integration module 1705.
In the above method, we assume that the input domain X ⊂ Rd is compact and non-empty. For a given R > 0, we write BR for the cube in Rd of side length 2R, i.e. BR := [-R, R]d := {x = (x1, ..., xd) ∈ Rd : xi ∈ [-R, R], i = 1, ..., d}, and we fix r ∈ (0, R/2). For 1 ≤ p < ∞, the p-norm of x = (x1, ..., xd) is defined as ||x||p := (|x1|^p + ... + |xd|^p)^(1/p), and the ∞-norm is defined as ||x||∞ := max over i = 1, ..., d of |xi|. For any x ∈ R, ⌈x⌉ denotes the smallest integer greater than or equal to x. For multi-index vectors we use the usual componentwise notation.
The negative log-loss function is explained below:

Let f be the density function of the unknown probability measure P on X. Based on the independent and identically distributed data set D drawn from the distribution P, we aim to construct a measurable function with integral value 1 as the density estimate f̂. We use the negative log-loss, defined as L(f̂, x) := -log f̂(x), to measure how good the density estimate is.
The histogram transformation is explained below:

To clarify the construction of the histogram transformation, we introduce a random triple (R, S, b), whose elements R, S and b represent a rotation matrix, a stretching matrix and a translation vector, respectively. Specifically, R denotes a rotation matrix, a real-valued d × d orthogonal matrix with unit determinant, i.e. the transpose of R equals the inverse of R and det(R) = 1. S denotes a stretching matrix, a positive real-valued d × d diagonal scaling matrix whose diagonal elements si, i = 1, ..., d, are random variables. We collect the diagonal elements into the vector s = (s1, ..., sd) and define the bin-width vector on the input space as h := s^(-1). In addition, b ∈ [0, 1]d is a d-dimensional vector which we call the translation vector. The histogram transform H is then defined as

H(x) := R · S · x + b.
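The short sketch below (an illustrative assumption, not the claimed construction) shows how a concrete (R, S, b) triple can be drawn and how H(x) = R·S·x + b induces histogram cells by taking the integer part of the transformed coordinates; the stretching range is arbitrary and the QR-based rotation ignores the determinant correction for brevity.

import numpy as np

def histogram_transform_bins(X, rng=None):
    # Assign each sample the integer cell index of H(x) = R·S·x + b.
    rng = rng or np.random.default_rng()
    d = X.shape[1]
    R = np.linalg.qr(rng.normal(size=(d, d)))[0]   # orthogonal matrix (rotation up to a reflection)
    s = rng.uniform(1.0, 3.0, size=d)              # stretching factors; the bin widths are h = 1/s
    b = rng.uniform(size=d)                        # translation vector in [0, 1]^d
    H = X @ (R @ np.diag(s)).T + b
    return np.floor(H).astype(int)                 # samples with identical rows share one cell

# usage sketch: cells = histogram_transform_bins(np.random.randn(1000, 2))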
Histogram transform boosting density estimation is explained below:
in this section we focus mainly on boosting algorithms equipped with histogram transform density estimators. We use histogram transformation as a basis learner, which is a weak predictor and has high computational efficiency.
We first introduce the histogram transformation function space. Suppose that {Ht, t = 1, 2, ...} is an independent and identically distributed sequence of histogram transformations, where each Ht is drawn from some probability measure PH. As described above, the lifting algorithm may be viewed as an iterative method for convex optimization of the empirical loss function.

Based on the above description, we propose a gradient boosting algorithm to solve the optimization problem of the empirical loss function, and the randomness of the histogram transformation provides an effective step for boosting. The algorithm proceeds iteratively: for t = 1, ..., T,

Ft(x) = (1 - αt) Ft-1(x) + αt ft(x),

where Ft denotes the density estimate obtained after the t-th iteration, ft denotes the t-th base learner, and the iteration step size satisfies αt ∈ (0, 1). A simple calculation shows that Ft can be written as a weighted combination of the base learners f1, ..., ft (and of the initial estimate), with weights

wt,j = (1 - αt) ··· (1 - αj+1) · αj, j = 1, ..., t.
Our aim is to search for the base learner ft and the iteration step size αt used at each step so that the loss corresponding to Ft becomes smaller after each iteration. In the t-th iteration, for an arbitrary αt ∈ (0, 1), the minimization of the empirical loss is equivalent to minimizing

-(1/n) Σ_{i=1}^{n} log( 1 + εt f(xi)/Ft-1(xi) ),

where εt = αt/(1 - αt). Applying a Taylor expansion to it, we obtain

-(εt/n) Σ_{i=1}^{n} ωt,i f(xi) + O(εt²),

where ωt,i = 1/Ft-1(xi). For a sufficiently small εt (or, equivalently, αt), we can ignore the higher-order terms and find the optimal maximum-gradient direction as

ft = argmax over f in the function space of (1/n) Σ_{i=1}^{n} ωt,i f(xi).

We then find the step size αt by a line search to ensure that the updated learner Ft is still a probability density. Here O(·) denotes terms of the same order.
In summary, the density estimation performed by the histogram transform boosting method follows the iterative procedure described above.
the data input module 1701 inputs a large data set D to be trained and the number of adaptive random division spaces T. Therefore, T times of space division operations are required, and randomly generated adaptive space division operations are completed in a loop from 1 to T.
In the sample adaptive partitioning and density estimation model calculation module 1703, a "tree" is generated for each of the T spatial partitions of the large data set D: an adaptive partition of the sample space is randomly generated for each tree, a local density estimation model is obtained in each partition grid by, for example, simple averaging or weighted averaging, and the local density estimation models in the grids are spliced together to obtain the overall density estimation model of that partition.
In the density estimation model integration module 1705, the T whole density estimation models corresponding to the T divisions, which are obtained in the sample adaptive division and density estimation model calculation module 1703, are integrated by, for example, averaging, and the integrated density estimation model is output.
The local density estimation model takes the ratio of the proportion of samples falling in each grid (relative to the total number of samples) to the size of the grid, i.e. in a grid A the estimated density is #{xi ∈ A} / (n · |A|), where |A| denotes the volume of the grid.
In the model integration performed by the density estimation model integration module 1705, an overall density estimation model is obtained for each adaptive random division, and the multiple overall density estimation models can be integrated by a variety of integration methods. The most common method is to take the average of all the overall density estimation models as the integrated model; other possible integration methods include weighted averaging, i.e. the weights of the density estimation models based on different partitions within the integrated model can be varied.
The adaptive partitioning of the sample space is randomly generated for each tree in the sample adaptive partitioning and density estimation model calculation module 1703, and two methods of adaptive pure random partitioning and adaptive histogram transformation partitioning can be adopted.
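To make the two modules above concrete, here is a hedged sketch of one possible implementation: each random histogram-transform partition yields a piecewise-constant density (sample fraction in the grid divided by the grid volume), and the T overall models are integrated by simple averaging; the transform parameters and the dictionary-based bookkeeping are illustrative assumptions.

import numpy as np

def fit_partition_density(X, rng):
    # One overall density model: a random histogram transform plus per-grid frequency / volume.
    n, d = X.shape
    R = np.linalg.qr(rng.normal(size=(d, d)))[0]
    s = rng.uniform(1.0, 3.0, size=d)
    b = rng.uniform(size=d)
    A = R @ np.diag(s)
    vol = 1.0 / np.prod(s)                          # grid volume in the original space (|det R| = 1)
    counts = {}
    for key in (tuple(k) for k in np.floor(X @ A.T + b).astype(int)):
        counts[key] = counts.get(key, 0) + 1
    def density(x):
        key = tuple(np.floor(A @ x + b).astype(int))
        return counts.get(key, 0) / (n * vol)       # (#samples in the grid / n) / grid size
    return density

def ensemble_density(X, T=50, seed=0):
    rng = np.random.default_rng(seed)
    models = [fit_partition_density(X, rng) for _ in range(T)]
    return lambda x: float(np.mean([m(x) for m in models]))   # simple-average integration

# usage sketch: f_hat = ensemble_density(np.random.randn(2000, 2)); f_hat(np.zeros(2))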
Fig. 18 is a block diagram showing a density estimation apparatus based on a histogram transform division and lifting algorithm according to a fourth embodiment of the present invention.
The density estimation apparatus according to the fourth embodiment implements the density estimation method according to the second embodiment described with reference to fig. 16.
Referring to fig. 18, the density estimation apparatus according to the fourth embodiment includes a data input module 1801, a sample adaptive partitioning and density estimation model calculation module 1803, and a density estimation model output module 1805.
Referring to fig. 18, in the data input module 1801, a large data set D to be trained is input, the total number of iterations of the lifting algorithm is set to T, and the weak density function space used in the lifting algorithm is H.
In the sample adaptive partitioning and density estimation model calculation module 1803, the T iterations of the lifting algorithm are carried out. In each iteration, an adaptive histogram transform partition of the sample space is randomly generated; in each division grid, the density estimation model F(t-1)(x) of the previous iteration is inherited, and the weight of each sample point contained in the division is calculated as the reciprocal of the density function value obtained in the previous iteration, i.e. wi = 1/F(t-1)(xi); the optimal density estimator of this iteration is selected from the function space; the density estimation function obtained in the previous iteration and the density estimator selected this time are combined by weighting, and the optimal weight proportion is calculated from the empirical distribution, for example by selecting an element ft of the function space H and a weight α in each division grid that minimize the local density estimation loss function L, constructing the new local density estimation model F(t)(x) = (1 - α*)F(t-1)(x) + α* ft(x); and the local density estimation models in the grids are spliced together to obtain the overall density estimation model F(t)(x).
If T iterative computations of the lifting algorithm are completed, the density estimation model output module 1805 outputs the lifted density estimation model after T lifts.
Fig. 1a shows a block diagram of a probability density clustering method based on a purely random histogram transformation partitioning and integration algorithm according to another embodiment of the present invention.
The probability density clustering method based on pure random histogram transformation partitioning combines the idea of pure random histogram transformation partitioning with the probability-density-based clustering method, and is a specific application of the density estimation method based on pure random histogram transformation partitioning. In general, for an unlabeled sample data set D = {x1, ..., xn}, an estimate f̂ of the unknown distribution of the sample data source is first obtained with the density estimation method based on histogram transformation partitioning; then, using the estimated density function f̂, the sample points with higher probability density, {xi : f̂(xi) ≥ λ}, are screened out by a level-set method; and the final clustering result is obtained with the help of a cluster tree.
Referring to FIG. 1a, in step 1a01, the training data D = {x1, ..., xn}, the cluster-tree distance parameter h, and the set of level-set parameter values λ ∈ {λ1, ..., λL} are input. In step 1a03, the density estimate f̂ of the training data is calculated using the density estimation model based on pure random histogram transformation partitioning and the integration algorithm.
In this embodiment, the total number of cycles of the probability density clustering method based on the pure random histogram transformation partitioning and integration algorithm is L, and the operation of the algorithm is completed in the cycle from 1 to L. That is, in step 1a03, the loop number variable i is initialized to 1.
In step 1a05, it is determined whether the loop variable i is less than L, the number of level-set parameter values.
If the determination at step 1a05 is yes, then at step 1a07 the following operations are performed: the sample points whose probability density is greater than the level-set parameter are screened out; points at close distance (based on the cluster-tree parameter h) in the screened sample set are linked; and the connected components C of the resulting labelled graph are calculated based on the DBSCAN algorithm.
Specifically, the level-set parameter λi determines a set of nodes Vi := {xj : f̂(xj) ≥ λi} and a corresponding set of edges Ei; a graph Gi = (Vi, Ei) is then constructed from these two sets, and the connected components C(λi) of the graph Gi are calculated.
Then, in step 1a09, i is increased by 1, and the process returns to step 1a05.
If the determination in step 1a05 is no, then in step 1a11, the final clustering tree T is obtained under different level set parameters.
Next, in step 1a13, the integrated clustering tree model is output.
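A hedged sketch of steps 1a05 to 1a11 for a single level-set value is given below: the density values dens are assumed to be precomputed by any of the estimators above, the h-neighbourhood graph is built by brute force, and its connected components are found with a small union-find; all names are illustrative.

import numpy as np

def level_set_clusters(X, dens, lam, h):
    # Cluster the points whose estimated density is at least lam, linking points closer than h.
    core = np.flatnonzero(dens >= lam)
    parent = {int(i): int(i) for i in core}
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]           # path halving
            i = parent[i]
        return i
    for a in range(len(core)):
        for b in range(a + 1, len(core)):
            i, j = int(core[a]), int(core[b])
            if np.linalg.norm(X[i] - X[j]) <= h:    # edge of the h-neighbourhood graph
                parent[find(i)] = find(j)
    return {i: find(i) for i in parent}             # connected-component label per high-density point

# usage sketch: labels = level_set_clusters(np.random.randn(200, 2), np.random.rand(200), 0.5, 0.7)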
Fig. 1b shows a block diagram of a probability density clustering method based on a purely random histogram transformation partitioning and lifting algorithm according to another embodiment of the present invention.
The probability density clustering method based on pure random histogram transformation partitioning and the lifting algorithm combines the probability-density-based clustering method with the idea of pure random histogram transformation partitioning and with a lifting algorithm for improving accuracy; the accuracy of the algorithm is improved step by step during the iterative integration, and the method is a specific application of the density estimation method based on histogram transformation partitioning and the lifting algorithm. In general, for an unlabeled sample data set D = {x1, ..., xn}, an estimate f̂ of the unknown distribution of the sample data source is first obtained with the density estimation method based on histogram transformation partitioning and the lifting algorithm; then, using the estimated density function f̂, the sample points with higher probability density, {xi : f̂(xi) ≥ λ}, are screened out by a level-set method; and the final clustering result is obtained with the help of a cluster tree.
Referring to FIG. 1b, in step 1b01, the training data D = {x1, ..., xn}, the cluster-tree distance parameter h, and the set of level-set parameter values λ ∈ {λ1, ..., λL} are input. In step 1b03, the density estimate f̂ of the training data is calculated using the density estimation model based on pure random histogram transformation partitioning and the lifting algorithm.
In this embodiment, the total number of cycles of the probability density clustering method based on the pure random histogram transformation partitioning and lifting algorithm is L, and the operation of the algorithm is completed in a cycle from 1 to L. That is, in step 1b 03, the loop number variable i is initialized to 1.
In step 1b05, it is determined whether the loop variable i is less than L, the number of level-set parameter values.
If the determination at step 1b05 is yes, then at step 1b07 the following operations are performed: the sample points whose probability density is greater than the level-set parameter are screened out; points at close distance (based on the cluster-tree parameter h) in the screened sample set are linked; and the connected components C of the resulting labelled graph are calculated based on the DBSCAN algorithm.
Specifically, the level-set parameter λi determines a set of nodes Vi := {xj : f̂(xj) ≥ λi} and a corresponding set of edges Ei; a graph Gi = (Vi, Ei) is then constructed from these two sets, and the connected components C(λi) of the graph Gi are calculated.
Then, in step 1b 09, i is increased by 1, and the process returns to step 1b 05.
If the determination in step 1b 05 is negative, then in step 1b11, the final clustering tree T is obtained under different level set parameters.
Next, in step 1b13, the integrated clustering tree model is output.
Fig. 1c shows a block diagram of a probability density anomaly detection method based on pure random histogram transformation partitioning and random forest according to another embodiment of the present invention.
The probability density anomaly detection method based on pure random histogram transformation partitioning and random forest is a specific application of the density estimation method based on pure random histogram transformation partitioning and random forest. In general, for an unlabeled sample data set D = {x1, ..., xn}, an estimate f̂ of the unknown distribution of the sample data source is first obtained with this density estimation method; on the basis of the estimated density function f̂, the points whose probability density is smaller than the set probability density parameter ρ are determined to belong to the abnormal samples, i.e. the set {x : f̂(x) < ρ}.
Referring to FIG. 1c, in step 1c01, the training data D = {x1, ..., xn}, i.e. the training data set, and the density boundary parameter ρ are input. In step 1c03, the estimate f̂ of the density function of the unknown distribution is calculated using the density estimation based on pure random histogram transformation partitioning and random forest, obtaining an estimate of the density at each location. Next, in step 1c05, the set of sample points whose estimated probability density is smaller than the density boundary, i.e. the abnormal sample point set {xi : f̂(xi) < ρ}, is output.
Suppose that X ⊂ Rd is a non-empty subset, μ is the Lebesgue measure with μ(X) > 0, P is a probability measure supported on X, and P is absolutely continuous with respect to μ with density f. Let the training data D := (x1, ..., xn) be observations drawn independently and identically distributed from P. We write BR for the cube in Rd of side length 2R, i.e. BR := [-R, R]d := {x = (x1, ..., xd) ∈ Rd : xi ∈ [-R, R], i = 1, ..., d}, and we denote the complement of Br by Br^c.
For our tree-based algorithm, we first introduce the pure random tree (xTree), also called the extremely random tree, which is better suited to unsupervised learning than conventional random forests, whose impurity-based splitting criteria are designed for supervised learning problems.
Mathematically, let Z be the splitting criterion of a tree, taking values in some space, with its probability measure denoted PZ. Since the splitting of the tree takes place in the space Br, we denote the collection of nodes created after p splits (p-split) of Br by AZ,p, where Aj denotes the j-th node. This collection of all leaf nodes also represents the tree itself.
An extremely random tree (xTree) partitions the root node (the feature space) completely (extremely) at random, by randomly selecting the node to split, the dimension to split, and the split point. An xTree with p splits can be constructed iteratively, where the i-th step (i = 1, ..., p) is described by a random vector Qi := (Li, Ri, Si). The first element Li denotes the node to be split, selected with equal probability among the previously generated nodes. The second element Ri ~ Unif{1, ..., d} denotes the dimension along which Li is split, i.e. the dimension Ri is chosen uniformly over all dimensions. The third element Si ~ Unif[0, 1] describes the split point, expressed as the proportion of the length of the Ri-th dimension of the node at which the i-th split is made. Note that Qi, i = 1, ..., p, are independent and identically distributed.
In the following we assume that μ(A) > 0 for all A ∈ AZ,p, because if μ(A(x)) = 0 the density at x is estimated as 0. Let Dn denote the empirical measure of the sample. For x ∈ Aj ∈ AZ,p, the extremely random tree (xTree) density estimator may be written as

f̂Z,p(x) := ( Σi 1{xi ∈ A(x)} ) / ( n · μ(A(x)) ),

where Aj is also written as A(x); the sum in the numerator counts the observations falling in Aj. Let Z1, ..., ZT denote the splitting criteria that generate T random density trees. The density estimator of xForest may then be expressed as the average

f̂(x) := (1/T) Σ_{t=1}^{T} f̂Zt,p(x).
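For illustration, the following sketch builds one xTree density estimator with p purely random splits on the bounding box of the data and averages T of them into an xForest estimate; the brute-force cell lookup and the use of the bounding box in place of Br are simplifying assumptions.

import numpy as np

def xtree_density(X, p, low, high, rng):
    # One purely random density tree on the box [low, high] with p splits.
    n, d = X.shape
    cells = [(low.copy(), high.copy())]
    for _ in range(p):                                        # the i-th split Q_i = (L_i, R_i, S_i)
        j = int(rng.integers(len(cells)))                     # node to split, chosen uniformly
        lo, hi = cells.pop(j)
        dim = int(rng.integers(d))                            # dimension to split, chosen uniformly
        cut = lo[dim] + rng.uniform() * (hi[dim] - lo[dim])   # split point as a uniform proportion
        left_hi, right_lo = hi.copy(), lo.copy()
        left_hi[dim], right_lo[dim] = cut, cut
        cells.extend([(lo, left_hi), (right_lo, hi)])
    def density(x):
        for lo, hi in cells:
            if np.all(x >= lo) and np.all(x <= hi):
                inside = np.all((X >= lo) & (X <= hi), axis=1).sum()
                return inside / (n * np.prod(hi - lo))        # empirical mass / cell volume
        return 0.0
    return density

def xforest_density(X, p=20, T=10, seed=0):
    rng = np.random.default_rng(seed)
    low, high = X.min(axis=0), X.max(axis=0)
    trees = [xtree_density(X, p, low, high, rng) for _ in range(T)]
    return lambda x: float(np.mean([t(x) for t in trees]))

# usage sketch: f_hat = xforest_density(np.random.randn(1000, 2)); f_hat(np.zeros(2))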
Core-sample in DBSCAN is explained as follows.
A sample x ∈ D is called a core-sample if #{N(x, ε) ∩ D} ≥ MinPts, where N(x, ε) := {x' : ||x - x'|| ≤ ε}, ε is the neighborhood radius of the sample x, and MinPts, the minimum number of samples required in the neighborhood N(x, ε), is also the core-sample threshold.
The core-sample depends on two hyper-parameters ε, MinPts. On the one hand, ε is a hyper-parameter, which acts like the bandwidth in kernel density estimation. The number of samples falling within the epsilon radius neighborhood of an x point describes the relative density value of the x point. On the other hand, MinPts is a hyper-parameter, which is a threshold to determine whether a sample is a core-sample.
Now, using the xForest density estimate f̂ trained on the sample set D, we can extend the concept of core-sample to the situation where the sample density is obtained explicitly.
The core-sample in xForest is explained below.
A sample x ∈ D is called a core-sample in xForest if its density estimate satisfies f̂(x) ≥ λ, where λ is the density threshold.
In xForest, the generalization of core-sample has only one hyper-parameter associated with the set of density levels. Compared to DBSCAN, it utilizes an explicit form of density estimation.
Now, based on the xForest density estimate, we can propose an xForest clustering algorithm. First, we generate T trees with p splits each on the training data D and construct the xForest density estimator, and all samples whose density is not less than the density threshold λ are designated core samples. An ε-radius neighbor graph G is then built over all the core samples, and m clusters are derived from the m connected components of the graph G. Finally, the remaining unlabeled samples are assigned to the cluster of the core sample closest to them.
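A hedged sketch of the clustering procedure just described follows, assuming a precomputed array dens of density estimates (for example from the xForest sketch above); the use of scipy's connected_components and the brute-force distance matrices are implementation assumptions.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def xforest_cluster(X, dens, lam, eps):
    # Core samples (dens >= lam), an eps-radius graph over them, then nearest-core assignment.
    core = np.flatnonzero(dens >= lam)
    D = np.linalg.norm(X[core, None, :] - X[None, core, :], axis=-1)
    _, comp = connected_components(csr_matrix(D <= eps), directed=False)
    labels = np.full(len(X), -1)
    labels[core] = comp
    rest = np.flatnonzero(labels == -1)
    if core.size and rest.size:
        nearest = np.argmin(np.linalg.norm(X[rest, None, :] - X[None, core, :], axis=-1), axis=1)
        labels[rest] = comp[nearest]                 # attach each remaining point to its closest core sample
    return labels

# usage sketch: labels = xforest_cluster(np.random.randn(300, 2), np.random.rand(300), lam=0.4, eps=0.5)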
The runtime complexity of xForest splits into two parts: the runtime of the density estimation and the runtime of the cluster generation. For the density estimation, the runtime of each point in each tree depends largely on the depth of the tree. The average runtime complexity of xForest is O(T·d·n·log n), where the number of trees T can be taken as a constant compared with n, and its effect can be further reduced by parallel computation. However, when the xTree happens to be very unbalanced, the worst-case runtime complexity becomes O(T·d·n²). For the cluster generation, xForest is similar to DBSCAN, except that the core samples are defined by the estimated densities; the corresponding runtime complexity (with Euclidean distance) is O(d·n²). Thus, the worst-case runtime complexity of xForest is O(T·d·n²) + O(d·n²). Recalling that T is small compared with n and d, the worst-case runtime complexity is effectively O(d·n²), the same as that of the original DBSCAN.
Nevertheless, we note that xForest achieves higher accuracy than DBSCAN at the same runtime complexity, because it produces a more accurate density estimate and therefore more accurate core points. For DBSCAN, although not stated explicitly, the core points are also defined by estimating the density, namely by counting the number of neighbors within a certain radius; this estimate is rather coarse, discontinuous, and can be very sensitive to the chosen ε. In addition, the parameter ε is also used to search for connected components, and the optimal ε for that task may not match the optimal ε for density estimation. In contrast, xForest employs a more accurate density estimation procedure, the estimated density function is asymptotically smooth, and the split boundaries are smoothed by the set of trees, so it performs better on clustering tasks where the underlying density function is smooth. Moreover, xForest can obtain good local adaptivity by using the minimum sample split (min_samples_split) found in common python packages, so the local properties of the sample can be taken into account to fit more complex data structures. In this way the parameter ε is only used when finding the connected components, so its optimal value can be obtained easily.
In addition, just as DBSCAN implementations can use an R*-tree under certain conditions, our xForest is also executed by means of a tree structure, which indicates that efficient acceleration techniques for DBSCAN, such as subsampling, can likewise be transplanted into xForest. Since random forests are naturally compatible with subsampling, we can accelerate both stages, identifying core-samples and finding connected components, by subsampling, so that xForest finally reaches an accelerated runtime complexity of O(d·n·n'), where n' is the subsampling size.
The manner in which an extreme Random tree (xForest) is used for outlier detection is described below.
We define anomalies as being characterized by the degree of aggregation, which can be described by the density f. For a fixed threshold ρ > 0, the ρ-level set {f > ρ} represents a region of high aggregation, while {f ≤ ρ} is regarded as the region of low aggregation in which abnormal values lie. Our goal is therefore to estimate the set {f ≤ ρ} in order to detect anomalies among all samples, or equivalently to estimate the ρ-level set {f > ρ}. Using the xForest density estimator f̂, we construct the level-set estimate Sρ := {f̂ > ρ}, and the xForest algorithm for density-based anomaly detection is presented in the following algorithm.
Density-based outlier detection algorithm: xForest algorithm
Input: training set D := {x1, ..., xn};
the number p of splits used by each random partition;
the number T of trees in the random forest;
the density threshold ρ used for the decision.
Loop for t from 1 to T:
construct a pure random partition Zt,p of the feature space;
construct the corresponding density estimate with the xTree approach.
End the loop.
Integrate the T density estimation trees with equal weights to obtain the xForest density estimate (the average of the T tree estimates).
Output:
abnormal values: {xi : f̂(xi) ≤ ρ}.
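The anomaly-detection step itself reduces to thresholding the estimated density; a minimal sketch, assuming a callable density estimator such as the xForest sketch above:

import numpy as np

def detect_anomalies(X, density_fn, rho):
    # Flag the samples whose estimated density is at or below the threshold rho.
    dens = np.array([density_fn(x) for x in X])
    return np.flatnonzero(dens <= rho)               # indices of the abnormal samples

# usage sketch: detect_anomalies(X, f_hat, rho=0.01), with f_hat from the xForest sketch above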
Fig. 1d shows a block diagram of the probability density anomaly detection method based on the K-nearest-neighbor and histogram transformation partitioning Bagging algorithm according to another embodiment of the present invention.
The probability density anomaly detection method based on the K-nearest-neighbor and histogram transformation partitioning Bagging algorithm applies the histogram transformation partitioning Bagging algorithm to K-nearest-neighbor density estimation, improving the accuracy of the probability-density-based anomaly detection model, and is a specific application of the corresponding density estimation method. In general, for an unlabeled sample data set D = {x1, ..., xn}, an estimate of the unknown distribution of the sample data source is first obtained with the density estimation algorithm based on the K-nearest-neighbor and histogram transformation partitioning Bagging algorithm; on the basis of the estimated density function f̂, the points whose probability density is smaller than the set probability density parameter ρ are determined to belong to the abnormal samples, i.e. the set {x : f̂(x) < ρ}.
Referring to FIG. 1d, in step 1d01, the training data D = {x1, ..., xn} and the density boundary parameter ρ are input. In step 1d03, the estimate f̂ of the density function of the unknown distribution is calculated using the density estimation of the K-nearest-neighbor and histogram transformation partitioning Bagging (bagged) algorithm, obtaining an estimate of the density at each location. Next, in step 1d05, the set of sample points whose estimated probability density is smaller than the density boundary, i.e. the abnormal sample point set {xi : f̂(xi) < ρ}, is output.
Let P be a probability distribution on Rd with density f. For any x ∈ Rd and r > 0, we write Br(x) := B(x, r) := {x' ∈ Rd : ||x' - x||2 ≤ r} for the closed ball of radius r centered at x. If xn ≤ c·yn holds for some constant c > 0 and all n ∈ N*, we write xn ≲ yn.
K-nearest-neighbor (k-NN) method for density estimation

For any x ∈ Rd and a given set of independent samples Dn := {X1, ..., Xn} generated from the probability distribution P, we reorder the samples according to their increasing distance to x as D(n) := {X(1), ..., X(n)}. We then have ||X(1)(x) - x|| ≤ ... ≤ ||X(n)(x) - x||. Let Rk(x; D) denote the distance between the point x and its k-th nearest neighbor in the data set D; in particular, when D = Dn we write Rk(x; Dn) =: Rk(x). Furthermore, letting μ be the Lebesgue measure, the Lebesgue differentiation theorem gives

f(x) = lim as r → 0 of P(Br(x)) / μ(Br(x)),                (1)

which holds for almost all x. Taking r = Rk(x) in (1) and estimating P(Br(x)) by the empirical frequency k/n, we obtain the k-NN density estimate

f̂k(x) := k / ( n · μ(B(x, Rk(x))) ),

where μ(B(x, Rk(x))) is the volume of the ball of radius Rk(x) centered at x.
Bagged nearest-neighbor (BNN) method for density estimation

To improve the efficiency and accuracy of the original k-NN density estimator, we use the bagging technique: from Dn we draw, without replacement, B subsampled data sets Db, b = 1, ..., B, each of size #(Db) = m. We then integrate the B density estimators f̂k(·; Db) to obtain the BNN density estimate

f̂B(x) := (1/B) Σ_{b=1}^{B} f̂k(x; Db).                      (3)
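Below is a hedged sketch of the k-NN estimate and its bagged version (3); the closed-form unit-ball volume and the default subsample size are standard conventions assumed here, not taken from the embodiment.

import numpy as np
from math import gamma, pi

def knn_density(x, data, k):
    # k-NN density estimate: k / (n * volume of the ball reaching the k-th neighbour).
    n, d = data.shape
    r_k = max(np.sort(np.linalg.norm(data - x, axis=1))[k - 1], 1e-12)
    unit_ball = pi ** (d / 2) / gamma(d / 2 + 1)     # Lebesgue volume of the unit ball in R^d
    return k / (n * unit_ball * r_k ** d)

def bnn_density(x, data, k=5, B=20, m=None, seed=0):
    # Bagged k-NN: average the k-NN estimate over B subsamples of size m drawn without replacement.
    rng = np.random.default_rng(seed)
    n = len(data)
    m = m or max(k + 1, n // 2)
    subs = (data[rng.choice(n, size=m, replace=False)] for _ in range(B))
    return float(np.mean([knn_density(x, Db, k) for Db in subs]))

# usage sketch: bnn_density(np.zeros(2), np.random.randn(500, 2))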
Bagged Near Neighbor (BNN) for outlier detection
We define anomalies as being characterized by the degree of aggregation, which can be described by the density f. For a fixed threshold ρ > 0, the ρ-level set {f > ρ} represents a region of high aggregation, while {f ≤ ρ} is regarded as the region of low aggregation in which abnormal values lie. Our goal is therefore to estimate the set {f ≤ ρ} in order to detect anomalies among all samples, or equivalently to estimate the ρ-level set {f > ρ}. Using the BNN density estimator in (3), we construct the level-set estimate Sρ := {f̂B > ρ}, and a BNN algorithm for density-based anomaly detection is given below.
Density-based outlier detection algorithm: BNN
Input: training set D := {x1, ..., xn}; the density threshold parameter ρ; the neighbor parameter k; the number B of subsample sets and the number m of samples in each subsample set.
For b = 1, ..., B: sample m points from D without replacement to form the subsample set Db.
Compute the BNN density estimate f̂B according to equation (3) above.
Abnormal values: {xi : f̂B(xi) ≤ ρ}.
fig. 1e shows a random forest anomaly detection model based on an auto-supervised method according to yet another embodiment of the present invention.
The random forest anomaly detection model based on the self-supervised method is a fast and accurate anomaly detection algorithm that combines a framework used to enhance the information acquisition capability in self-supervised learning tasks with a random forest classifier. Specifically, for the feature space in which the data lie, random rotation mappings are constructed to preprocess the data and improve the utilization of the data information, and the applied rotation is attached to the original data as a label, forming new (data, rotation) data pairs. Secondly, through this data construction the original unsupervised anomaly detection task is converted into a supervised classification task, and a random forest model based on classification trees is trained with the rotation label as the target. Finally, based on the theoretical observation that the lower the classification accuracy of the self-supervised classifier on a given sample, the more likely that sample is to be abnormal, the method evaluates the overall prediction accuracy of each sample over the rotation directions and gives the final anomaly detection result.
Referring to fig. 1e, at step 1e01, the inputs are:
the training data D = {x1, ..., xn}, in which all the data are normal samples;
the prediction data set T containing the samples to be detected, in which the abnormal state of the sample points is unknown;
the set of feature-space rotation mappings {R1, ..., RK}; and
the parameter N, the number of abnormal samples to report.
In step 1e03:
1. A rotation mapping is applied to the feature space of each data set, giving K new feature spaces and K corresponding data sets; the applied spatial rotation mapping is added to the corresponding data set as a label, yielding two groups of new (augmented, self-labelled) data sets for training and prediction.
2. A random forest model based on classification trees is used to learn the augmented training data set, obtaining a model M; that is, the training data are used to train a random forest model for classification.
3. The model M is used to predict on the augmented prediction data set TS, and the prediction accuracy of each sample Yi is calculated.
In step 1e05, the output is the N samples to be detected with the lowest prediction accuracy.
sForest is trained only on normal data D := (X1, ..., Xn), where each Xi is an observation drawn independently and identically distributed from P with the same distribution as X. The construction of sForest first generates random rotations of the input data, denoted Rm, m = 1, ..., M. The corresponding augmented data with self-attached labels is written as {(Rm(Xi), Rm)}, where R0 denotes the identity transformation, so that {(R0(Xi), R0)} represents the original data set and simplifies the notation. We then train a random forest classifier on the self-labelled data. In the testing phase, we first rotate the test samples by the pre-generated Rm, m = 1, ..., M, test the rotated samples with the pre-trained forest classifier, and finally identify as anomalies the samples with lower test accuracy.
For two-dimensional image data, self-supervised learning algorithms usually rotate the training data by a fixed set of angles such as {0, 90, 180, 270} and attach the corresponding labels. However, for structured data with higher-dimensional features, these 4 basic rotations gradually show their limitations. We therefore propose to self-label by random rotation, since random rotations provide many more potential rotations, each with the same chance of being selected.
The details are as follows.
Given Rd, a proper rotation matrix R is a real-valued d × d orthogonal matrix with unit determinant, i.e.

RT = R^(-1), |R| = 1.

The set of all such matrices forms the special orthogonal group, which we denote SOrth(d), a subgroup of the orthogonal group Orth(d); the latter also includes the so-called improper rotations involving reflection (determinant equal to -1). More specifically, a matrix in SOrth(d) has determinant |R| = 1, while a matrix in Orth(d) has determinant |R| ∈ {-1, 1}.
To perform random rotation, we must sample uniformly over all possible rotations in SOrth(d). It is worth emphasizing that randomly rotating each angle in spherical coordinates does not result in a uniform distribution over all rotations when d > 2, which means that some rotations would be more likely to occur than others. Instead of such simple rotations, we use "true" uniformly random rotations. Furthermore, since rotation is neither necessary nor clearly defined for categorical variables, random rotation is not applied to those features.
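A common construction for drawing such "true" uniform rotations is the QR decomposition of a Gaussian matrix with a sign correction; the sketch below is an assumed illustration of that construction, not text from the embodiment.

import numpy as np

def uniform_rotation(d, rng=None):
    # Draw a rotation uniformly from SOrth(d) via QR of a Gaussian matrix.
    rng = rng or np.random.default_rng()
    Q, R = np.linalg.qr(rng.normal(size=(d, d)))
    Q = Q * np.sign(np.diag(R))        # sign fix so that Q is Haar-distributed on Orth(d)
    if np.linalg.det(Q) < 0:           # flip one column to obtain determinant +1 (a proper rotation)
        Q[:, 0] = -Q[:, 0]
    return Q

# usage sketch: R = uniform_rotation(8); np.allclose(R @ R.T, np.eye(8))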
We write the augmented (self-labelled) normal data set as

DA := {(Rm(Xi), Rm)}, i ∈ {1, ..., N}, m ∈ {0, ..., M},

where R0 denotes the identity transformation, so that R0(Xi), i = 1, ..., N, represents the original training data.
Now we construct our sForest based on a random forest classifier, which starts by drawing B buckets of the augmented normal samples with the bootstrap method, the b-th sample bucket being denoted Db. A decision tree is then constructed on each bucket, dividing the prediction space into non-overlapping regions Aj, j = 1, ..., J, where J denotes the total number of terminal nodes of the decision tree. For the region Aj containing Nj observations and for (R(Xi), Yi) ∈ Db, we record the proportion of observations at node Aj whose class label equals Rm, where the random parameter R denotes the random rotation and 1(·) is the indicator function used in the counting; Yi denotes the rotation label corresponding to Xi.
We then take this vector of per-node class proportions as the output of the b-th self-supervised tree. Note that it is a probability vector, because its entries sum to 1. The self-supervised forest classifier can then be represented in terms of the outputs of all the trees, and intuitively the m-th component of the combined vector can be regarded as the probability that the sample is classified as class Rm.
In some previous studies it has been observed that outliers tend to obtain a lower probability of belonging to their predicted labels; the intuition is that class-specific features must be captured when the classifier is trained to distinguish the self-labelled data. We now construct the anomaly criterion of our sForest classifier accordingly.
First, we write the test data set as Dt, which includes both normal and abnormal samples. We use Rm, m = 1, ..., M, to generate the augmented test data DA,t, where R0 again denotes the identity transformation for notational simplicity. The well-trained sForest classifier is then applied to DA,t to produce the output vectors; for each test sample Xi these can simply be arranged as a matrix whose rows are indexed by the applied rotation and whose columns by the predicted class.
Note that the diagonal elements of this matrix are the probabilities that each self-labelled rotation R(Xi) is correctly classified. Therefore, in order to capture abnormal points with low classification accuracy, we define the normality score as the sum of these diagonal probabilities. The normality score describes how normal the tested sample is, and we regard the samples with the lowest normality scores as abnormal.
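A hedged end-to-end sketch of the sForest idea using scikit-learn's RandomForestClassifier as the classification-tree forest is given below; the number of rotations, the tree count, and the QR-based rotations are illustrative assumptions rather than the patented configuration.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_sforest(X_train, M=8, seed=0):
    # Train a classifier to predict which random rotation was applied to a (normal) sample.
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    rotations = [np.eye(d)] + [np.linalg.qr(rng.normal(size=(d, d)))[0] for _ in range(M)]
    Xa = np.vstack([X_train @ R.T for R in rotations])            # augmented, self-labelled data
    ya = np.concatenate([np.full(len(X_train), m) for m in range(len(rotations))])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(Xa, ya)
    return clf, rotations

def normality_score(X_test, clf, rotations):
    # Sum over rotations of the probability that the rotated sample is classified correctly.
    score = np.zeros(len(X_test))
    for m, R in enumerate(rotations):
        proba = clf.predict_proba(X_test @ R.T)                   # columns follow clf.classes_
        score += proba[:, list(clf.classes_).index(m)]
    return score                                                  # the lowest scores flag anomalies

# usage sketch: clf, rots = fit_sforest(np.random.randn(500, 6)); s = normality_score(np.random.randn(50, 6), clf, rots)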
Application example of large data density estimation of the invention
The density estimation problem is one of basic problems of probability statistics and is an important research direction of unsupervised learning in statistical machine learning, and plays a key role in intermediate links of a plurality of statistical machine learning tasks. First, density estimation has direct application value in that it can exploit the obtained data density to reveal the essential features and internal structure of the data. For example, in the traffic field, uncertainty in vehicle trajectory prediction is studied by density estimation. Secondly, the density estimation is used as a basic statistical machine learning task, and higher learning tasks such as clustering and anomaly detection can be better solved. For example, density-based clustering is widely used for marketing research, image segmentation, indoor positioning based on WLAN data, etc. because it can find clusters of arbitrary shape and size; density-based anomaly detection methods are often used to address the increasingly severe network intrusion problems caused by globalization of information. Therefore, the algorithm for density estimation and the feasibility theory research thereof have important scientific value not only in the field of statistical machine learning, but also in other fields such as market economy, industrial engineering and the like.
As an example, the invention can be applied to the classification problem concerning the quality of radar echoes. The method uses the Ionosphere radar data set (http://archive.ics.uci.edu/ml/datasets/Ionosphere), and the main task is to estimate, from the radar pulse features in the data set, whether radar echoes show certain structural features in the ionosphere. The data set contains 351 observations, each of which has 34 attributes. The radar data were collected by a radar system in Goose Bay, Labrador, which consists of a phased array of 16 high-frequency antennas with a total transmit power of about 6.4 kW. The examples in this database are function values generated by processing 34 complex electromagnetic signals; specifically, the Goose Bay system receives 17 pulses per experiment, each pulse being described by 2 attributes, for a total of 34 features.
The implementation of the algorithm is illustrated with the integrated density estimation based on adaptive pure random partitioning and on adaptive histogram transformation partitioning. In the experiment, the PCA (principal component analysis) algorithm is first used to reduce the dimensionality of the sample space spanned by the 34 attributes of the test data; then multiple adaptive random divisions are generated on the main attributes after dimensionality reduction to obtain several overall density estimation models, which are averaged to obtain the integrated density estimation model. In the specific experimental setting, for the integrated density estimation model based on adaptive histogram transformation partitioning, 100 histogram transformation divisions are generated at random and integrated by taking the average, the parameter "minimum number of sample points m in each division" is taken from {1, 3, 10, 20, 40}, 30% of the training data is randomly selected as validation data, and the optimal parameter that minimizes the average negative log-likelihood (ANLL) is selected. For the integrated density estimation model based on adaptive pure random partitioning, the number of divisions in the integration and the minimum number of sample points in each division are set in the same way as for the adaptive histogram transformation partitioning.
In a comparison with the prior art on classifying structural features in the ionosphere using the Ionosphere radar data set, the integrated density estimation model based on adaptive pure random partitioning achieves higher prediction accuracy: the negative log-likelihood reaches 0.06, far lower than the 24.36 of the Gaussian kernel density estimation method and the 26.20 of simple histogram density estimation. For dimensions [3, 10, 16, 22], the negative log-likelihood of the integrated density prediction model based on adaptive histogram transform partitioning is [-1.78, -11.28, -18.95, -25.35]; relative to the simple histogram density estimates of [-0.63, -4.9, -8.69, -10.64], this is a reduction of [-1.15, -6.38, -10.26, -14.71] in absolute value, or [54.78%, 76.80%, 84.69%, 72.33%] in relative terms. Relative to the Gaussian kernel density estimates of [-1.50, -7.87, -13.22, -18.78], the reduction is [-0.28, -3.41, -5.73, -6.57] in absolute value, or [18.67%, 43.32%, 43.34%, 34.98%] in relative terms.
The method fully utilizes the randomness of self-adaptive division and the advantages of integrated learning, solves the problem of discontinuous histogram density estimation, and improves the precision of density estimation; the invention can be well combined with parallel computation, not only can adopt a CPU (Central processing Unit) processor, but also can be combined with a GPU (graphics processing Unit) processor, thereby greatly saving the running time, improving the algorithm efficiency, and even processing data with huge data volume and ultrahigh dimensionality.
Besides the processing and analysis of radar data, the invention can also be applied to other density estimation tasks, such as Chinese character recognition in image recognition, dynamic video segmentation, extreme value recognition in an intelligent traffic system, density estimation of streaming data, high-density communication network protocol optimization and the like.
On the other hand, the big data regression analysis of the present invention is mainly composed of two parts: performing multiple adaptive random divisions, and building and integrating the local and overall regression models under the different divisions. The method first generates multiple adaptive random divisions; in each division, a local regression model is constructed using the samples in each division grid, and the local regression models are spliced together to obtain the overall regression model under that random division; finally, the overall regression models under the multiple divisions are integrated. The adaptive random division can adopt adaptive pure random division, adaptive histogram transformation division, random adaptive polygon division, and the like; the local regression model can adopt support vector machine regression (SVR) or a local average method; the integration method of the models can adopt a simple average method, a weighted average method, or the like. The big data regression analysis of the present invention may be embodied as a method or an apparatus.
The big data regression analysis comprises two steps, firstly, carrying out multiple self-adaptive random division on a characteristic space, obtaining a regression model on each division grid during each division, and splicing to obtain an integral regression model; and secondly, integrating all the integral regression models to obtain an integrated model.
Fig. 19 shows a flowchart of a large-scale regression method based on adaptive random partitioning and model integration according to a fifth embodiment of the present invention.
At step 1901, the large data set D requiring training is input. In this embodiment, the number of adaptive random division spaces is T, so T space division operations are required, and the randomly generated adaptive space divisions are completed in a loop from 1 to T. That is, in step 1903, the division counter t is initialized to 1; in step 1905, it is determined whether t is smaller than T. Each division space generates a "tree": if the determination in step 1905 is yes, then in step 1907 an adaptive division of the sample space is randomly generated for the tree, a local regression model is obtained in each division grid, and the local regression models in the grids are pieced together to obtain the t-th overall regression model. Then, in step 1909, t is incremented by 1, and the process returns to step 1905.
If the determination at step 1905 is NO, then at step 1911, the T whole regression models are integrated. At step 1913, the integrated regression model is output.
The adaptive random partitioning in step 1907 may be adaptive pure random partitioning, adaptive histogram transform partitioning, random adaptive polygon partitioning, or the like. The adaptive pure random division and the adaptive histogram transformation division are described in detail with reference to fig. 14 and 15. Wherein the random adaptive polygon partitioning is described in detail with reference to fig. 20.
After the adaptive random divisions are generated, a common regression algorithm is called in each division grid to obtain a local regression model, and the local regression models are spliced into an overall regression model. The following two common regression algorithms are typically used: the local average method and support vector machine regression.
Local averaging method: the local average method is that the average value of the dependent variables of the samples in each divided grid is used as a regression result, and the method is the most intuitive and simple regression model and is mainly suitable for regression of samples with lower data dimensionality and more discrete numerical values. Fig. 21 shows a support vector machine employed in step 1911 of the large scale regression method based on adaptive random partitioning and model integration according to the fifth embodiment of the present invention. FIG. 22 shows the support vector machine regression employed in step 1911 of the large scale regression method based on adaptive random partitioning and model integration according to the fifth embodiment of the present invention.
Support vector machine regression (SVR): a Support Vector Machine (SVM) is a machine learning algorithm suitable for classification tasks, which finds a classification surface in a linear separable case (i.e., where there is a high-dimensional plane separating two types of sample points) such that the minimum distance from the two types of sample points to the classification surface is maximized. The sample points closest to the optimal classification surface are called "support vectors" (see fig. 21 (left)), which can determine the optimal classification surface without requiring all samples. In the linear inseparable case, the sample data may be mapped into a higher dimensional space such that the data is linearly separable in this high dimensional space. By the 'kernel method', the model can be calculated and the result can be obtained without actually finding the concrete expression of mapping.
Support vector machine regression (SVR) is a generalization of the support vector machine, and solves the regression task using similar ideas, and obtains the regression result by measuring the distance from the sample to the hyperplane (as shown in fig. 22). The regression of the support vector machine is mainly suitable for regression of samples with low data dimensionality and good numerical continuity.
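As a hedged illustration of how a local SVR can be fitted inside each division grid and spliced into an overall model, the sketch below assumes that the grid labels come from any of the adaptive partition sketches above and uses scikit-learn's SVR; the min_pts fallback to a local average is an assumed safeguard for very small grids.

import numpy as np
from sklearn.svm import SVR

def fit_local_svr(X, y, labels, min_pts=10):
    # Fit one SVR per division grid; very small grids fall back to the local average.
    models = {}
    for cell in np.unique(labels):
        idx = labels == cell
        if idx.sum() >= min_pts:
            models[cell] = SVR(kernel="rbf").fit(X[idx], y[idx])
        else:
            models[cell] = float(y[idx].mean())
    return models

def predict_local(x, cell, models):
    # Splice the local models: delegate to the grid's SVR or return its stored average.
    m = models[cell]
    return float(m.predict(x.reshape(1, -1))[0]) if hasattr(m, "predict") else m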
Fig. 19 ' (a), 19 ' (b), and 19 ' (c) illustrate specific examples of a large-scale regression method based on adaptive random partitioning and model integration according to a fifth embodiment of the present invention.
Fig. 19 ' (a), 19 ' (b), and 19 ' (c) show the cases of the integrated regression model based on 1, 2, and 3 random divisions, respectively, in order to compare the continuity and prediction accuracy of the model. FIG. 19' (a) shows a regression model based on a single stochastic partition, with poor prediction accuracy and model continuity;
FIG. 19' (b) shows that two random partitions are generated, and the discontinuous regression models obtained from the two partitions are subjected to average integration, so that the continuity of the integrated model is improved compared with that of the regression model based on a single partition; fig. 19' (c) shows an ensemble learning model based on cubic stochastic partitioning, and the prediction accuracy and model continuity are further improved. If the number of times of random division is further increased, the integrated model gradually tends to be continuous until the problem that the regression model is discontinuous on the division boundary is solved, and satisfactory regression prediction precision is achieved.
Fig. 20 shows a flowchart of the adaptive polygon partition method adopted in step 1907 of the large-scale regression method based on adaptive random partition and model integration according to the fifth embodiment of the present invention.
Referring to fig. 20, in step 2001, the data set D to be trained is input, the number of control points is set to m, and the adaptive polygon division operation is completed in a loop from 1 to m. That is, in step 2003, the counter i of extracted sample points is initialized to 1; in step 2005, it is determined whether i is smaller than m, and if the determination in step 2005 is yes, then in step 2007 one sample point is extracted with equal probability from the training data not yet extracted and used as a control point. Then, in step 2009, i is increased by 1, and the process returns to step 2005.
If the determination at step 2005 is negative, then at step 2011, the adaptive random polygon partitioning result is output.
With regard to model integration, each time the adaptive random partitioning is performed, a corresponding integral regression model is obtained, and a plurality of integral regression models can be integrated by adopting a plurality of integration methods. The most common is to take the average of all the whole regression models as the integrated model. Other methods, such as weighted average, are to learn the weights of the regression models of different partitions in the integrated model and then integrate the models.
In addition to the integrated model obtained by Parallel integration (Parallel Ensemble), a Sequential Ensemble (Sequential Ensemble), that is, a Boosting Algorithm (Boosting Algorithm), may be used to perform multiple iterations under fixed division to generate an integrated model. For example, the histogram transformation partition may be used as a spatial partition method, the rotation, stretch, and translation transformations of the data input space may be randomly generated, and the histogram partition method may be used in the transformed data space. In each divided grid, local regression estimation is performed by using an averaging method, and then all local regression models are combined into an overall regression model. Therefore, a lifting algorithm is introduced, and the residual value of each sample point under the last regression model is firstly calculated as a new target value. After obtaining a new sample, randomly generating rotation, stretching and translation transformation of the space again, dividing the transformed space again by using a histogram division method, and estimating each grid by using simple average estimation. After all local regression models are combined, a second global regression model is obtained. And taking the residual value of each sample point under the second regression model as the training target value again, and circularly executing the steps. Since our regression model always estimates the residual error of the last model, starting from the second regression model, the overall regression model of the whole algorithm is the sum of the regression models of each time.
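A hedged sketch of the sequential (boosting) variant just described follows: in each round a fresh random histogram-transform partition is generated, the grid means of the current residuals form the local models, and the overall model is the sum of all rounds; the transform parameters and the zero fallback for unseen grids are assumptions.

import numpy as np

def fit_boosted_regressor(X, y, T=20, seed=0):
    # Each round: random histogram-transform grids, then grid means fitted to the current residuals.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    residual, stages = y.astype(float), []
    for _ in range(T):
        R = np.linalg.qr(rng.normal(size=(d, d)))[0]
        s, b = rng.uniform(1.0, 3.0, size=d), rng.uniform(size=d)
        A = R @ np.diag(s)
        keys = [tuple(k) for k in np.floor(X @ A.T + b).astype(int)]
        groups = {}
        for key, r in zip(keys, residual):
            groups.setdefault(key, []).append(r)
        means = {key: float(np.mean(v)) for key, v in groups.items()}
        stages.append((A, b, means))
        residual = residual - np.array([means[key] for key in keys])   # the next round fits what is left
    def predict(x):
        # The overall regression model is the sum of the per-round local models.
        return sum(means.get(tuple(np.floor(A @ x + b).astype(int)), 0.0) for A, b, means in stages)
    return predict

# usage sketch: g = fit_boosted_regressor(np.random.randn(800, 2), np.random.randn(800)); g(np.zeros(2))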
Fig. 23 is a flowchart illustrating a large-scale regression method based on histogram transform partitioning and a boosting algorithm according to a sixth embodiment of the present invention.
At step 2301, a large data set D that needs to be trained is input. In this embodiment, the total number of iterations of the boosting algorithm is set to m, so that m iterations are required, and the operation of adding up the regression models of all iterations to obtain the overall regression model is completed in a loop from 1 to m. That is, in step 2303, the iteration counter i of the boosting algorithm is initialized to 1; in step 2305, it is determined whether i is smaller than m, and if the determination at step 2305 is yes, then in step 2307, the space is partitioned using the random histogram transform partition method, a local estimation model is obtained in each partition cell using the average estimate, the local models are combined into an overall regression model, and the residual of each sample point is calculated and used as the new target value to form a new data pair. Then, in step 2309, i is incremented by 1, and the process returns to step 2305.
If the determination at step 2305 is negative, then at step 2311, the regression models of all iterations are summed to obtain the overall regression model. At step 2313, the regression model based on the boosting algorithm is output.
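The loop of fig. 23 can be sketched as follows. Here `partition_fn` stands in for the random histogram transform partition described above, and starting the residuals at y itself (i.e., an initial model of zero) is an assumption made for illustration.

```python
import numpy as np

def boosted_histogram_regression(X, y, m, partition_fn, rng=None):
    """Sketch of steps 2301-2313: m boosting iterations, each fitting a
    piecewise-constant model to the current residuals on a fresh random
    histogram-transform partition, then summing the per-iteration models."""
    rng = np.random.default_rng(rng)
    residual = y.copy()            # assumed initial model f_0 = 0
    stages = []                    # per-iteration (cell labels, cell means)
    for _ in range(m):             # steps 2305-2309
        cells = partition_fn(X, rng)                       # step 2307: random partition
        means = {c: residual[cells == c].mean() for c in np.unique(cells)}
        fitted = np.array([means[c] for c in cells])
        stages.append((cells, means))
        residual = residual - fitted                        # new target values
    return stages                  # step 2311: overall model = sum of the stages
```

Only the training-time flow is shown; predicting at new points would additionally require storing each iteration's partition so that the per-cell means can be looked up and summed.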
Fig. 24 is a block diagram illustrating a large-scale regression apparatus based on adaptive random partitioning and model integration according to a seventh embodiment of the present invention.
The large-scale regression device according to the seventh embodiment of the present invention implements the large-scale regression method according to the fifth embodiment described with reference to fig. 19.
Referring to fig. 24, the large-scale regression apparatus according to the seventh embodiment of the present invention includes a data input module 2401, a sample adaptive partitioning and regression model calculation module 2403, and a regression model integration module 2405.
At data input module 2401, a large data set D that needs to be trained is input. In this embodiment, the number of adaptive random space partitions is set to T, so T space partition operations are required, and the randomly generated adaptive space partition operation is completed in a loop from 1 to T.
In the sample adaptive partitioning and regression model calculation module 2403, an adaptive partition of the sample space is randomly generated each time, a local regression model is obtained in each partition cell, and the local regression models of all cells are spliced together to obtain the overall regression model of that partition.
In the regression model integration module 2405, the T integral regression models are integrated, and the integrated regression model is output.
The adaptive random partition may adopt adaptive pure random partitioning, adaptive histogram transform partitioning, random adaptive polygon partitioning, and the like. The adaptive pure random partitioning and the adaptive histogram transform partitioning are described in detail with reference to figs. 14 and 15, and the random adaptive polygon partitioning is described in detail with reference to fig. 20.
After the adaptive random partition is generated, a common regression algorithm is called in each partition cell to obtain a local regression model, and the local regression models are spliced into an overall regression model. Two common choices are the local mean and support vector machine regression, as sketched below.
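A sketch of how local regression models (here, scikit-learn's SVR, or a simple local mean) might be fitted inside each partition cell and spliced into one overall model. The `cell_of` lookup is an assumed abstraction standing in for whichever adaptive random partition was generated; it maps a single sample to an integer cell index.

```python
import numpy as np
from sklearn.svm import SVR

class PiecewiseRegressor:
    """Fits one local model per partition cell and splices them together."""

    def __init__(self, cell_of, use_svr=True, C=1.0, gamma="scale"):
        self.cell_of = cell_of          # maps a sample vector to its integer cell index
        self.use_svr = use_svr
        self.C, self.gamma = C, gamma
        self.local_models = {}

    def fit(self, X, y):
        cells = np.array([self.cell_of(x) for x in X])
        for c in np.unique(cells):
            Xc, yc = X[cells == c], y[cells == c]
            if self.use_svr and len(yc) > 1:
                self.local_models[c] = SVR(C=self.C, gamma=self.gamma).fit(Xc, yc)
            else:                        # local mean fallback for tiny cells
                self.local_models[c] = float(yc.mean())
        return self

    def predict(self, X):
        # Assumes every test point falls in a cell seen during training.
        out = np.empty(len(X))
        for i, x in enumerate(X):
            m = self.local_models[self.cell_of(x)]
            out[i] = m if isinstance(m, float) else m.predict(x[None, :])[0]
        return out
```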
Fig. 25 shows a block diagram of a large-scale regression device based on a histogram transformation partitioning and boosting algorithm according to an eighth embodiment of the present invention.
The large-scale regression device according to the eighth embodiment of the present invention implements the large-scale regression method according to the sixth embodiment described with reference to fig. 23.
Referring to fig. 25, the large-scale regression apparatus according to the eighth embodiment of the present invention includes a data input module 2501, a sample adaptive partitioning and regression model calculation module 2503, and a regression model integration module 2505.
The regression task is to predict the value of an unobserved output variable Y based on the observed input variable X. More precisely, we need to train a predictor f that maps observed input values of X to the unobserved output variable Y in the form f(X). Throughout, we assume that $\mathcal{X} \subset \mathbb{R}^d$ is non-empty, that $\mathcal{Y} := [-M, M]$ for some $M > 0$, and that $P_X$ is the marginal distribution of X. For an arbitrary fixed $R > 0$, we denote by $B_R$ the hypercube of size 2R centered at the origin, that is,

$$B_R := [-R, R]^d.$$
the extreme random tree is explained below
Mathematically speaking, let the random variable Z be in space
Figure BDA0002533678600000397
The probability measure of the partition criterion of a tree of middle value is PzAnd (4) showing. Since the tree is cut through the space
Figure BDA0002533678600000392
Therefore we will be at BRThe node created by the p cuts above is represented as
Figure BDA0002533678600000393
Wherein A isjRepresenting the jth node. We further represent the tree as
Figure BDA0002533678600000394
I.e. the set of all leaf nodes.
[2]The proposed extreme random tree segments root nodes (feature spaces) by randomly selecting nodes to be segmented, dimensions to be segmented and segmentation points. The tree partition of p cuts mayBy an iterative algorithm, where the i-th step (i ═ 1.. p.) can use a random vector Qi:=(Li,Ri,Si) A description will be given. First item LiIndicating the node to be switched, which is randomly selected with equal probability from the previously generated nodes. Second term RiUnif { 1. -, d } represents LiI.e. uniformly selecting R from all dimensions with equal probabilityi. Third item Si~Unif[0,1]Describing the point to be cut, using the newly generated node L after the ith cutiR of (A) to (B)iLength in dimension and LiIs expressed by the ratio of the lengths of (a) to (b). It is to be noted that,
Figure BDA00025336786000004013
and
Figure BDA00025336786000004014
independently and equally distributed.
We introduce a regularized empirical risk minimization (RERM) framework for algorithm design. Let $D := \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ be independent and identically distributed observations with the same distribution as the generic random pair (X, Y), drawn from an unknown probability measure P on $\mathcal{X} \times \mathcal{Y}$. Let

$$L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$$

be a loss function. For a measurable function $f : \mathcal{X} \to \mathbb{R}$, the risk is defined by

$$\mathcal{R}_{L,P}(f) := \int_{\mathcal{X} \times \mathcal{Y}} L(x, y, f(x)) \, dP(x, y).$$

Furthermore, the Bayes risk is given by

$$\mathcal{R}_{L,P}^{*} := \inf \{ \mathcal{R}_{L,P}(f) : f : \mathcal{X} \to \mathbb{R} \ \text{measurable} \},$$

and the corresponding minimizer $f_{L,P}^{*}$ is referred to as the Bayes decision function. The empirical risk is defined by

$$\mathcal{R}_{L,D}(f) := \frac{1}{n} \sum_{i=1}^{n} L(X_i, Y_i, f(X_i)),$$

where $D_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{(X_i, Y_i)}$ is the empirical measure associated with the data and $\delta_{(X_i, Y_i)}$ is the Dirac measure at $(X_i, Y_i)$.

Let $\Omega$ be a regularization term. The regression problem can then be formulated as

$$f_D := \operatorname*{arg\,min}_{f \in \mathcal{F}} \; \Omega(f) + \mathcal{R}_{L,D}(f),$$

where $\mathcal{F}$ denotes the underlying function space.
Loss function L. As a general framework, our approach is applicable to a variety of supervised tasks with different loss functions. Here, we use the least squares loss $L(y, f(x)) := (y - f(x))^2$ to solve the least squares regression problem.
Function space $\mathcal{F}$. The appropriate base function space $\mathcal{F}$ should be selected according to the characteristics of the underlying data set.

For low-dimensional data sets, we use our algorithm together with the base function space consisting of step functions on the extremely randomized trees, where the cutting criteria $Z_t \sim P_Z$ are drawn independently for $t = 1, \ldots, T$.

In contrast, for high-dimensional data sets, constant functions may not have sufficient representational capacity. Therefore, we introduce the reproducing kernel Hilbert space (RKHS) of a Gaussian radial basis kernel and use the joint RKHS built on the tree cells as the base function space. See section B.3.1 of the supplementary material for more details.
The regularization term Ω. The regularization term depends on the predictor of each tree.

For a forest of constant functions, we choose to penalize the number of splits p, which allows us to bound the complexity of the function space and hence obtain a finite VC dimension, making the algorithm PAC-learnable. In addition, it prevents overfitting by keeping the nodes from becoming too small. For t = 1, ..., T, the regularization term of the t-th tree penalizes its number of splits with a regularization parameter $\lambda_t$.

For a forest of kernel functions, on the one hand, since the joint expansion depends on the tree-cutting rule $Z_t$, it is likewise wise to penalize p, for reasons similar to the forest of constant functions. On the other hand, to avoid overfitting, we also require that the form of the regression function f not be too complex. Therefore, we additionally penalize the RKHS norm of f, and the regularization term of the t-th tree, t = 1, ..., T, involves two regularization parameters $\lambda_{1,t}$ and $\lambda_{2,t}$.
Embedded random forest (mForest)

As an ensemble of T mTree estimators, mForest is obtained through the RERM framework. The generation process is given by an algorithm that appears as an image in the original and is not reproduced here.

cmTree & cmForest. The t-th cmTree regression predictor $f_{D,t}$ is obtained by regularized empirical risk minimization over the step functions on the t-th tree, where $\lambda_t$ represents the regularization parameter and $p_t$ is the number of cuts of the t-th tree under cutting criterion $Z_t$. By averaging the cmTree estimators, we obtain the cmForest regression predictor

$$f_D^{\mathrm{cmForest}} := \frac{1}{T} \sum_{t=1}^{T} f_{D,t}.$$

kmTree & kmForest. The kmTree estimator is obtained analogously by regularized empirical risk minimization over the joint RKHS induced by the t-th tree, and the kmForest regressor is the average of the T kmTree estimators.
At data input module 2501, a large data set D that needs to be trained is input. In this embodiment, the total number of iterations of the boosting algorithm is set to m, so that m iterations are required, and the operation of adding up the regression models of all iterations to obtain the overall regression model is completed in a loop from 1 to m.

The iterative operation is performed m times in the sample adaptive partitioning and regression model calculation module 2503. In each iteration, a random histogram transform partition method is used to partition the space, a local estimation model is obtained in each partition cell using the average estimate, the local models are combined into an overall regression model, and the residual of each sample point is calculated and used as the new target value to form a new data pair.

In the regression model integration module 2505, the regression models of all iterations are summed to obtain the overall regression model, and the regression model based on the boosting algorithm is output.
Regression predicts the value of the unobserved output variable Y based on a data set $D := \{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn from an unknown probability measure P on $\mathcal{X} \times \mathcal{Y}$. In this context, we assume that $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} \subset \mathbb{R}$ are non-empty compact sets.

For any fixed R > 0, we denote by $B_R$ the hypercube of size 2R centered at the origin of $\mathbb{R}^d$, i.e.,

$$B_R := [-R, R]^d := \{ x = (x_1, \ldots, x_d) \in \mathbb{R}^d : x_i \in [-R, R], \ i = 1, \ldots, d \},$$

and for any $r \in (0, R)$ we write a corresponding inner hypercube (its definition appears as an equation image in the original). Recall that, for $1 \le p < \infty$ and $x = (x_1, \ldots, x_d)$, the $L_p$ norm is defined by $\|x\|_p := (|x_1|^p + \cdots + |x_d|^p)^{1/p}$, and the $L_\infty$ norm by $\|x\|_\infty := \max_{i = 1, \ldots, d} |x_i|$.

We use the notations $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ to indicate that there exist positive constants c and c' such that $a_n \le c\, b_n$ and $a_n \ge c'\, b_n$, respectively, for all $n \in \mathbb{N}$. Furthermore, for $x \in \mathbb{R}$, $\lfloor x \rfloor$ denotes the largest integer less than or equal to x. In the following, standard multi-index notation for elements of $\mathbb{R}^d$ is used frequently; the corresponding definitions appear as equation images in the original.
least squares regression
Taking into account least squares losses
Figure BDA0002533678600000432
L(x,y,f(x)):=(y-f(x))2. For measurable decision function
Figure BDA0002533678600000433
Risks are caused by
Figure BDA00025336786000004310
Definition, empirical risk is
Figure BDA0002533678600000434
To be defined. Bayesian risk is the minimum risk with respect to P and L, consisting of
Figure BDA0002533678600000435
It is given.
In the following, the values considered are in the interval [ -M, M]The predictive variable of (2) is sufficient. To this end, we introduce the concept of clipping (clipping) for the decision function, let us
Figure BDA00025336786000004311
Is the clipping value of t in R, if t < -M, the value is-M, if t is in [ -M, M]The value is t, and if t is larger than M, the value is M. The minimum square loss L is tailorable at M. After cutting, the risk is reduced, i.e.
Figure BDA0002533678600000436
Hence, in the following, we only consider the clipping of the decision function and the corresponding risk.
Histogram transform in the regression problem

To clearly describe one possible construction of the histogram transform, we introduce a random vector (R, S, b), whose components represent a rotation matrix, a stretching matrix, and a translation vector, respectively. Specifically:

R denotes the rotation matrix, a real d × d orthogonal matrix with determinant 1, i.e.,

$$R^{\top} = R^{-1}, \qquad \det(R) = 1.$$

S denotes the stretching matrix, a positive real-valued d × d diagonal scaling matrix whose diagonal elements $(s_i)_{i=1}^{d}$ are random variables. Furthermore, the bin width vector defined in the input space is given by $h := s^{-1}$, i.e., $h_i = 1/s_i$ for $i = 1, \ldots, d$.

$b \in [0, 1]^d$ is a d-dimensional translation vector.

Based on the above representation, we define the histogram transform $H : \mathbb{R}^d \to \mathbb{R}^d$ by

$$H(x) := R \cdot S \cdot x + b.$$
It is worth mentioning that we need not consider the case where the bin width $h_0$ in the transformed space differs from 1, since the same effect can be achieved by rescaling the transform matrix. Thus, writing $\lfloor H(x) \rfloor$ for the index of the transformed cell, the transformed cell is the unit bin

$$A'_{H(x)} := \{ H(x') : \lfloor H(x') \rfloor = \lfloor H(x) \rfloor \},$$

and the corresponding histogram cell in the input space containing x is

$$A_{H}(x) := \{ x' : H(x') \in A'_{H(x)} \}.$$

We further denote the set of all cells induced by H as $\pi_H := \{A_j\}_{j \in \mathcal{I}_H}$, where repeated cells are counted only once and $\mathcal{I}_H$ is an index set for H restricted to $B_R$. The collection $\{A_j\}_{j \in \mathcal{I}_H}$ thus obtained forms a partition of $B_R$. For convenience, we write $A_0$ for the remaining region $\mathbb{R}^d \setminus B_R$; then $\{A_j\}_{j \in \mathcal{I}_H \cup \{0\}}$ forms a partition of $\mathbb{R}^d$.
We now present a practical way to construct a histogram transform. First, a d × d matrix M consisting of d² independent univariate standard normal random variables is generated, and a Householder QR decomposition is then applied to obtain a factorization of the form M = R · W, where R is an orthogonal matrix and W is an upper triangular matrix with positive diagonal elements. The matrix R constructed in this way is orthogonal and follows the uniform distribution. If R does not have a positive determinant, it is not a proper rotation matrix; in that case, we can change the sign of the first column of R to construct a new rotation matrix $R^{+}$ that satisfies the condition.

We build the diagonal scaling matrix S with diagonal elements $s_i$ drawn from the Jeffreys prior, i.e., $\log(s_i)$ follows the uniform distribution on the interval $[\log \underline{s}_0, \log \overline{s}_0]$, where $\underline{s}_0$ and $\overline{s}_0$ are fixed constants. To simplify the notation, we also write the corresponding bin-width bounds $\overline{h}_0 := \underline{s}_0^{-1}$ and $\underline{h}_0 := \overline{s}_0^{-1}$. Furthermore, the translation vector b is drawn from the uniform distribution on the hypercube $[0, 1]^d$.
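A sketch of the construction just described, under the stated assumptions: R from the QR factorization of a standard normal matrix (sign-corrected to a proper rotation), diagonal scales with $\log s_i$ uniform on $[\log \underline{s}_0, \log \overline{s}_0]$, and b uniform on $[0, 1]^d$. The bin index as the component-wise floor of H(x) follows the unit-width convention above.

```python
import numpy as np

def sample_histogram_transform(d, s_min, s_max, rng=None):
    """Draw one histogram transform H(x) = R @ S @ x + b."""
    rng = np.random.default_rng(rng)
    M = rng.standard_normal((d, d))
    R, W = np.linalg.qr(M)                     # M = R @ W, W upper triangular
    R = R @ np.diag(np.sign(np.diag(W)))       # make the diagonal of W positive
    if np.linalg.det(R) < 0:                   # force a proper rotation matrix
        R[:, 0] = -R[:, 0]
    s = np.exp(rng.uniform(np.log(s_min), np.log(s_max), size=d))  # log-uniform scales
    b = rng.uniform(0.0, 1.0, size=d)          # translation vector

    def H(x):
        return R @ (s * np.asarray(x, float)) + b

    def bin_index(x):
        return tuple(np.floor(H(x)).astype(int))   # index of the transformed cell

    return H, bin_index
```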
Given a histogram transform H, the set $\pi_H = \{A_j\}_{j \in \mathcal{I}_H}$ forms a partition of $B_R$. We consider the set of functions that are piecewise constant on the cells of $\pi_H$ (its formal definition appears as an equation image in the original). To limit the complexity of this function set, we impose a penalty on the bin width $h$ of the partition $\pi_H$. Then we can obtain histogram transform regression (HTR) by regularized empirical risk minimization (RERM) over this function set; the resulting optimization problem and its regularization term appear as equation images in the original. It is worth noting that, to simplify the computation, we apply the same penalty to all dimensions instead of penalizing each element $h_1, \ldots, h_d$ separately.
$L_2$-regularized boosted histogram transform

Boosting is the task of converting multiple inaccurate weak learners into a single accurate predictor. Specifically, we define a set of functions $\mathcal{F}$ as the set of base learners; a general boosting algorithm combines functions $f_t \in \mathcal{F}$ so as to minimize the empirical loss. The final predictor can be expressed as

$$f := \sum_{t=1}^{T} \omega_t f_t,$$

where $f_t \in \mathcal{F}$ and $\omega_t \ge 0$, $t = 1, \ldots, T$, are weights. From the viewpoint of functional gradient descent in statistics, boosting can be reformulated as a stagewise optimization problem with different loss functions. In this case, gradient boosting requires computing the negative functional gradient of the loss at the current predictor and selecting a particular model from the allowed function class to update the predictor at each boosting iteration.
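For the least squares loss used here, the negative functional gradient at the current predictor is (up to a factor of 2) simply the residual, which is why each boosting step fits the residuals:

```latex
-\left.\frac{\partial L(y, s)}{\partial s}\right|_{s = f^{(t-1)}(x)}
  = -\left.\frac{\partial (y - s)^2}{\partial s}\right|_{s = f^{(t-1)}(x)}
  = 2\bigl(y - f^{(t-1)}(x)\bigr).
```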
In this work, we focus mainly on boosting algorithms with histogram transform regressors as the base learners, because they are weak predictors and computationally efficient. Before continuing, we introduce the function space of most interest to us for building our learning theory. Suppose $\{H_t\}_{t \ge 1}$ is a sequence of independent histogram transforms drawn from a certain probability measure $P_H$, with the associated piecewise-constant function sets as defined in (7). We can then define the space E of countable combinations of such functions, together with a norm $\|\cdot\|_E$; their formal definitions appear as equation images in the original. For any $f \in E$, the Cauchy–Schwarz inequality immediately yields a uniform bound on f in terms of $\|f\|_E$ (the bound also appears as an equation image in the original). In fact, $(E, \|\cdot\|_E)$ is a function space consisting of measurable bounded functions. M is a constant.
As mentioned above, the boosting method can be viewed as an iterative method for optimizing a convex empirical loss function.
Definition 1. Let E be the function space in (9) and L be the least squares loss. Given $\lambda_1 > 0$ and $\lambda_2 > 0$, we call a learning method a boosted histogram transform regression (BHTR) algorithm with respect to E and L if, for every data set D, the method assigns a function $f_D \in E$ such that

$$f_D := \operatorname*{arg\,min}_{f \in E} \; \Omega_\lambda(f) + \mathcal{R}_{L,D}(f),$$

where the regularization term $\Omega_\lambda(f)$ consists of two parts (its explicit form appears as an equation image in the original). The motivation for the first term is the fact that early boosting methods, such as AdaBoost, may overfit in the presence of label noise; an $L_2$ norm on the estimator weights controls the degree of overfitting so as to achieve consistency and convergence rates. The second term is added to control the bin widths of the histogram transforms, which in fact corresponds to penalizing an $L_p$ norm of the base learners $f_t$, since their capacity does not exceed that of piecewise constant functions on the cells of the corresponding partitions.
For the theoretical analysis we also need an infinite-sample version of Definition 1. To this end, we fix a distribution P on $\mathcal{X} \times \mathcal{Y}$ and let the function space E be as in (9). Every $f \in E$ satisfying the corresponding infinite-sample (population) minimization problem, whose form appears as an equation image in the original, is referred to as an infinite-sample version of BHTR with respect to E and L. Furthermore, the approximation error function $A(\lambda)$ is defined accordingly (the definition also appears as an equation image in the original).
With all of these preparations, we now present a general form of the BHTR algorithm. In fact, the randomness of the histogram transform provides an efficient procedure for performing boosting. With the help of HTR, we repeatedly fit the residuals by least squares. Furthermore, we introduce a learning rate ρ to shrink the gradient descent updates, which is related to regularization by shrinkage.
Algorithm: boosted histogram transform for regression (BHTR)

Input: training set $D := \{(x_1, y_1), \ldots, (x_n, y_n)\}$; bandwidth parameters; learning rate ρ.
Initialization: set the initial predictor (the initialization formula appears as an equation image in the original).
For t = 1 to T:
  generate a random affine transformation matrix $H_t$ (rotation, stretching, translation) and apply the induced data-independent partition to the transformed sample space;
  apply a constant function on each cell, i.e., fit the function $f_t$ to the current residuals so that $f_t$ belongs to the piecewise-constant function set of $H_t$ as defined in (7);
  update the predictor: $f^{(t)} := f^{(t-1)} + \rho f_t$;
  compute the residuals $y_i - f^{(t)}(x_i)$.
End for.
Output: the boosted histogram transform regression estimator $f^{(T)}$.
Application example of big data regression of the invention
The present invention is applied to the song release year prediction problem as an example. The data set adopted is the "year prediction song data set" (YearPredictionMSD, http://archive.ics.uci.edu/ml/datasets/YearPredictionMSD), and the main task is to predict the release year of a song from the audio features in the data set. This data set is a subset of the Million Song Dataset (http://millionsongdataset.com/) containing 463,715 training samples and 51,630 test samples, each observation having 90 attributes characterizing the timbre of songs released from 1922 to 2011. These features are obtained by computing mel-frequency cepstral coefficients (MFCCs) for the discretized audio sequence. Mel-frequency cepstral coefficients are often used in tasks such as speech recognition and audio information retrieval to extract audio features; the audio features can also be extracted by other methods, such as PNCC.
The implementation of the algorithm is illustrated with a support vector machine regression ensemble based on adaptive pure random partitioning and a support vector machine regression ensemble based on adaptive histogram transform partitioning. In the experiment, multiple adaptive random partitions are generated over the 90 attributes, yielding several overall regression models, whose average is taken as the integrated regression model. In the specific experimental setting, the support vector machine regression ensemble based on adaptive pure random partitioning combines 10 adaptive random partitions, with the sample space divided into 200 cells in each partition; the regularization parameter of the support vector machine regression model is selected from {0.01, 1, 100} and the bandwidth parameter of the Gaussian kernel from {0.001, 0.1, 10}, where for both parameters 30% of the training data is randomly selected as a validation set and the optimal parameters are selected automatically. For the support vector machine regression ensemble based on adaptive histogram transform partitioning, 5 histogram transform partitions are generated at random and integrated using the average; the corresponding parameter m is set to 2000, i.e., partitioning stops once every cell contains fewer than 2000 sample points, and the grids for the regularization parameter and the Gaussian kernel bandwidth in the support vector machine regression model are set to the same values as for adaptive pure random partitioning.
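The parameter selection described above might be expressed as follows; the helper name `select_svr_params`, the use of scikit-learn's SVR, and the plain grid search on a random 30% validation split are illustrative assumptions rather than the patent's prescribed implementation.

```python
import numpy as np
from sklearn.svm import SVR

# Settings from the experiment: candidate SVR parameters and a 30% validation split.
C_GRID = [0.01, 1, 100]
GAMMA_GRID = [0.001, 0.1, 10]

def select_svr_params(X, y, rng=None):
    """Pick (C, gamma) by mean squared error on a random 30% validation split."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))
    n_val = int(0.3 * len(X))
    val, tr = idx[:n_val], idx[n_val:]
    best, best_err = None, np.inf
    for C in C_GRID:
        for gamma in GAMMA_GRID:
            model = SVR(C=C, gamma=gamma).fit(X[tr], y[tr])
            err = np.mean((model.predict(X[val]) - y[val]) ** 2)
            if err < best_err:
                best, best_err = (C, gamma), err
    return best
```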
In a comparison with the prior art on the song release year prediction experiment using the "year prediction song data set", the support vector machine regression model based on adaptive pure random partitioning achieves higher prediction accuracy and faster running speed: its mean square error reaches 81.11, which relative to the polygon-partition support vector machine method (85.10) is a reduction of 3.99 in absolute terms and 4.7% in relative terms; its running time is 327 seconds, which relative to the polygon-partition support vector machine method (419 seconds) is an improvement of 92 seconds in absolute terms and 22% in relative terms. The prediction mean square error of the adaptive histogram transform ensemble model is 83.82 with a running time of 386 seconds; compared with the polygon-partition support vector machine method (85.10), its mean square error is reduced by 1.28 in absolute terms and 1.5% in relative terms. The stitched Gaussian process spatial interpolation method cannot exploit parallel computation, so its running time is excessive (more than 36 hours) and it cannot produce predictions for the "year prediction song data set".
In addition, taking the local average regression ensemble based on adaptive pure random partitioning as an example, the effect of the method on model continuity is examined through a simulation experiment. Let the data follow y = sin(x) + e, where the independent variable x follows the uniform distribution U(0, 10) and the random perturbation term e follows the normal distribution N(0, 0.2). In the experiment, 50,000 samples are randomly generated, and the method is compared with the polygon-partition support vector machine method and the stitched Gaussian process spatial interpolation method with respect to the discontinuity problem at partition boundaries; a data-generation sketch follows below.
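The simulated data can be generated as below, assuming the stated N(0, 0.2) refers to the standard deviation of the noise term (the text does not say whether 0.2 is the variance or the standard deviation).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.uniform(0.0, 10.0, size=n)    # x ~ U(0, 10)
eps = rng.normal(0.0, 0.2, size=n)    # e ~ N(0, 0.2); 0.2 taken as the std here
y = np.sin(x) + eps                   # y = sin(x) + e
```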
Fig. 26 shows simulation experiments on the simulated data using stitched Gaussian process spatial interpolation regression (left) and polygon-partition support vector machine regression (right). As shown in fig. 26, clearly discontinuous locations can be found in the regression results of both the stitched Gaussian process spatial interpolation method (fig. 26, left) and the polygon-partition support vector machine method (fig. 26, right); three typical locations are selected and shown enlarged on the right side.
Fig. 27 shows a simulation experiment in which, for support vector machine regression based on random histogram transform partitioning, continuity gradually increases as the number of random partitions T increases. Observing the regression models obtained by the invention in fig. 27, as the number of generated random partitions T increases, the regression model gradually becomes continuous and smooth, thereby ensuring the accuracy of the regression prediction.
The invention makes full use of the randomness of adaptive partitioning and the advantages of ensemble learning, solves the problem of discontinuous boundaries, and improves the accuracy of regression prediction. The invention also combines well with parallel computing: it can run on a CPU alone or together with a GPU, greatly reducing running time and improving algorithm efficiency, and it can even handle data of enormous volume and extremely high dimensionality.
Besides processing and analyzing audio data, such as voice recognition, audio information retrieval and the like, the invention can also be applied to other large-scale regression tasks, such as an age prediction task in image recognition, position prediction of a 5G terminal, 5G wireless network flow prediction, 5G mobile communication network planning and the like.
To facilitate understanding, exemplary embodiments and applications of the big data density estimation and large-scale regression methods based on adaptive random partitioning and model integration according to the present invention have been described and illustrated in the accompanying drawings. It should be understood, however, that these exemplary embodiments are only intended to illustrate the invention and not to limit its scope, and that the invention is not limited to the exemplary embodiments shown and described. Various modifications to the exemplary embodiments will be readily apparent to those skilled in the art.

Claims (11)

1. A data density estimation method, comprising the steps of:
generating a plurality of times of self-adaptive random division for the data set;
in each division, constructing a local density estimation model by using the samples in each division grid respectively;
splicing the local density estimation models together to obtain an overall density estimation model under random division; and integrating the overall density estimation model under the multiple divisions.
2. The data density estimation method of claim 1, wherein the adaptive stochastic partition comprises one of an adaptive pure stochastic partition, an adaptive histogram transform partition.
3. The data density estimation method according to claim 2,
wherein, in the adaptive pure random division, t sample points are randomly selected in advance before each division, the grid to be divided is selected as the grid containing the most samples, and the division dimension and the cut point are selected at random.
4. The data density estimation method according to claim 2,
wherein, before each division, the adaptive histogram transform division selects grids whose number of sample points is greater than m for division, selects the dimension to be divided as the dimension with the largest sample variance, and selects the cut point as the median of the data in that dimension, until the number of sample points in every grid is smaller than m.
5. A data density estimation apparatus comprising:
the self-adaptive division module generates a plurality of times of self-adaptive random division on the data set;
a density estimation module, which constructs a local density estimation model by using the samples in each division grid in each division;
an overall density estimation model single-estimation module, which splices the local density estimation models together to obtain an overall density estimation model under one random division; and
an overall density estimation model integration module, which integrates the overall density estimation models of the multiple divisions to obtain the density estimation of the data.
6. A data regression method comprising the steps of:
generating a plurality of times of self-adaptive random division for the data set;
obtaining a local regression model on each division grid, and splicing to obtain an overall regression model;
and integrating all the overall regression models to obtain an integrated model.
7. The data regression method of claim 6, wherein the adaptive stochastic partition comprises one of an adaptive pure stochastic partition, an adaptive histogram transform partition, and a stochastic adaptive polygon partition.
8. The data regression method of claim 6,
the local regression model adopts a support vector machine regression (SVR) or a local average method.
9. A data regression device, comprising:
the self-adaptive division module generates a plurality of times of self-adaptive random division on the data set;
the local regression module is used for obtaining a local regression model on each division grid;
an overall regression module, which splices the local regression models together to obtain an overall regression model; and
an overall regression model integration module, which integrates all the overall regression models to obtain an integrated model and thereby the regression analysis of the data.
10. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method of any one of claims 1 to 4 and 6 to 8.
11. A computer readable storage medium, characterized in that it stores at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method according to any one of claims 1 to 4, 6 to 8.
CN202010525621.7A 2020-02-06 2020-06-10 Data density estimation and regression method, corresponding device, electronic device, and medium Pending CN113221065A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN2020100815328 2020-02-06
CN202010081532 2020-02-06
CN202010502262 2020-06-04
CN2020105022623 2020-06-04

Publications (1)

Publication Number Publication Date
CN113221065A true CN113221065A (en) 2021-08-06

Family

ID=77085713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010525621.7A Pending CN113221065A (en) 2020-02-06 2020-06-10 Data density estimation and regression method, corresponding device, electronic device, and medium

Country Status (1)

Country Link
CN (1) CN113221065A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792945A (en) * 2021-11-17 2021-12-14 西南交通大学 Dispatching method, device, equipment and readable storage medium of commercial vehicle
CN113792945B (en) * 2021-11-17 2022-02-08 西南交通大学 Dispatching method, device, equipment and readable storage medium of commercial vehicle
CN114861776A (en) * 2022-04-21 2022-08-05 武汉大学 Dynamic self-adaptive network anomaly detection method based on artificial immunity technology
CN114861776B (en) * 2022-04-21 2024-04-09 武汉大学 Dynamic self-adaptive network anomaly detection method based on artificial immunity technology
CN117148017A (en) * 2023-10-27 2023-12-01 南京中鑫智电科技有限公司 High-voltage casing oil gas remote monitoring method and system
CN117148017B (en) * 2023-10-27 2023-12-26 南京中鑫智电科技有限公司 High-voltage casing oil gas remote monitoring method and system
CN117909852A (en) * 2024-03-19 2024-04-19 山东省地矿工程勘察院(山东省地质矿产勘查开发局八〇一水文地质工程地质大队) Monitoring data state division method for hydraulic loop ecological data analysis
CN117909852B (en) * 2024-03-19 2024-05-24 山东省地矿工程勘察院(山东省地质矿产勘查开发局八〇一水文地质工程地质大队) Monitoring data state division method for hydraulic loop ecological data analysis

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210806