WO2024036709A1

WO2024036709A1 - Anomalous data detection method and apparatus

Info

Publication number: WO2024036709A1
Application number: PCT/CN2022/121926
Authority: WO
Inventors: 庄海琪; 林炳鑫
Original assignee: 深圳前海微众银行股份有限公司
Priority date: 2022-08-18
Filing date: 2022-09-27
Publication date: 2024-02-22
Also published as: CN115357764A

Abstract

Embodiments of the present invention relate to an anomalous data detection method and apparatus. The method comprises: inputting W pieces of data to be detected that are distributed in W dimensions into target models corresponding to target dimensions, wherein a target model corresponding to any target dimension is obtained by means of sample data from which strong influence point data is removed; for any target model, if it is determined that the W pieces of data to be detected do not meet the target model, determining all dimensions included in the target model as anomalous dimensions; for any anomalous dimension, determining an anomalous probability that the anomalous dimension is determined as an anomalous dimension in each target model; and according to the anomalous probability, determining whether the data to be detected corresponding to the anomalous dimension is anomalous data. The presence of anomalous data among the W pieces of data to be detected can be detected, and which data of which dimension is anomalous data among the W pieces of data to be detected can also be accurately positioned. Therefore, automatic and accurate positioning of anomalous data is achieved, without the need of manual re-check.

Description

An abnormal data detection method and device

Cross-references to related applications

This application claims priority to the Chinese patent application submitted to the China Patent Office on August 18, 2022, with application number 202210992301.1 and application title "An abnormal data detection method and device", the entire content of which is incorporated into this application by reference.

Technical field

Embodiments of the present invention relate to the field of computer technology, and in particular, to an abnormal data detection method, device, computing device and computer-readable storage medium.

Background

With the development of computer technology, more and more technologies are applied in the financial field. The traditional financial industry is gradually transforming into financial technology (Fintech). However, the security and real-time requirements of the financial industry also place higher demands on the technology.

With the development of the Internet financial industry and the increasing improvement of computer technology, the amount of data of different dimensions generated by the financial system per unit time is increasing, and the order of magnitude of these dimensions can reach hundreds or thousands. There will inevitably be abnormal data in these data. There are many reasons for abnormal data, such as errors in manual entry, errors in computer processing and calculation, etc. The impact of the existence of abnormal data on subsequent statistical processing and other steps cannot be underestimated, so abnormal data needs to be detected.

The currently used abnormal data detection methods have low detection accuracy. After abnormal data is detected, it still has to be checked and confirmed manually, which incurs high labor and time costs.

In summary, an abnormal data detection method is needed to improve the accuracy of abnormal data detection.

Summary of the invention

Embodiments of the present invention provide an abnormal data detection method to improve the accuracy of abnormal data detection.

In a first aspect, embodiments of the present invention provide an abnormal data detection method, including:

W pieces of data to be detected, distributed in W dimensions, are input into the target models corresponding to the respective target dimensions; the target model corresponding to any target dimension is obtained from sample data from which strong influence point data has been removed; the strong influence point data refers to sample data whose influence on the accuracy of the target model does not meet a preset condition;

For any target model, if it is determined that W data to be detected do not satisfy the target model, then each dimension included in the target model is determined as an abnormal dimension;

For any abnormal dimension, determine the abnormal probability that the abnormal dimension is determined to be an abnormal dimension in each target model;

Determine whether the data to be detected corresponding to the abnormal dimension is abnormal data according to the abnormality probability.

By inputting W data to be detected distributed in W dimensions into each target model corresponding to each target dimension, the abnormal dimension is determined. Then, the abnormality probability of the abnormal dimension determined as an abnormal dimension in each target model is determined, and it is determined based on the abnormality probability whether the data to be detected corresponding to the abnormal dimension is abnormal data. It not only detects the presence of abnormal data in W data to be detected, but also accurately locates which dimension of data in the W data to be detected is abnormal data. This achieves automatic and accurate positioning of abnormal data, eliminating the need for manual review.
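The flow above can be sketched in a few lines. This is only an illustration under stated assumptions: the text does not specify here how a model is judged "not satisfied" or how the abnormality probability is computed, so the relative-residual test, the `tolerance` value, the probability definition (flagged models divided by models containing the dimension), and all model coefficients and data values below are hypothetical.

```python
def abnormal_probabilities(models, record, tolerance=0.05):
    """For each dimension, return the share of models containing it that the record violates.

    `models` maps a target dimension to the coefficients of its independent
    variable dimensions. The satisfaction test and the probability definition
    are assumptions made for illustration only.
    """
    hits = {}  # number of violated models that contain the dimension
    seen = {}  # number of models that contain the dimension
    for target, coeffs in models.items():
        predicted = sum(theta * record[dim] for dim, theta in coeffs.items())
        actual = record[target]
        # Hypothetical test: the relative residual must stay within `tolerance`.
        violated = abs(actual - predicted) > tolerance * max(abs(actual), 1e-12)
        for dim in [target, *coeffs]:
            seen[dim] = seen.get(dim, 0) + 1
            if violated:
                hits[dim] = hits.get(dim, 0) + 1
    return {dim: hits.get(dim, 0) / seen[dim] for dim in seen}


# Hypothetical linear models between five dimensions.
models = {
    "A": {"B": 1.0, "C": 1.0},   # A is roughly B + C
    "B": {"A": 1.0, "C": -1.0},  # B is roughly A - C
    "D": {"C": 2.0},             # D is roughly 2C
    "E": {"B": 0.5},             # E is roughly 0.5B
}
# One record to be detected; A is wrong (it should be 100).
record = {"A": 150.0, "B": 60.0, "C": 40.0, "D": 80.0, "E": 30.0}
probabilities = abnormal_probabilities(models, record)
```

In this toy record every model that contains dimension A is violated, while B and C also appear in models that still hold, so A receives the highest probability and is the natural candidate for the abnormal data.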

In some embodiments, after determining each dimension included in the target model as an abnormal dimension, the method further includes:

For any abnormal dimension, obtain each historical data corresponding to the abnormal dimension;

Determine the anomaly score of the data to be detected corresponding to the abnormal dimension by clustering the data to be detected corresponding to the abnormal dimension and each historical data;

Determining whether the data to be detected corresponding to the abnormal dimension is abnormal data according to the abnormality probability includes:

Determine whether the data to be detected corresponding to the abnormal dimension is abnormal data based on the abnormality probability and the abnormality score.

By obtaining the historical data of the abnormal dimension and clustering the data to be detected corresponding to the abnormal dimension together with the historical data, the anomaly score of the data to be detected corresponding to the abnormal dimension is obtained. The abnormality probability and the anomaly score are then combined to determine whether the data to be detected corresponding to the abnormal dimension is abnormal data. Combining the two judgment methods takes into account both the probability that the abnormal dimension is determined to be an abnormal dimension and the historical data of the abnormal dimension, increasing the accuracy of determining abnormal data.

In some embodiments, by clustering the data to be detected corresponding to the abnormal dimension and the historical data, the anomaly score of the data to be detected corresponding to the abnormal dimension is determined, including:

Construct an isolated binary tree for the data to be detected corresponding to the abnormal dimension and each of the historical data;

Calculate the anomaly score of the data to be detected corresponding to the abnormal dimension in the isolated binary tree.

By constructing an isolated binary tree from the data to be detected corresponding to the abnormal dimension and each historical data, the accuracy of determining abnormal data is improved.
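As a rough sketch of how an isolated binary tree can yield an anomaly score for one dimension, the following assumes the standard isolation-forest score, 2 raised to the power of minus the mean isolation depth divided by a normalising constant. The text does not state its scoring formula at this point, so that choice, the number of trees, the depth limit, and the sample values are all assumptions.

```python
import math
import random


def average_path(n):
    """Average path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def path_length(value, points, rng, depth=0, max_depth=12):
    """Depth at which `value` is isolated by random splits of the 1-D `points`."""
    if len(points) <= 1 or depth >= max_depth:
        return depth + average_path(len(points))
    low, high = min(points), max(points)
    if low == high:  # only duplicates remain, so no further split is possible
        return depth + average_path(len(points))
    split = rng.uniform(low, high)
    # Keep only the points that fall on the same side of the split as `value`.
    same_side = [p for p in points if (p < split) == (value < split)]
    return path_length(value, same_side, rng, depth + 1, max_depth)


def anomaly_score(value, history, trees=200, seed=0):
    """Score near 1 means easily isolated (suspicious); lower means typical."""
    rng = random.Random(seed)
    points = history + [value]
    mean_depth = sum(path_length(value, points, rng) for _ in range(trees)) / trees
    return 2.0 ** (-mean_depth / average_path(len(points)))


# Hypothetical historical data for one abnormal dimension.
history = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 100.0, 101.0, 99.0]
outlier_score = anomaly_score(150.0, history)
typical_score = anomaly_score(100.0, history)
```

A value far from the historical data is separated after very few random splits, so its score is higher than that of a value lying inside the historical range.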

In some embodiments, the target model corresponding to any target dimension is determined in the following manner, including:

Obtain initial n groups of sample data distributed in M dimensions; where each group of sample data has M dimensions;

For the target dimension among the M dimensions, select K independent variable dimensions that are correlated with the target dimension from the M dimensions according to the correlation coefficients of the initial n groups of sample data; the target dimension is any one of the M dimensions;

The target model is determined based on the target n groups of sample data distributed in the target dimension and the K independent variable dimensions; the target model is used to characterize the relationship satisfied between the target dimension and the K independent variable dimensions.

The initial n sets of sample data are distributed in M dimensions, but not all of these M dimensions are necessarily related. Therefore, it is necessary to select the dimensions with relevant relationships. For any target dimension, selection is made based on the correlation coefficients of the initial n groups of sample data to obtain K independent variable dimensions with correlations corresponding to the target dimension. In this way, each dimension can be used as a target dimension, and each target dimension and its independent variable dimensions can correspondingly determine a target model. Taking into account richer scenarios and situations, the accuracy of the determined target model is increased.

In some embodiments, determining the target model based on target n groups of sample data distributed in the target dimension and the K independent variable dimensions includes:

Obtain target n groups of sample data distributed in the target dimension and the K independent variable dimensions;

The degree of influence of the candidate group of sample data on the accuracy of the target model is determined according to the target n groups of sample data and the n-1 groups of sample data other than the candidate group of sample data; the candidate group of sample data is any group of sample data in the target n groups of sample data;

According to the influence degree of n candidate group sample data, w groups of strong influence point data are removed from the target n group of sample data; 1≤w<n;

The target model is determined based on the retained sample data.

The degree of influence of the candidate group of sample data on the accuracy of the target model is determined based on the target n groups of sample data and the n-1 groups of sample data other than the candidate group. According to the degree of influence, w groups of strong influence point data are removed from the target n groups of sample data, and the target model is determined based on the retained sample data. In this way, the sample data that has a greater impact on the accuracy of the target model is eliminated, the impact of this sample data on the final target model is minimized, and the accuracy of the target model is improved. Abnormal data detection is then performed on the data to be detected based on a more accurate target model, which improves the accuracy of detecting abnormal data.

In some embodiments, determining the impact of the candidate group sample data on the accuracy of the target model based on the target n group of sample data and n-1 groups of sample data except the candidate group sample data includes:

Fit the target n groups of sample data to obtain the first fitting coefficient of the first fitting model;

Fit n-1 groups of sample data other than the candidate group sample data to obtain the second fitting coefficient of the second fitting model;

The degree of influence is determined based on the first fitting coefficient, the second fitting coefficient, the number of independent variable dimensions included in the target model, and the mean square error of the first fitting model.

Based on the first fitting coefficient, obtained by fitting with the candidate group of sample data included, and the second fitting coefficient, obtained by fitting with the candidate group excluded, the influence of the candidate group of sample data on the accuracy of the target model is determined, which improves the accuracy of the determined degree of influence. As a result, a more accurate target model can be obtained.
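The four inputs listed above (the two fitting coefficients, the number of independent variable dimensions, and the mean square error of the first fit) are the ingredients of Cook's distance, so the sketch below assumes that statistic; the text itself does not name it. To stay short it handles a single independent variable with no intercept, and the sample values are hypothetical.

```python
def fit_slope(xs, ys):
    """Least-squares coefficient for y = theta * x (one regressor, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def influence_degrees(xs, ys):
    """Cook's-distance-style influence of each group (assumed form, p = 1)."""
    n, p = len(xs), 1                      # p: number of independent variable dimensions
    theta_all = fit_slope(xs, ys)          # first fitting coefficient (all n groups)
    sse = sum((y - theta_all * x) ** 2 for x, y in zip(xs, ys))
    mse = sse / (n - p)                    # mean square error of the first fitting model
    sxx = sum(x * x for x in xs)           # X'X reduces to this scalar when p = 1
    degrees = []
    for i in range(n):
        # Second fitting coefficient: refit on the n-1 groups without group i.
        theta_loo = fit_slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        degrees.append((theta_all - theta_loo) ** 2 * sxx / (p * mse))
    return degrees


# Hypothetical sample data: the last group lies far off the y = 2x trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]
degrees = influence_degrees(xs, ys)
strongest = degrees.index(max(degrees))
```

The group whose removal shifts the coefficient most receives the largest degree of influence; here that is the last group.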

In some embodiments, removing w groups of strong influence point data from the target n groups of sample data based on the influence of n candidate groups of sample data includes:

For any candidate group of sample data, if the degree of influence of the candidate group on the accuracy of the target model is greater than the first quartile of the F distribution with (p, n-p-1) degrees of freedom, the candidate group of sample data is determined to be strong influence point data; where p is the number of independent variable dimensions included in the target model;

Remove the strong influence point data from the target n groups of sample data.

Using the first quartile of the F distribution with (p, n-p-1) degrees of freedom as the threshold for the degree of influence is more scientific and reasonable, and improves the accuracy of determining strong influence point data. As a result, a more accurate target model can be obtained.

In some embodiments, after determining the target model based on the retained sample data, the method further includes:

Input test data into the target model for testing; obtain the average absolute error rate of the target model;

It is determined that the fitting degree parameter and the average absolute error rate of the target model for fitting the retained sample data respectively meet a preset threshold.

The target models corresponding to the target dimensions are not used directly to detect abnormal data; instead, the target models are screened once more. By inputting the test data into any target model for testing, the average absolute error rate of the target model on the test data is obtained. If the fitting degree parameter of the target model and the average absolute error rate each meet their preset thresholds, the fitting accuracy of the target model is high and the model can be used for subsequent abnormal data detection. This improves the accuracy of abnormal data detection.
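A minimal sketch of this screening step follows. It assumes that the fitting degree parameter is the coefficient of determination (R squared) and that the average absolute error rate is the mean absolute percentage error; the threshold values and the sample numbers are hypothetical.

```python
def r_squared(actual, fitted):
    """Coefficient of determination of the fit on the retained sample data."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error_rate(actual, predicted):
    """Mean of |actual - predicted| / |actual| over the test data."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def model_is_usable(train_actual, train_fitted, test_actual, test_predicted,
                    min_r2=0.9, max_error_rate=0.05):
    """Keep a target model only if both criteria meet their (assumed) thresholds."""
    return (r_squared(train_actual, train_fitted) >= min_r2
            and mean_absolute_error_rate(test_actual, test_predicted) <= max_error_rate)


# Hypothetical values of a target dimension and the model's outputs.
train_actual = [10.0, 20.0, 30.0, 40.0]
train_fitted = [11.0, 19.0, 31.0, 39.0]
test_actual = [50.0, 60.0]
test_predicted = [51.0, 58.8]
usable = model_is_usable(train_actual, train_fitted, test_actual, test_predicted)
```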

In a second aspect, embodiments of the present invention also provide an abnormal data detection device, including:

Processing unit for:

In some embodiments, the processing unit is also used to:

The processing unit is specifically used for:

In some embodiments, the processing unit is specifically used to:

The target model is determined based on the retained sample data.

In some embodiments, the processing unit is specifically used to:

Remove the strong influence point data from the target n groups of sample data.

In some embodiments, the processing unit is also used to:

In a third aspect, an embodiment of the present invention further provides a computing device, including:

Memory, used to store computer programs;

A processor, configured to call the computer program stored in the memory, and execute the abnormal data detection method listed in any of the above methods according to the obtained program.

In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium that stores a computer-executable program, and the computer-executable program is used to cause a computer to execute the abnormal data detection method listed in any of the above methods.

Brief description of the drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the following will briefly introduce the drawings needed to describe the embodiments. Obviously, the drawings in the following description are only some embodiments of the present invention. Those of ordinary skill in the art can also obtain other drawings based on these drawings without exerting creative efforts.

Figure 1 is a schematic diagram of a method for determining a target model based on n sets of sample data provided by an embodiment of the present invention;

Figure 2 is a schematic flowchart of a method for determining a target model provided by an embodiment of the present invention;

Figure 3 is a schematic diagram of a fitted straight line obtained by fitting using the least squares method according to an embodiment of the present invention;

Figure 4 is a schematic diagram of a detailed target determination model provided by an embodiment of the present invention;

Figure 5 is a schematic diagram of a possible abnormal data detection method provided by an embodiment of the present invention;

Figure 6 is a schematic diagram of a constructed isolated binary tree provided by an embodiment of the present invention;

Figure 7 is a schematic structural diagram of an abnormal data detection device provided by an embodiment of the present invention;

Figure 8 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.

Detailed description

In order to make the purpose, implementation and advantages of the present application clearer, the exemplary embodiments of the present application will be clearly and completely described below in conjunction with the accompanying drawings in the exemplary embodiments of the present application. Obviously, the described exemplary embodiments are only some of the embodiments of this application, not all of them.

Based on the exemplary embodiments described in this application, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of the claims appended to this application. In addition, although the disclosure in this application is introduced in terms of one or several exemplary examples, it should be understood that each aspect of these disclosures can also individually constitute a complete embodiment.

It should be noted that the brief description of terms in this application is only to facilitate understanding of the embodiments described below, and is not intended to limit the embodiments of this application. Unless otherwise stated, these terms should be understood according to their ordinary and usual meaning.

The terms "first", "second", "third", etc. in the description and claims of this application and in the above-mentioned drawings are used to distinguish similar objects or entities, and do not necessarily imply a specific order or sequence, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances, so that, for example, the embodiments can be implemented in an order other than that shown or described in accordance with the embodiments of the present application.

In addition, the terms "including" and "having" and any variations thereof are intended to cover a non-exclusive inclusion; for example, a product or device that includes a series of components need not be limited to those components explicitly listed, but may include other components not expressly listed or inherent to the product or device.

In order to better explain this application, the technologies or terms involved in this application are first explained as follows.

1. Multiple linear regression: In regression analysis, if there are two or more independent variables, it is called multiple regression. In fact, a phenomenon is often associated with multiple factors. Predicting or estimating the dependent variable using the optimal combination of multiple independent variables is more effective and more realistic than using only one independent variable to predict or estimate. Therefore, multiple linear regression has greater practical significance than single linear regression.

2. Ordinary Least Squares (OLS): It is a mathematical optimization modeling method. It finds the best functional match of the data by minimizing the sum of squared errors. The least squares method can be used to easily obtain unknown data, and minimize the sum of square errors between the obtained data and the actual data.

3. Degree of freedom: the number of independent, freely varying values in a sample when statistics of the sample are used to estimate the parameters of the population; this is called the degree of freedom of the statistic. Generally speaking, the degrees of freedom equal the number of independent values minus the number of quantities derived from them. For example, the variance is defined from the sum of squared deviations of the samples from the mean (a derived quantity determined by the samples), so for N random samples its degree of freedom is N-1.

4. Decision tree: It is a prediction model that represents a mapping relationship between object attributes and object values. Each node in the tree represents an object, and each bifurcation path represents a possible attribute value, and each leaf node corresponds to the value of the object represented by the path from the root node to the leaf node. The decision tree has only a single output. If you want to have multiple outputs, you can build an independent decision tree to handle different outputs. Decision tree is a frequently used technology in data mining. It can be used to analyze data and can also be used to make predictions.

5. Strong influence points: refers to data points that have a strong influence on the parameter estimation of the multiple linear regression model. Since multiple linear regression uses the least squares method for parameter estimation, all records are treated equally at this time. When there are records in the database that are far away from the body of multidimensional spatial data, they will cause the fitted model to be biased toward that data point. The identification of strong influence points is another important issue that should be paid attention to when performing multiple linear regression. Strong influence points are data that have a great impact on the stability and authenticity of parameter estimates. For regression model data sets, strong influence points refer to those points that have a very large influence and impact on the value of statistics.

Research has found that in most financial scenarios there is a linear correlation between data of different dimensions, and there is almost no non-linear relationship. Based on this characteristic, a method can be designed that automatically mines the consistent models between data of different dimensions by analyzing sample data, and then uses these models to detect abnormal data in the data to be detected, thereby improving overall data quality.

In order to ensure the accuracy of abnormal data detection, it is extremely important to determine a business-explanatory model that can reflect the real rules between data in different dimensions. Therefore, how to improve the accuracy of models for determining data of different dimensions has become the focus of our research.

Based on this, embodiments of the present invention provide the following method for determining a target model based on n sets of sample data, as shown in Figure 1, including:

Step 101: Obtain initial n groups of sample data distributed in M dimensions; wherein each group of sample data has M dimensions.

Step 102: For the target dimension among the M dimensions, select K independent variable dimensions that are correlated with the target dimension from the M dimensions according to the correlation coefficients of the initial n groups of sample data; the target dimension is any one of the M dimensions.

Step 103: Determine the target model based on the target n groups of sample data distributed in the target dimension and the K independent variable dimensions; the target model is used to characterize the relationship satisfied between the target dimension and the K independent variable dimensions.

In step 101, initial n groups of sample data are obtained, and each group of sample data is distributed in M dimensions.

The embodiments of the present invention do not limit the ways and methods for obtaining sample data. The sample data can be automatically read from the database or manually imported. For example, read the training data text in excel format through the pandas library or read the database in other ways.

Optionally, preliminary preprocessing can be performed on the read sample data; in particular, missing values in the sample data are filled with default values. Depending on the data characteristics, missing values can be filled with 0 or with the median. The data can also be cleaned into a format that meets the algorithm training requirements.
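The two filling options can be sketched as follows. This is a plain-Python illustration rather than the pandas-based reading mentioned above; representing a missing value as `None` and the helper name `fill_missing` are assumptions.

```python
from statistics import median


def fill_missing(column, strategy="median"):
    """Replace None entries of one dimension with 0 or with the column median."""
    present = [value for value in column if value is not None]
    fill = 0.0 if strategy == "zero" else median(present)
    return [fill if value is None else value for value in column]


# Hypothetical column with one missing monthly value.
column = [1.0, None, 3.0, 5.0]
filled_with_median = fill_missing(column)
filled_with_zero = fill_missing(column, strategy="zero")
```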

Table 1 shows possible read sample data.

Table 1

Table 1 contains a date column and five dimension columns. The five dimensions are Dimension A: deposits with interbank funds; Dimension B: domestic commercial banks; Dimension C: other domestic banking financial institutions; Dimension D: other domestic financial institutions; Dimension E: interest receivable. Table 1 contains 61 sets of sample data, from January 2016 to January 2021.

There are certain correlations in each dimension. The goal of the embodiments of the present invention is to automatically discover the equal or approximately equal relationships between various dimensions in numerous sample data, and to try to improve the accuracy of the relationship equations through subsequent algorithms. Thus, for any target dimension, a corresponding target model is obtained to achieve the purpose of abnormal data detection based on the target model.

In one possible implementation, the obtained sample data are used as initial n sets of sample data for subsequent determination of the target model.

Another possible implementation is to divide the obtained sample data into a training set and a test set. The training set is used as the initial n sets of sample data for subsequent determination of the target model, and the test set is used to test and verify the obtained target model to evaluate the accuracy of the target model.

If the sample data is divided into a training set and a test set, the sample data can be divided according to a certain proportion. The embodiment of the present invention does not limit the division ratio, which may be, for example, 9:1 or 8:2. When dividing the training set and the test set, the sample data can be sorted and divided according to certain rules, or divided without sorting. There is no restriction on this, because whether the sample data is sorted does not affect the accuracy of the determined target model.

The following uses a detailed example to introduce the division of the obtained sample data into a training set and a test set.

Sort the sample data by date, and divide the sample data according to training set : test set = 9:1. The first 90% of the sample data is used as the training set, and the last 10% of the sample data is used as the test set. In the example in Table 1, a total of 61 sets of sample data were collected. The first 55 sets of sample data (i.e., January 2016 to July 2020) are used as the training set to determine the target model; the last 6 sets of sample data (i.e., August 2020 to January 2021) are used as the test set to verify and test the target model and evaluate its accuracy.
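The split in this example can be checked with a short sketch. The month labels stand in for the rows of Table 1, and rounding 61 × 0.9 = 54.9 to the nearest integer is an assumed rule that happens to give the 55/6 split described above.

```python
def month_labels(start_year, start_month, count):
    """Return `count` consecutive month labels such as '2016-01'."""
    labels = []
    year, month = start_year, start_month
    for _ in range(count):
        labels.append(f"{year}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return labels


def split_by_order(samples, train_ratio=0.9):
    """Split date-sorted samples into a leading training set and a trailing test set."""
    cut = round(len(samples) * train_ratio)  # assumed rounding rule
    return samples[:cut], samples[cut:]


months = month_labels(2016, 1, 61)  # January 2016 to January 2021
train, test = split_by_order(months)
```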

One possible situation is that, if the amount of sample data obtained is small, the K-fold cross-validation method can be used to split the sample data. For example, suppose only 10 groups of sample data are obtained. If the samples are divided evenly into K parts (for example, 5 parts), each part has 2 groups of data. In the target model determination stage, 4 of the parts (8 groups) can be randomly selected as the training set and the remaining part (2 groups) used as the test set, and the training set is used to obtain the regression coefficients of the target model. The random selection is repeated multiple times to generate multiple sets of regression coefficients, which are weighted and averaged to obtain the final regression coefficients. This compensates for insufficient training caused by a small amount of sample data.
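A minimal sketch of this idea is given below. It makes several simplifying assumptions: a single independent variable with no intercept, each of the K parts held out exactly once instead of repeated random selection, equal weights in the average, and hypothetical sample values.

```python
import random


def fit_slope(xs, ys):
    """Least-squares coefficient for y = theta * x (one regressor, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def k_fold_slope(xs, ys, k=5, seed=0):
    """Fit on K-1 parts for each held-out part and average the coefficients."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # K parts of equal size here
    slopes = []
    for held_out in folds:
        train = [i for i in indices if i not in held_out]
        slopes.append(fit_slope([xs[i] for i in train], [ys[i] for i in train]))
    return sum(slopes) / len(slopes), slopes   # equal weights (assumption)


# Ten hypothetical groups of sample data lying close to y = 2x.
xs = [float(i) for i in range(1, 11)]
ys = [2.1, 3.9, 6.1, 7.9, 10.1, 11.9, 14.1, 15.9, 18.1, 19.9]
final_slope, fold_slopes = k_fold_slope(xs, ys)
```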

The following uses the first 55 sets of sample data (i.e., January 2016 to July 2020) as the training set to determine the target model as an example to introduce the method of determining the target model.

The first 55 sets of sample data are used as the training set, that is, as the initial n sets of sample data distributed in M dimensions. In the above example, the initial n sets of sample data are distributed in 5 dimensions.

In step 102, since the sample data of the five dimensions may not all have a linear regression relationship, and only some of the dimensions may have one, it is necessary to determine, for each target dimension in the M dimensions, the independent variable dimensions that are related to the target dimension.

For example, for the five dimensions in Table 1, let each dimension in turn be the target dimension, and select the corresponding independent variable dimensions for it. Then the data in the target dimension and the data in the independent variable dimensions are substituted into the linear regression equation $y = \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n$, where $y$ is the data corresponding to the target dimension and $x_1, x_2, \dots$ are the data corresponding to the respective independent variable dimensions.

Taking dimension A as the target dimension as an example, the method of selecting independent variable dimensions for the target dimension is introduced. In this example, dimension A is the target dimension, dimension B, dimension C, dimension D and dimension E are candidate independent variable dimensions. Next, the independent variable dimension corresponding to dimension A must be selected from these candidate independent variable dimensions.

First, a sample data matrix (55×5) is constructed from the initial n groups of sample data, containing 55 groups of data in 5 dimensions. The column of the target dimension (dimension A) is moved to the last column of the matrix, and the correlation coefficient matrix r is calculated from this sample data matrix. The correlation coefficient matrix r is calculated through the covariance formula.

The specific calculation formula is as follows:

$$r_{XY} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}} \quad \text{(Formula 1)}$$

where $n$ is the number of groups of sample data (55 here); $X_i$ is the monthly data value of any candidate independent variable dimension and $\bar{X}$ is the average of the 55 months of data in that candidate independent variable dimension; $Y_i$ is the monthly data value in the target dimension and $\bar{Y}$ is the average of the 55 months of data for the target dimension. By substituting the above data into Formula 1, the correlation coefficient between the target dimension and any candidate independent variable dimension can be obtained. For example, $Y_i$ is the data value of dimension A for each of the 55 months and $\bar{Y}$ is the average value of dimension A's data over the 55 months; $X_i$ is the data value of dimension B for each of the 55 months and $\bar{X}$ is the average of the 55 months of data for dimension B. Substituting the above data into Formula 1, the correlation coefficient of dimension A and dimension B can be obtained. Using the same method, we can obtain the correlation coefficient between dimension A and dimension C, the correlation coefficient between dimension A and dimension D, and the correlation coefficient between dimension A and dimension E. They are not enumerated here.
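Assuming Formula 1 is the usual Pearson correlation coefficient (the text describes it only as a covariance-based formula over the values and their averages), the calculation can be sketched as follows. The column values are hypothetical and do not come from Table 1.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long data columns."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlation coefficient matrix r; put the target dimension last."""
    return [[pearson(a, b) for b in columns] for a in columns]


# Hypothetical monthly values: one candidate dimension and the target dimension.
candidate = [10.0, 12.0, 15.0, 19.0, 24.0]
target = [21.0, 24.5, 30.0, 39.0, 47.5]
r = correlation_matrix([candidate, target])
```

The last column of `r` then holds the correlation of each candidate dimension with the target dimension, which is the column the text reads off in the following paragraphs.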

For example, each correlation coefficient forms the following correlation coefficient matrix r.

The last column is the target dimension column, which is dimension A. According to the last column, the correlation coefficient between dimension A and dimension B is 0.9976391; the correlation coefficient between dimension A and dimension C is -0.07923952, and the correlation coefficient between dimension A and dimension D is 0.63029953. The correlation coefficient between dimension A and dimension E is 0.46870661. The closer the absolute value of the correlation coefficient is to 1, the more relevant the two are.

Then the variance contribution value of each candidate independent variable dimension is calculated based on the correlation coefficient matrix r. The formula for the variance contribution is as follows.

$$v_i = \frac{r(i,\mathrm{columns})^2}{r(i,i)}, \quad i = 1, \dots, \mathrm{columns}-1 \quad \text{(Formula 2)}$$

where columns is the total number of columns of matrix r (in this example, columns=5), r(i,i) is the value in the i-th row and i-th column of the correlation coefficient matrix, and r(i,columns) is the value in the i-th row of the last column. For example, $v_1 = r(1,5)^2 / r(1,1) = 0.9976391^2 = 0.99528377$. That is, the variance contribution value of dimension B to the target model obtained with dimension A as the target dimension is 0.99528377.

The variance contribution values of dimension B, dimension C, dimension D and dimension E to the target model with dimension A as the target dimension are finally obtained as the matrix [0.99528377 0.0062789 0.3972775 0.21968589]. The larger the variance contribution value of a dimension, the more that dimension contributes to the target model obtained with dimension A as the target dimension.
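The variance contribution values quoted above can be reproduced from the correlation coefficients given in the text. This is a minimal sketch: it assumes that every diagonal entry r(i,i) equals 1, which holds for a correlation matrix before any transformation, and the dictionary names are hypothetical.

```python
# Correlation of each candidate dimension with target dimension A,
# as quoted in the text (last column of r).
r_with_target = {
    "B": 0.9976391,
    "C": -0.07923952,
    "D": 0.63029953,
    "E": 0.46870661,
}

# ASSUMPTION: before any transformation the diagonal of r is all ones.
r_diag = {name: 1.0 for name in r_with_target}

# Variance contribution: squared correlation with the target over r(i,i).
contrib = {
    name: r_with_target[name] ** 2 / r_diag[name]
    for name in r_with_target
}

# The candidate with the largest contribution is considered first.
best = max(contrib, key=contrib.get)
print(best, contrib)
```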

Calculate the F value of the F distribution corresponding to the maximum variance contribution value through Formula 3. The maximum variance contribution value is the variance contribution value corresponding to dimension B.

Here, nos is the number of groups of sample data n, and in is the number of candidate independent variable dimensions. In this example, n=55 and in=4.

Substituting into the formula, the F value of dimension B is 11184.801222455637. According to the F distribution table, this F value corresponds to a distribution probability p value of 2.449050249153728e-63. In statistics, a p value below 0.05 generally indicates that the independent variable is significant and can be introduced into the regression equation. Therefore, dimension B is the first dimension used as an independent variable dimension of the target model.
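Formula 3 itself is not reproduced in this text, so the following is only a sketch under an assumption: it uses the first-step F statistic commonly used in stepwise regression, F = V(n-l-2)/(1-V), where l is the number of independent variables already selected (0 here). This assumed form reproduces the quoted F value to within the rounding of the quoted correlation coefficient, but it is not confirmed to be the exact formula of the embodiment. The variable names are hypothetical.

```python
n = 55                       # number of groups of sample data (from the text)
v_max = 0.9976391 ** 2       # variance contribution of dimension B
already_selected = 0         # no independent variable has been selected yet

# ASSUMPTION: first-step stepwise-regression F statistic.
# The denominator 1 - v_max relies on the target's diagonal entry being 1.
f_value = v_max * (n - already_selected - 2) / (1.0 - v_max)
print(f_value)
```

If SciPy is available, `scipy.stats.f.sf(f_value, 1, n - 2)` would give the upper-tail probability under the same assumption about the degrees of freedom.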

Then use the following method to perform matrix transformation on the correlation coefficient matrix r:

Here, i is the current row, j is the current column, and k is the subscript in v of the factor with the largest variance contribution; in this example, k=1. The transformation formula is as follows:

When i=k and j≠k: the new value of r[i,j]=r[k,j]/r[k,k];

When i≠k and j≠k: the new value of r[i,j]=r[i,j]-r[i,k]*r[k,j]/r[k,k];

When i≠k and j=k: the new value of r[i,j]=-r[i,k]/r[k,k];

In the remaining case (i=k and j=k): the new value of r[i,j]=1/r[k,k].
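The four cases above can be sketched as a single function. This is an illustrative sketch rather than the implementation of the embodiment: it assumes that the first case applies to the pivot row (i=k, j≠k), it uses 0-based indices where the text uses 1-based subscripts, and the function name `sweep` and the 2×2 matrix are hypothetical.

```python
def sweep(r, k):
    """Return a transformed copy of square matrix r for pivot index k (0-based)."""
    size = len(r)
    pivot = r[k][k]
    if pivot == 0:
        raise ZeroDivisionError("pivot r[k][k] is zero")
    out = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == k and j == k:        # pivot element
                out[i][j] = 1.0 / pivot
            elif i == k:                 # pivot row, j != k
                out[i][j] = r[k][j] / pivot
            elif j == k:                 # pivot column, i != k
                out[i][j] = -r[i][k] / pivot
            else:                        # i != k and j != k
                out[i][j] = r[i][j] - r[i][k] * r[k][j] / pivot
    return out

# Hypothetical matrix: one candidate dimension (index 0) and the target (index 1).
r = [[1.0, 0.5],
     [0.5, 1.0]]
swept = sweep(r, 0)
print(swept)
```

In this small case the target's diagonal entry becomes 1 - 0.5² = 0.75, that is, the share of the target's variance not explained by the selected dimension.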

The transformed matrix r is:

Then based on the transformed correlation coefficient matrix r, repeat the above steps of calculating the maximum variance contribution value, and continue to iteratively select new independent variable dimensions. Finally, when the target dimension is dimension A, the obtained independent variable dimensions are dimension B, dimension C and dimension D.

In the same way, the corresponding independent variable dimensions are obtained when the target dimension is dimension B, when the target dimension is dimension C, and when the target dimension is dimension D. Details are not repeated here.

It is worth noting that the number of independent variable dimensions corresponding to different target dimensions may be the same or different. For example, when dimension A is the target dimension, there are 3 corresponding independent variable dimensions, namely dimension B, dimension C and dimension D; when dimension B is the target dimension, there are 2 corresponding independent variable dimensions, namely dimension C and dimension D; when dimension C is the target dimension, there is 1 corresponding independent variable dimension, namely dimension D.

The following describes step 103, in which, for any target dimension, the target model corresponding to that target dimension is determined.

Take as an example the case where the target dimension is dimension A and the corresponding independent variable dimensions are dimension B, dimension C and dimension D.

The process of determining the target model is shown in Figure 2, including:

Step 201: Obtain target n groups of sample data distributed in the target dimension and the K independent variable dimensions.

Step 202: Determine the influence of the candidate group of sample data on the accuracy of the target model based on the target n groups of sample data and n-1 groups of sample data except the candidate group of sample data; the candidate group of sample data is any group of sample data among the target n groups of sample data.

Step 203: Remove w groups of strong influence point data from the target n groups of sample data according to the influence degrees of the n candidate groups of sample data; 1≤w<n.

Step 204: Determine the target model based on the retained sample data.

In step 201, target n sets of sample data are determined among the initial n sets of sample data. For example, when the target dimension is dimension A, the corresponding independent variable dimensions are dimension B, dimension C and dimension D, and K=3. Therefore, the determined target n groups of sample data are 55 groups of sample data distributed in four dimensions: dimension A, dimension B, dimension C and dimension D.

The following describes the fitting results obtained if these 55 sets of sample data are used for fitting at this time. Take fitting using the least squares method as an example.

The principle of the least squares method is to try to make each data point close to the fitted straight line by calculating the regression coefficient. Figure 3 shows a schematic diagram of a possible fitting situation. In Figure 3, each point is evenly distributed around the fitted straight line, and the distance between the actual value of each point and the corresponding predicted value on the straight line is the smallest.

In this example, the target equation is: $y=\beta_1 x_1+\beta_2 x_2+\beta_3 x_3$, where y is dimension A, $x_1$ is dimension B, $x_2$ is dimension C, and $x_3$ is dimension D. The regression coefficient β needs to be calculated. The regression coefficient determines the slope of the straight line, so that the straight line fits the 55 groups of sample data as closely as possible, that is, so that the sum of the distances between all points and the straight line of the equation is minimized. The sum of the distances is defined by the residual sum of squares (RSS):

$$\mathrm{RSS} = \sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2} \qquad \text{(Formula 4)}$$

where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. On the premise of minimizing the RSS, the normal equation, Formula 5, is solved by the least squares method to obtain the regression coefficient.

$$\beta = (X^{T}X)^{-1}X^{T}Y \qquad \text{(Formula 5)}$$

Substituting the 55 groups of sample data distributed in 4 dimensions into the above formula gives the regression coefficient β, which is also a matrix. The first fitting model obtained by fitting the 55 groups of sample data is: A=1.0053×B+0.25×C+0.9828×D. The fitting degree R² of this first fitting model is 0.999, and the p-value used as the reference for the significance level of each parameter is low. This means that the first fitting model obtained by fitting these 55 groups of sample data fits relatively well and can reflect the pattern among these 55 groups of sample data.
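The least squares fit described above can be sketched as follows. The sketch solves the normal equations (XᵀX)β = XᵀY by Gaussian elimination instead of forming the inverse explicitly, which gives the same β. The helper names `solve` and `ols` and the five data rows are hypothetical; the rows are constructed so that A = B + C + D holds exactly, and they are not the sample data of this embodiment.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]    # augmented copy
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(m[i][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[piv] = m[piv], m[col]
        for row in range(col + 1, n):
            f = m[row][col] / m[col][col]
            for c in range(col, n + 1):
                m[row][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                       # back substitution
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(X, y):
    """Regression coefficients from the normal equations (no intercept)."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical rows of (B, C, D); A is built so that A = B + C + D exactly.
X = [[1.0, 2.0, 3.0], [2.0, 1.0, 5.0], [4.0, 3.0, 1.0],
     [3.0, 5.0, 2.0], [5.0, 4.0, 4.0]]
y = [sum(row) for row in X]
beta = ols(X, y)
print(beta)
```

On clean data that satisfies the relation exactly, the fitted coefficients are all 1 up to floating-point error.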

On the surface, by evaluating the first fitting model based on the degree of fit and significance level, it can be concluded that the first fitting model is reasonable and more accurate. However, according to business experience and historical data, the relationship that should be satisfied between dimension A, dimension B, dimension C and dimension D is: A=1×B+1×C+1×D. It can be seen that the first fitting model obtained previously is not in line with business experience and historical data, and has no business interpretability. Using such a first fitting model to detect abnormal data on the data to be detected will inevitably lead to a decrease in detection accuracy.

Further analysis showed that abnormal data may have appeared in the 55 sets of sample data. The existence of abnormal data caused the obtained first fitting model to be inconsistent with business experience and historical data. There are many reasons for the occurrence of abnormal data. For example, errors occur during the collection or entry process, or there are errors and abnormalities in the sample data itself.

This conjecture is verified below through leverage ratio analysis.

Based on the previously obtained first fitting model, the leverage ratio of each group of sample data is analyzed. The leverage ratio reflects the degree of influence of each group of sample data on the regression coefficient of the first fitting model. For multiple linear regression, the leverage matrix is derived from the normal equation for the coefficients, which is solved by the ordinary least squares (OLS) method. The calculation formula of the leverage matrix is:

$$H = X(X^{T}X)^{-1}X^{T} \qquad \text{(Formula 6)}$$

The H matrix reflects the projection of the actual observed values of each set of sample data onto the predicted values, which is equivalent to converting the actual observed values into predicted values through the H matrix. The leverage ratio of the i-th group of sample data corresponds to the value of the i-th element on the diagonal of the H matrix. In the above example, we calculated the leverage statistics of 55 groups of sample data as shown in Table 2.

Table 2

Time	Leverage statistic
January 2016	0.365953
February 2016	0.375185
March 2016	0.001001
April 2016	0.000212
May 2016	0.008014
June 2016	0.003456
July 2016	0.000147
August 2016	0.035591
September 2016	0.002877
October 2016	0.000216
November 2016	0.000509
December 2016	0.011052
January 2017	0.007350
February 2017	0.038006
March 2017	0.017786
April 2017	0.013282
May 2017	0.007932
June 2017	0.096534
July 2017	0.002592
August 2017	0.007581
September 2017	0.021632
October 2017	0.002901
November 2017	0.009597
December 2017	0.008300
January 2018	0.002600
February 2018	0.005619
March 2018	0.010188
April 2018	0.033633
May 2018	0.026580
June 2018	0.017939
July 2018	0.025724
August 2018	0.023665
September 2018	0.029455
October 2018	0.090673
November 2018	0.066464
December 2018	0.003673
January 2019	0.286948
February 2019	0.080626
March 2019	0.017527
April 2019	0.014001
May 2019	0.009724
June 2019	0.009663
July 2019	0.012319
August 2019	0.183356
September 2019	0.012275
October 2019	0.011623
November 2019	0.020266
December 2019	0.007863
January 2020	0.423567
February 2020	0.017872
March 2020	0.190481
April 2020	0.022284
May 2020	0.026010
June 2020	0.260477
July 2020	0.019199

It can be found that the leverage ratio values corresponding to the first two groups of sample data are 0.365953 and 0.375185 respectively, which are far greater than twice the average of the leverage ratio statistic. Therefore, it can be judged that the first two groups of sample data are relatively extreme data. The existence of such extreme data is very likely the reason why the first fitting model obtained does not conform to business experience and historical data.
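The leverage check described above can be sketched as follows. The sketch computes only the diagonal of the leverage matrix, since the leverage ratio of the i-th group is the i-th diagonal element, and then flags the groups whose leverage exceeds twice the average. The helper names and the five rows are hypothetical; the last row is deliberately extreme.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]    # augmented copy
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(m[i][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[piv] = m[piv], m[col]
        for row in range(col + 1, n):
            f = m[row][col] / m[col][col]
            for c in range(col, n + 1):
                m[row][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                       # back substitution
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def leverages(X):
    """Diagonal of H = X (X^T X)^-1 X^T, one value per row of X."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    # h_ii = x_i (X^T X)^-1 x_i^T, computed by solving (X^T X) z = x_i.
    return [sum(v * z for v, z in zip(row, solve(xtx, row))) for row in X]

# Hypothetical design matrix; the last row lies far from the others.
X = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [2.0, 2.0], [20.0, 1.0]]
h = leverages(X)
mean_h = sum(h) / len(h)
extreme = [i for i, v in enumerate(h) if v > 2 * mean_h]
print(h, extreme)
```

The leverage values sum to the number of columns of X (here 2), so the average leverage is p/n, and only the extreme row exceeds twice that average in this example.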

However, using leverage ratio analysis to detect abnormal data in sample data is not accurate enough and is not universal. Therefore, a method is provided that determines the strong influence point data in the target n groups of sample data, and determines the relationship satisfied between the target dimension and each independent variable dimension based on the sample data from which the strong influence point data has been removed, which is more accurate. See steps 202-204 for details.

In step 202, the influence degree of the candidate group of sample data on the accuracy of the target model is determined based on the target n groups of sample data and the n-1 groups of sample data other than the candidate group of sample data; the candidate group of sample data is any group of sample data among the target n groups of sample data.

Each group of sample data in the target n groups of sample data is traversed, and the influence of eliminating that group on the accuracy of the target model is calculated. For example, with the 1st group of sample data as the candidate group, the influence degree of the 1st group on the accuracy of the target model is determined based on the first fitting model, fitted to the 55 groups of sample data, and the second fitting model, fitted to the 54 groups of sample data other than the 1st group. With the 2nd group of sample data as the candidate group, the influence degree of the 2nd group is determined based on the first fitting model and the second fitting model fitted to the 54 groups of sample data other than the 2nd group. With the 3rd group of sample data as the candidate group, the influence degree of the 3rd group is determined based on the first fitting model and the second fitting model fitted to the 54 groups of sample data other than the 3rd group. By analogy, the influence degree of each of the 55 groups of sample data on the accuracy of the target model is obtained.

The specific way to calculate the influence degree of any candidate group of sample data is as follows: fit the target n groups of sample data to obtain the first fitting coefficient of the first fitting model; fit the n-1 groups of sample data to obtain the second fitting coefficient of the second fitting model; and determine the influence degree according to the first fitting coefficient, the second fitting coefficient, the number of independent variable dimensions included in the target model, and the mean square error of the first fitting model.

The specific formula is as follows:

$$D_i = \frac{(\hat{\beta}-\hat{\beta}_{(i)})^{T}X^{T}X(\hat{\beta}-\hat{\beta}_{(i)})}{p\,s} = \frac{\sum_{j=1}^{n}\left(\hat{y}_j-\hat{y}_{j(i)}\right)^{2}}{p\,s}$$

Here, p is the number of independent variable dimensions included in the model; s is the mean square error of the first fitting model; $\hat{\beta}$ is the regression coefficient matrix obtained by fitting the target n groups of sample data, that is, the first fitting coefficient; $\hat{\beta}_{(i)}$ is the regression coefficient matrix after eliminating the i-th group of sample data, that is, the second fitting coefficient; $\hat{y}_j$ is the predicted value obtained by fitting the target n groups of sample data; and $\hat{y}_{j(i)}$ is the predicted value after excluding the i-th group of sample data. The i-th group of sample data here is the candidate group of sample data. In this example, p=3. s is calculated by the following formula:

$$s = \frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}}{n-p}$$

where n is the number of groups of sample data, and n-p represents the degrees of freedom of the first fitting model. In this example, n=55.
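The quantities defined above match what is commonly called Cook's distance. Assuming that reading, the leave-one-out calculation can be sketched as follows: fit all n groups, refit with each group removed in turn, and compare the predictions, scaled by p times the mean square error of the full fit. The helper names and the eight data rows are hypothetical; the last row is deliberately corrupted so that it no longer satisfies y ≈ x1 + x2.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]    # augmented copy
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(m[i][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[piv] = m[piv], m[col]
        for row in range(col + 1, n):
            f = m[row][col] / m[col][col]
            for c in range(col, n + 1):
                m[row][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                       # back substitution
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(X, y):
    """Regression coefficients from the normal equations (no intercept)."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

def predict(beta, X):
    return [sum(b * v for b, v in zip(beta, row)) for row in X]

def influence_degrees(X, y):
    """Leave-one-out influence of each group (Cook's-distance form, assumed)."""
    n, p = len(X), len(X[0])
    fitted = predict(ols(X, y), X)                       # first fitting model
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    out = []
    for i in range(n):
        beta_i = ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])   # second model
        fitted_i = predict(beta_i, X)
        num = sum((f - g) ** 2 for f, g in zip(fitted, fitted_i))
        out.append(num / (p * mse))
    return out

# Hypothetical data: y is close to x1 + x2, except the last row (19 expected).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0],
     [5.0, 5.0], [2.0, 3.0], [3.0, 2.0], [10.0, 9.0]]
y = [3.01, 2.99, 7.02, 6.98, 10.0, 5.01, 4.99, 30.0]
d = influence_degrees(X, y)
most_influential = d.index(max(d))
print(d, most_influential)
```

Refitting n times is affordable for 55 groups; for large n an equivalent closed form based on residuals and leverage values is usually preferred.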

The influence degree reflects the influence of each group of sample data on the accuracy of the target model. In principle, for a normal model, the influence degrees of the groups of sample data on the model are similar. The greater the influence degree of a group, the greater the probability that the sample data of that group is abnormal. Table 3 shows a possible influence degree of each group of sample data.

Table 3

Time	Influence
January 2016	1.868793e+00
February 2016	2.350884e+00
March 2016	7.227036e-07
April 2016	1.102575e-06
May 2016	2.469101e-06
June 2016	1.196689e-06
July 2016	1.097542e-09
August 2016	1.162187e-04
September 2016	2.600043e-07
October 2016	7.922831e-08
November 2016	1.176777e-08
December 2016	1.186501e-05
January 2017	5.410101e-06
February 2017	2.026929e-04
March 2017	5.014946e-05
April 2017	3.358984e-05
May 2017	5.519991e-06
June 2017	8.286230e-04
July 2017	1.477823e-06
August 2017	1.028268e-07
September 2017	7.183496e-08
October 2017	5.900770e-07
November 2017	2.246217e-05
December 2017	8.486959e-06
January 2018	6.901573e-07
February 2018	6.816755e-06
March 2018	4.340092e-06
April 2018	2.721236e-05
May 2018	9.605791e-05
June 2018	4.748002e-05
July 2018	3.163452e-06
August 2018	1.223781e-06
September 2018	1.929106e-04
October 2018	1.808463e-03
November 2018	6.676218e-06
December 2018	1.287415e-05
January 2019	2.572175e-02
February 2019	1.828469e-03
March 2019	1.139297e-05
April 2019	2.404209e-05
May 2019	3.648963e-03
June 2019	1.546139e-05
July 2019	8.853532e-06
August 2019	3.971031e-04
September 2019	3.498020e-05
October 2019	2.538927e-05
November 2019	6.540673e-05
December 2019	9.437233e-05
January 2020	5.881022e-02
February 2020	6.277971e-06
March 2020	7.816644e-03
April 2020	5.175514e-04
May 2020	5.400912e-06
June 2020	5.814514e+00
July 2020	5.945416e-05

Table 3 shows the influence degree corresponding to the sample data of each candidate group obtained after eliminating the sample data of each candidate group.

In step 203, w groups of strong influence point data are removed from the target n groups of sample data according to the influence degrees of the n candidate groups of sample data; 1≤w<n.

Strong influence point data has a greater impact on the accuracy of the target model, so it should be eliminated. The embodiment of the present invention does not limit the method of determining strong influence point data.

One possible way is to set the threshold according to the experience and needs of the operation and maintenance personnel. If the accuracy requirements of the target model are relatively high, the threshold for strong influence point data is set to a larger value; if the accuracy requirements of the target model are not too high, the threshold is set slightly lower. For example, the threshold is set to 4/n based on experience, where n is the number of groups in the target n groups of sample data. In this example, n=55. If the influence degree of any candidate group of sample data is greater than the threshold, the candidate group of sample data is determined to be strong influence point data and is eliminated.
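The 4/n rule can be illustrated with a few of the influence degrees listed in Table 3. This is a minimal sketch: only seven of the 55 values are included, the group numbers are the 1-based positions of the months in the table, and the variable names are hypothetical.

```python
n = 55                 # number of groups of sample data
threshold = 4 / n      # empirical threshold from the text, about 0.0727

# A subset of Table 3, keyed by 1-based group number.
influence = {
    1: 1.868793e+00,    # January 2016
    2: 2.350884e+00,    # February 2016
    37: 2.572175e-02,   # January 2019
    49: 5.881022e-02,   # January 2020
    51: 7.816644e-03,   # March 2020
    54: 5.814514e+00,   # June 2020
    55: 5.945416e-05,   # July 2020
}

# Groups whose influence degree exceeds the threshold are strong influence points.
strong = sorted(g for g, d in influence.items() if d > threshold)
print(strong)
```

With this threshold, the 1st, 2nd and 54th groups are flagged, while January 2020 (0.0588) stays just below the threshold.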

Another possible way is to use the F distribution to determine strong influence point data. Specifically, for any candidate group of sample data, if the influence degree of the candidate group of sample data on the accuracy of the target model is greater than the first quartile of the F distribution with (p, n-p-1) degrees of freedom, the candidate group of sample data is determined to be strong influence point data, where p is the number of independent variable dimensions included in the target model; the strong influence point data is then removed from the target n groups of sample data.

For example, p=3 and n=55, so the strong influence point data is determined through the F distribution with (3, 51) degrees of freedom. The influence degree corresponding to any candidate group of sample data in Table 3 is compared with the first quartile of the F distribution with (3, 51) degrees of freedom. If it is greater than this value, the candidate group is determined to be strong influence point data.

After the strong influence point data is determined, it is removed. For example, when the first method provided by the embodiment of the present invention is used to determine strong influence point data, the 1st group, the 2nd group and the 54th group of sample data are finally removed. Verification showed that the 1st and 2nd groups of sample data were indeed abnormal samples, whereas the 54th group was a sample that actually conformed to the model but whose data fluctuated greatly. The above method therefore cannot eliminate only the abnormal sample data and may also eliminate a small number of non-abnormal samples such as the 54th group; however, eliminating a small number of non-abnormal samples has no substantial impact on the target model.

In step 204, the target model is determined based on the retained sample data.

For example, after removing 3 sets of strong influence point data in step 203, the target model is determined based on the remaining 52 sets of sample data. The determined target model is: A=1×B+1×C+1×D+1.31e-10, where 1.31e-10 is the intercept constant and can be ignored.

It can be found that after removing the strong influence point data, the obtained target model is consistent with business experience and historical data, and has business interpretability.

The above introduces the process of determining the target model when the target dimension is dimension A. When the target dimensions are dimension B, dimension C, dimension D, and dimension E, the respective target models can be determined according to the process of steps 201-204 respectively. The dimensions included in different target models may be different. In this way, 5 corresponding target models are determined for 5 dimensions.

In some embodiments, sample data in the test set are also used to test and verify each target model obtained above.

For example, previously in step 101, n groups of sample data were divided into training sets and test sets, and the training set was used as the initial n groups of sample data for subsequent determination of the target model. The test set is used to test and validate the target model. Alternatively, the test set can also be obtained through other means. For example, very accurate sample data provided by operation and maintenance personnel that have been determined to have no abnormal data can also be used as a test set.

After determining the target model based on the retained sample data, the method further includes: inputting test data into the target model for testing; obtaining the average absolute error rate of the target model; and determining that the fitting degree parameter obtained when the target model is fitted to the retained sample data and the average absolute error rate each meet preset thresholds.

For example, after step 204, five target models corresponding to the five dimensions are obtained, and the fitting degree parameter of each target model, which represents the goodness of fit of the target model, is obtained correspondingly. The 6 test data of the test set are input into the first target model, and the average absolute error rate is calculated according to the following formula:

Here, $y$ is the actual value of the 6 test data, and $\hat{y}$ is the predicted value obtained based on the target model. In this way, the average absolute error rate of the first target model is obtained.
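The formula for the average absolute error rate is not reproduced in this text. The sketch below assumes the common definition, namely the mean over the test data of |y − ŷ| / |y|, which is an assumption and not necessarily the exact formula of the embodiment. The function name and the six test values are hypothetical, and the actual values are assumed to be non-zero.

```python
def mean_absolute_error_rate(actual, predicted):
    """Mean of |actual - predicted| / |actual| (assumed definition)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be non-zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return total / len(actual)

# Six hypothetical test values and the corresponding model predictions.
actual = [100.0, 200.0, 50.0, 80.0, 120.0, 400.0]
predicted = [101.0, 198.0, 50.0, 80.0, 120.0, 400.0]
rate = mean_absolute_error_rate(actual, predicted)
print(rate)
```

Two of the six predictions are off by 1% each, so the rate is 0.02/6.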

In the same way, the average absolute error rate of each target model can be obtained.

One possible way is to screen each target model based on its average absolute error rate and fitness parameters. For example, three of the target models are selected for subsequent detection of abnormal data on the data to be detected.

Another possible way is to score each target model according to its average absolute error rate and fitting degree parameter, and to determine the absolute equations, which meet a first preset condition, and the approximate equations, which meet a second preset condition. When the target models are subsequently used for abnormal data detection, different weights can be given to the scores of the absolute equations and the scores of the approximate equations. For example, a target model with an average absolute error rate less than 0.01 and a fitting degree parameter greater than 0.999 is determined to be an absolute equation; a target model with an average absolute error rate greater than or equal to 0.01 and less than 0.1 and a fitting degree parameter greater than 0.9 is determined to be an approximate equation.
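The example thresholds above can be sketched as a small classification function. The function name `classify` and the label "rejected" for models that meet neither condition are hypothetical; the text only defines the two accepted categories.

```python
def classify(error_rate, fit):
    """Classify a target model using the example thresholds from the text."""
    if error_rate < 0.01 and fit > 0.999:
        return "absolute"        # absolute equation
    if 0.01 <= error_rate < 0.1 and fit > 0.9:
        return "approximate"     # approximate equation
    return "rejected"            # hypothetical label: meets neither condition

# Hypothetical (average absolute error rate, fitting degree) pairs.
labels = [
    classify(0.003, 0.9995),
    classify(0.05, 0.95),
    classify(0.2, 0.95),
]
print(labels)
```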

In this way, the target models corresponding to the target dimensions are not used directly to detect abnormal data, but are screened once more. The test data is input into any target model for testing to obtain the average absolute error rate of that target model. If the fitting degree parameter of the target model and the average absolute error rate each meet the preset thresholds, the fitting accuracy of the target model is high and the target model can be used for subsequent abnormal data detection. This improves the accuracy of abnormal data detection.

In order to better explain the embodiment of the present invention, the above process of determining the target model will be described below in a specific implementation scenario. Figure 4 shows a detailed flow chart for determining the target model.

Step 401: Read sample data.

Step 402: Preprocess the sample data.

Step 403: Divide the sample data into a training set and a test set.

The training set is used as the initial n sets of sample data for determining the target model.

Step 404: For any target dimension, select K independent variable dimensions that are correlated with the target dimension from M dimensions based on the correlation coefficients of the initial n sets of sample data.

Step 405: Obtain target n groups of sample data distributed in the target dimension and K independent variable dimensions.

Step 406: For any candidate group of sample data, determine the degree of influence of the candidate group of sample data on the accuracy of the target model based on the target n group of sample data and n-1 groups of sample data other than the candidate group of sample data.

Step 407: Determine whether the influence degree is greater than 4/n. If it is greater, go to step 408. If it is not greater, go to step 409.

Step 408: Eliminate the candidate group sample data.

Step 409: Retain the candidate group sample data.

Step 410: Determine the target model corresponding to the target dimension based on the retained sample data.

By looping steps 404-410, multiple target models corresponding to multiple dimensions can be obtained.

Step 411: Use the test set data to evaluate each target model and obtain absolute equations and approximate equations.

Next, abnormal data detection on the data to be detected using the obtained target models is described.

Figure 5 shows a possible abnormal data detection method, including:

Step 501: Input W data to be detected distributed in W dimensions into each target model corresponding to each target dimension.

Step 502: For any target model, if it is determined that W data to be detected do not satisfy the target model, then determine each dimension included in the target model as an abnormal dimension.

Step 503: For any abnormal dimension, determine the abnormal probability that the abnormal dimension is determined to be an abnormal dimension in each target model.

Step 504: Determine whether the data to be detected corresponding to the abnormal dimension is abnormal data according to the abnormality probability.

In step 501, W data to be detected distributed in W dimensions are input into each target model corresponding to each target dimension. The embodiment of the present invention does not limit W dimensions.

One possible way is that the W dimensions include at least the M dimensions used to determine the target models. For example, the W dimensions are exactly the same as the M dimensions used to determine the target models, that is, the data to be detected is also distributed in the five dimensions of dimension A, dimension B, dimension C, dimension D and dimension E. As another example, in addition to the M dimensions used to determine the target models, the W dimensions also include other dimensions.

Another possible way is that the W dimensions may include some of the M dimensions. For example, among the determined target models, since the test set test eliminates part of the target model, the remaining target models do not contain certain dimensions, so the dimensions of the data to be tested do not need to include this dimension. For example, among the five target models corresponding to the previously determined five dimensions, only three target models meet the test set, so only these three target models are used for the detection of the data to be detected. These three target models only include 4 dimensions: dimension A, dimension B, dimension C and dimension D. Then the W dimensions of the data distribution to be detected can also be only dimension A, dimension B, dimension C and dimension D, that is, dimension E is not included.

An example of possible data to be detected is shown in Table 4.

Table 4

The W data to be detected are input into each target model respectively, for example, into three target models. The three target models are:

Target model 1: A=B+C+D;

Target model 2: B=C+D;

Target model 3: C=A+B.

In step 502, for any target model, if it is determined that W data to be detected do not satisfy the target model, then each dimension included in the target model is determined as an abnormal dimension.

For example, for target model 1, the data to be detected shown in Table 4 is input into target model 1. If it is determined that the average absolute error rate does not meet the preset threshold, then dimension A, dimension B, dimension C and dimension D included in target model 1 are determined to be abnormal dimensions. For target model 2, the data to be detected shown in Table 4 is input into target model 2. If it is determined that the average absolute error rate meets the preset threshold, no operation is performed. For target model 3, the data to be detected shown in Table 4 is input into target model 3. If it is determined that the average absolute error rate does not meet the preset threshold, then dimension A, dimension B and dimension C included in target model 3 are determined to be abnormal dimensions.

To sum up, the dimensions identified as abnormal dimensions include dimension A, dimension B, dimension C and dimension D.

In step 503, for any abnormal dimension, determine the abnormal probability that the abnormal dimension is determined to be an abnormal dimension in each target model.

One possible way is to determine the ratio of the number of target models in which the abnormal dimension appears and is determined to be an abnormal dimension to the number of target models in which the abnormal dimension appears, and to determine the abnormality probability of the abnormal dimension based on this ratio.

For example, dimension A appears twice in the target models and is determined to be an abnormal dimension in two target models (target model 1 and target model 3). Therefore, the abnormality probability of dimension A is 2/2=1. Dimension B appears three times in the target models and is determined to be an abnormal dimension in two target models (target model 1 and target model 3). Therefore, the abnormality probability of dimension B is 2/3. Dimension C appears three times in the target models and is determined to be an abnormal dimension in two target models (target model 1 and target model 3). Therefore, the abnormality probability of dimension C is 2/3. Dimension D appears twice in the target models and is determined to be an abnormal dimension in one target model (target model 1). Therefore, the abnormality probability of dimension D is 1/2.
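The counting in this example can be sketched as follows, using the three target models and the outcomes described above (target models 1 and 3 are not satisfied, target model 2 is satisfied). The data structure and the function name are hypothetical.

```python
from fractions import Fraction

# Dimensions included in each target model and whether the data to be
# detected failed to satisfy it (outcomes taken from the example in the text).
models = {
    "target model 1": {"dims": {"A", "B", "C", "D"}, "violated": True},   # A=B+C+D
    "target model 2": {"dims": {"B", "C", "D"}, "violated": False},       # B=C+D
    "target model 3": {"dims": {"A", "B", "C"}, "violated": True},        # C=A+B
}

def abnormal_probabilities(models):
    """Ratio of violated models containing a dimension to all models containing it."""
    appearances, hits = {}, {}
    for model in models.values():
        for dim in model["dims"]:
            appearances[dim] = appearances.get(dim, 0) + 1
            if model["violated"]:
                hits[dim] = hits.get(dim, 0) + 1
    # Only dimensions flagged at least once are abnormal dimensions.
    return {dim: Fraction(hits[dim], appearances[dim]) for dim in hits}

probs = abnormal_probabilities(models)
print(probs)
```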

One possible way is to assign different weights to the probability values obtained from absolute equations and approximate equations, so as to locate abnormal data more accurately. For example, if target model 1 is an absolute equation, then when a dimension is determined to be an abnormal dimension in target model 1, that occurrence is multiplied by 1 when counting the number of times the dimension is determined to be abnormal; if target model 3 is an approximate equation, then when a dimension is determined to be an abnormal dimension in target model 3, that occurrence is multiplied by 0.8.
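A weighted variant might look like the sketch below. The text only states that each abnormal determination is multiplied by a weight (1 for an absolute equation, 0.8 for an approximate equation); leaving the denominator unweighted, treating target model 2 as an absolute equation, and the membership of target model 2 are all assumptions made for illustration.

```python
# Weight per target model: 1 for an absolute equation, 0.8 for an approximate one.
# The type and membership of model 2 are assumptions; the text does not state them.
model_weight = {1: 1.0, 2: 1.0, 3: 0.8}
model_dims = {1: {"A", "B", "C", "D"}, 2: {"B", "C", "D"}, 3: {"A", "B", "C"}}
failed_models = {1, 3}


def weighted_abnormal_probability(dim):
    """Weighted count of abnormal determinations divided by the plain appearance count."""
    appearances = [m for m, dims in model_dims.items() if dim in dims]
    weighted_hits = sum(model_weight[m] for m in appearances if m in failed_models)
    return weighted_hits / len(appearances)


weighted = {d: weighted_abnormal_probability(d) for d in "ABCD"}
print(weighted)
```

Under these assumptions, dimension A is still ranked highest, but its value drops below 1 because one of its two abnormal determinations comes from an approximate equation.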

In step 504, it is determined according to the abnormality probability whether the data to be detected corresponding to the abnormal dimension is abnormal data.

Compare the abnormality probability of any abnormal dimension with a preset threshold. If it is greater than the preset threshold, the data to be detected corresponding to the abnormal dimension is determined to be abnormal data; if it is not greater than the preset threshold, the data to be detected corresponding to the abnormal dimension is determined not to be abnormal data. The preset threshold can be set based on the experience and needs of those skilled in the art, and is not limited here. Alternatively, the data to be detected corresponding to the abnormal dimensions whose abnormality probabilities rank in the top N is determined to be abnormal data.
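Both decision rules can be sketched as follows. The threshold value 0.9 and the function names are hypothetical; the probabilities are those of the example for step 503.

```python
def abnormal_by_threshold(probabilities, threshold):
    """Dimensions whose abnormality probability is strictly greater than the threshold."""
    return sorted(d for d, p in probabilities.items() if p > threshold)


def abnormal_by_top_n(probabilities, n):
    """Dimensions with the n highest abnormality probabilities (ties broken by name)."""
    ranked = sorted(probabilities.items(), key=lambda item: (-item[1], item[0]))
    return [d for d, _ in ranked[:n]]


# Abnormality probabilities of dimensions A to D from the example.
probabilities = {"A": 1.0, "B": 2 / 3, "C": 2 / 3, "D": 0.5}

by_threshold = abnormal_by_threshold(probabilities, threshold=0.9)  # hypothetical threshold
by_top_n = abnormal_by_top_n(probabilities, n=1)
print(by_threshold, by_top_n)
```

With either rule, only dimension A is selected in this example.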

For example, the data to be detected 100.5 corresponding to dimension A is determined as abnormal data.

By inputting the W data to be detected distributed in W dimensions into each target model corresponding to each target dimension, the abnormal dimensions are determined. The abnormal probability that each abnormal dimension is determined to be an abnormal dimension across the target models is then determined, and whether the data to be detected corresponding to the abnormal dimension is abnormal data is determined based on this abnormal probability. This not only detects the presence of abnormal data among the W data to be detected, but also accurately locates which dimension's data among the W data to be detected is abnormal. Automatic and accurate locating of abnormal data is thus achieved, eliminating the need for manual review.

The embodiment of the present invention also provides another abnormal data detection method. That is, after each dimension included in the target model is determined as an abnormal dimension, the method further includes: for any abnormal dimension, obtaining each historical data corresponding to the abnormal dimension; and determining the anomaly score of the data to be detected corresponding to the abnormal dimension by clustering the data to be detected corresponding to the abnormal dimension and each historical data. Determining whether the data to be detected corresponding to the abnormal dimension is abnormal data according to the abnormality probability then includes: determining whether the data to be detected corresponding to the abnormal dimension is abnormal data based on the abnormality probability and the anomaly score.

The above method applies to any abnormal dimension. For example, for the data to be detected in dimension A, the historical data of dimension A is also obtained. The data to be detected and the historical data of dimension A are clustered to obtain the anomaly score of dimension A.

The embodiment of the present invention does not specifically limit the method of clustering to obtain anomaly scores.

One possible way is to use k-means for clustering, and to determine the distance between the data to be detected in any abnormal dimension and each historical data as the anomaly score. For example, if the distance between the data to be detected in dimension A and each historical data is relatively large, the similarity is small and the anomaly score, being this distance, is correspondingly large.

Another possible way is to obtain the anomaly score of any abnormal dimension by constructing an isolated binary tree. Specifically, the method includes: constructing an isolated binary tree for the data to be detected corresponding to the abnormal dimension and each historical data; and calculating the anomaly score of the data to be detected corresponding to the abnormal dimension in the isolated binary tree.

For example, for dimension A, the historical data obtained for dimension A are: 19.49, 20.23, 25.34, 49.12, 36.66. An ensemble learning method is applied to the data to be detected and the historical data (19.49, 20.23, 25.34, 49.12, 36.66, 100.5): iterate N times (for example, 100), building an isolated binary tree each time. Based on the decision tree algorithm, the data to be detected and the historical data are cut at random. Each cut can produce an independent leaf node. New leaf nodes are cut in this way until the tree reaches the specified height or no further cut is possible, at which point the algorithm ends.

The specific steps to construct an isolated binary tree are as follows: (1) First, randomly select a split point between the minimum and maximum values (19.49 and 100.5) of all sample data; assume that the random value is 60.2. (2) Place the data nodes in the sample that are greater than the split point value 60.2 on the right branch of the tree, and the data nodes that are less than or equal to 60.2 on the left branch of the tree. (3) Repeat steps (1) and (2) on each branch until all data nodes have been randomly divided into isolated leaf nodes or the tree reaches the specified height.

The first isolated tree is randomly constructed according to the above steps, as shown in Figure 6, and the five random split points are (60.2, 34, 42.2, 22.5, 20).
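The construction of this first tree can be replayed with the sketch below, which uses the five split points of the example instead of random values. How each split point is assigned to a node (the first listed value lying between the minimum and maximum of that node's data) is an assumption, as are the function and variable names.

```python
def path_lengths(values, split_points, depth=0):
    """Return {value: number of edges from the root to its leaf}.

    Each node uses the first listed split point s with min <= s < max, so that
    both branches are non-empty; values <= s go left and values > s go right.
    """
    if len(values) <= 1:
        return {v: depth for v in values}
    lo, hi = min(values), max(values)
    split = next((s for s in split_points if lo <= s < hi), None)
    if split is None:  # no usable split point: stop cutting here
        return {v: depth for v in values}
    left = [v for v in values if v <= split]
    right = [v for v in values if v > split]
    result = path_lengths(left, split_points, depth + 1)
    result.update(path_lengths(right, split_points, depth + 1))
    return result


samples = [19.49, 20.23, 25.34, 49.12, 36.66, 100.5]
splits = [60.2, 34, 42.2, 22.5, 20]  # split points of the first tree in the example
depths = path_lengths(samples, splits)
for value in sorted(depths):
    print(value, depths[value])
```

In this replay the sample 100.5 is isolated by the very first cut, while the historical values need three or four cuts, which is the property the anomaly score relies on.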

Calculate the PathLength of each leaf node as h(x):

h(x) = e + c(T.size)    (Formula 10)

where e is the number of edges traversed from the root node of the tree to the leaf node, that is, the number of splits; T.size is the number of samples located at the same leaf node as the sample x; and c(T.size) can be regarded as a correction value, indicating the average path length of a binary tree constructed from T.size samples.

Here, the constant used in calculating c(n) is Euler's constant, 0.5772156649. Taking the calculation of the PathLength of the sample 100.5 as an example, the number of edges from the root node to the node 100.5 is e = 1 and T.size = 1, so the PathLength of the point 100.5 is 1 + c(1), where c(1) is obtained from the formula for calculating c(n).
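The expression for c(n) is not reproduced in this text. Assuming it is the normalisation term of the standard isolation forest algorithm, which matches the description given here (the average path length of a binary tree built from n samples, involving Euler's constant), it can be written as:

```latex
c(n) =
\begin{cases}
2H(n-1) - \dfrac{2(n-1)}{n}, & n > 2 \\
1, & n = 2 \\
0, & n \le 1
\end{cases}
\qquad H(i) \approx \ln(i) + 0.5772156649
```

Under this assumption, c(1) = 0, so the PathLength of the point 100.5 in the first tree is h(x) = 1 + 0 = 1, and for the six samples of the example c(6) = 2(ln 5 + 0.5772156649) - 2·5/6 ≈ 2.7066.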

To ensure randomness, and thus the accuracy of the anomaly score, the above procedure is iterated randomly N times (100 by default), constructing 100 random isolated binary trees, and the PathLength of the node 100.5, that is, h(x), is calculated on each tree. The isolated forest anomaly score is then calculated according to the following formula:

where n is the number of samples, E(h(x)) is the average PathLength of the sample over the 100 isolated trees, and c(n) is the average path length of a tree built from n samples. In the above example, the number of samples is n = 6, and c(6) is calculated according to the formula for calculating c(n) above. Finally, the anomaly score of the node to be detected corresponding to dimension A, that is, the sample data 100.5, is calculated.
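The scoring procedure can be sketched as follows. The score formula is not reproduced in this text; the sketch assumes the standard isolation forest score s(x, n) = 2^(-E(h(x)) / c(n)) together with the standard form of c(n). The function names, the fixed random seed and the absence of a height limit are illustrative choices, not the method of the embodiment.

```python
import math
import random

EULER = 0.5772156649


def c(n):
    """Average path length of a binary tree built from n samples (standard form, assumed)."""
    if n > 2:
        return 2.0 * (math.log(n - 1) + EULER) - 2.0 * (n - 1) / n
    return 1.0 if n == 2 else 0.0


def path_length(x, values, rng, edges=0):
    """h(x) = e + c(T.size) along the path of x in one randomly built tree."""
    lo, hi = min(values), max(values)
    if len(values) <= 1 or lo == hi:
        return edges + c(len(values))
    split = rng.uniform(lo, hi)
    if split >= hi:  # guard: keep both branches non-empty
        split = lo
    # Keep only the branch that x falls into; other branches do not affect h(x).
    branch = [v for v in values if (v <= split) == (x <= split)]
    return path_length(x, branch, rng, edges + 1)


def anomaly_score(x, values, n_trees=100, seed=0):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)), with E taken over n_trees random trees."""
    rng = random.Random(seed)
    mean_h = sum(path_length(x, values, rng) for _ in range(n_trees)) / n_trees
    return 2.0 ** (-mean_h / c(len(values)))


samples = [19.49, 20.23, 25.34, 49.12, 36.66, 100.5]
for value in (100.5, 25.34):
    print(value, round(anomaly_score(value, samples), 3))
```

Under this convention a score closer to 1 indicates a sample that is isolated after few cuts, so the data to be detected 100.5 receives a higher score than a typical historical value such as 25.34.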

Afterwards, it is determined whether the data to be detected corresponding to the abnormal dimension is abnormal data based on the abnormal probability corresponding to the abnormal dimension and the abnormal score.

Based on the same technical concept, FIG. 7 exemplarily shows the structure of an abnormal data detection device provided by an embodiment of the present invention, which can perform the process of abnormal data detection.

As shown in Figure 7, the device specifically includes:

Processing unit 701, used for:

In some embodiments, the processing unit 701 is also used to:

The processing unit 701 is specifically used for:

In some embodiments, the processing unit 701 is specifically used to:

The target model is determined based on the retained sample data.

In some embodiments, the processing unit 701 is specifically used to:

Remove the strong influence point data from the target n groups of sample data.

In some embodiments, the processing unit 701 is also used to:

Based on the same technical concept, the embodiment of the present application provides a computer device, as shown in Figure 8, including at least one processor 801 and a memory 802 connected to the at least one processor. The embodiment of the present application does not limit the specific connection medium between the processor 801 and the memory 802; in Figure 8, the connection between the processor 801 and the memory 802 through a bus is taken as an example. The bus can be divided into an address bus, a data bus, a control bus, etc.

In this embodiment of the present application, the memory 802 stores instructions that can be executed by at least one processor 801. At least one processor 801 can execute the steps of the above abnormal data detection method by executing the instructions stored in the memory 802.

The processor 801 is the control center of the computer device. It can use various interfaces and lines to connect the various parts of the computer device, and performs abnormal data detection by running or executing instructions stored in the memory 802 and calling data stored in the memory 802. In some embodiments, the processor 801 may include one or more processing units, and the processor 801 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, etc., and the modem processor mainly handles wireless communications. It can be understood that the above modem processor may not be integrated into the processor 801. In some embodiments, the processor 801 and the memory 802 can be implemented on the same chip, and in some embodiments, they can also be implemented on separate chips.

The processor 801 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application specific integrated circuit (ASIC), a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of this application. A general-purpose processor may be a microprocessor or any conventional processor, etc. The steps of the methods disclosed in conjunction with the embodiments of the present application can be directly executed by a hardware processor, or can be executed by a combination of hardware and software modules in the processor.

As a non-volatile computer-readable storage medium, the memory 802 can be used to store non-volatile software programs, non-volatile computer executable programs and modules. The memory 802 may include at least one type of storage medium, for example, flash memory, hard disk, multimedia card, card-type memory, Random Access Memory (RAM), Static Random Access Memory (SRAM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), magnetic memory, magnetic disk, optical disc, etc. The memory 802 may be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 802 in the embodiment of the present application can also be a circuit or any other device capable of realizing a storage function, used to store program instructions and/or data.

Based on the same technical concept, embodiments of the present invention also provide a computer-readable storage medium. The computer-readable storage medium stores a computer executable program, and the computer executable program is used to cause a computer to execute the abnormal data detection method listed in any of the above manners.

Those skilled in the art will understand that embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the present application. It will be understood that each process and/or block in the flowchart illustrations and/or block diagrams, and combinations of processes and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one process or multiple processes of the flowchart and/or one block or multiple blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory that causes a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in a process or processes of the flowchart and/or a block or blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in a process or processes of the flowchart and/or a block or blocks of the block diagram.

Obviously, those skilled in the art can make various changes and modifications to the present application without departing from the spirit and scope of the present application. In this way, if these modifications and variations of the present application fall within the scope of the claims of the present application and equivalent technologies, the present application is also intended to include these modifications and variations.

Claims

An abnormal data detection method, characterized by including:

W data to be detected distributed in W dimensions are input into each target model corresponding to each target dimension; the target model corresponding to any target dimension is obtained from sample data from which strong influence point data has been removed; the strong influence point data refers to sample data whose impact on the accuracy of the target model does not meet the preset conditions;

For any target model, if it is determined that W data to be detected do not satisfy the target model, then each dimension included in the target model is determined as an abnormal dimension;

For any abnormal dimension, determine the abnormal probability that the abnormal dimension is determined to be an abnormal dimension in each target model;

Determine whether the data to be detected corresponding to the abnormal dimension is abnormal data according to the abnormality probability.
The method according to claim 1, characterized in that after determining each dimension included in the target model as an abnormal dimension, it further includes:

For any abnormal dimension, obtain each historical data corresponding to the abnormal dimension;

Determine the anomaly score of the data to be detected corresponding to the abnormal dimension by clustering the data to be detected corresponding to the abnormal dimension and each historical data;

Determining whether the data to be detected corresponding to the abnormal dimension is abnormal data according to the abnormality probability includes:

Determine whether the data to be detected corresponding to the abnormal dimension is abnormal data based on the abnormality probability and the abnormality score.
The method of claim 2, wherein the anomaly score of the data to be detected corresponding to the abnormal dimension is determined by clustering the data to be detected corresponding to the abnormal dimension and the historical data, including:

Construct an isolated binary tree for the data to be detected corresponding to the abnormal dimension and each of the historical data;

Calculate the anomaly score of the data to be detected corresponding to the abnormal dimension in the isolated binary tree.
The method according to claim 1, characterized in that the target model corresponding to any target dimension is determined in the following manner, including:

Obtain initial n groups of sample data distributed in M dimensions; where each group of sample data has M dimensions;

For the target dimension among the M dimensions, select K independent variable dimensions that are correlated with the target dimension from the M dimensions according to the correlation coefficients of the initial n groups of sample data; the target dimension is any one of the M dimensions;

The target model is determined based on the target n groups of sample data distributed in the target dimension and the K independent variable dimensions; the target model is used to characterize the relationship satisfied between the target dimension and the K independent variable dimensions.
The method of claim 4, wherein determining the target model based on target n groups of sample data distributed in the target dimension and the K independent variable dimensions includes:

Obtain target n groups of sample data distributed in the target dimension and the K independent variable dimensions;

The influence degree of the candidate group sample data on the accuracy of the target model is determined according to the target n groups of sample data and the n-1 groups of sample data other than the candidate group sample data; the candidate group sample data is any group of sample data in the target n groups of sample data;

According to the influence degree of n candidate group sample data, w groups of strong influence point data are removed from the target n group of sample data; 1≤w<n;

The target model is determined based on the retained sample data.
The method of claim 5, wherein determining the influence degree of the candidate group sample data on the accuracy of the target model based on the target n groups of sample data and the n-1 groups of sample data other than the candidate group sample data includes:

Fit the target n groups of sample data to obtain the first fitting coefficient of the first fitting model;

Fit n-1 groups of sample data other than the candidate group sample data to obtain the second fitting coefficient of the second fitting model;

The degree of influence is determined based on the first fitting coefficient, the second fitting coefficient, the number of independent variable dimensions included in the target model, and the mean square error of the first fitting model.
The method according to claim 5, characterized in that, based on the influence degree of n candidate groups of sample data, removing w groups of strong influence point data from the target n groups of sample data includes:

For any candidate group sample data, if the influence degree of the candidate group sample data on the accuracy of the target model is greater than the first quartile of the F distribution with (p, n-p-1) degrees of freedom, then determine that the candidate group sample data is strong influence point data; where p is the number of independent variable dimensions included in the target model;

Remove the strong influence point data from the target n groups of sample data.
The method of claim 5, wherein after determining the target model based on the retained sample data, it further includes:

Input test data into the target model for testing; obtain the average absolute error rate of the target model;

Determine that the goodness-of-fit parameter of the target model for fitting the retained sample data and the average absolute error rate each meet a preset threshold.
An abnormal data detection device, characterized by including:

Processing unit for:

W data to be detected distributed in W dimensions are input into each target model corresponding to each target dimension; the target model corresponding to any target dimension is obtained from sample data from which strong influence point data has been removed; the strong influence point data refers to sample data whose impact on the accuracy of the target model does not meet the preset conditions;

For any target model, if it is determined that W data to be detected do not satisfy the target model, then each dimension included in the target model is determined as an abnormal dimension;

For any abnormal dimension, determine the abnormal probability that the abnormal dimension is determined to be an abnormal dimension in each target model;

Determine whether the data to be detected corresponding to the abnormal dimension is abnormal data according to the abnormality probability.
A computing device, characterized by including:

Memory, used to store computer programs;

A processor, configured to call a computer program stored in the memory, and execute the method according to any one of claims 1 to 8 according to the obtained program.
A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer-executable program, and the computer-executable program is used to cause a computer to execute the method described in any one of claims 1 to 8.