WO2024036709A1 - Anomalous data detection method and apparatus - Google Patents

Publication number
WO2024036709A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
dimension
target
sample data
abnormal
Prior art date
Application number
PCT/CN2022/121926
Other languages
French (fr)
Chinese (zh)
Inventor
庄海琪
林炳鑫
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2024036709A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/906 Clustering; Classification
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Definitions

  • Embodiments of the present invention relate to the field of computer technology, and in particular, to an abnormal data detection method, device, computing device and computer-readable storage medium.
  • The abnormal data detection methods currently in use have low detection accuracy. After abnormal data is detected, it still has to be checked and confirmed manually, which incurs high labor and time costs.
  • An abnormal data detection method is provided to improve the accuracy of abnormal data detection.
  • Embodiments of the present invention provide an abnormal data detection method to improve the accuracy of abnormal data detection.
  • embodiments of the present invention provide an abnormal data detection method, including:
  • W data to be detected distributed in W dimensions are input into each target model corresponding to each target dimension; the target model corresponding to any target dimension is obtained from sample data from which the strong influence point data has been removed;
  • the strong influence point data refers to sample data whose influence on the accuracy of the target model does not meet the preset conditions;
  • each dimension included in the target model is determined as an abnormal dimension
  • By inputting the W data to be detected, distributed in W dimensions, into each target model corresponding to each target dimension, the abnormal dimensions are determined. Then the abnormality probability of each abnormal dimension, that is, the probability that it is determined to be an abnormal dimension across the target models, is determined, and whether the data to be detected corresponding to the abnormal dimension is abnormal data is determined based on that probability. This not only detects whether abnormal data is present among the W data to be detected, but also accurately locates which dimension's data is abnormal. Abnormal data is thus located automatically and accurately, eliminating the need for manual review.
  • the method further includes:
  • Determining whether the data to be detected corresponding to the abnormal dimension is abnormal data according to the abnormality probability includes:
  • By obtaining the historical data of the abnormal dimension and clustering the data to be detected corresponding to the abnormal dimension together with each historical data, the anomaly score of the data to be detected corresponding to the abnormal dimension is obtained. The abnormality probability and the anomaly score are then combined to determine whether the data to be detected corresponding to the abnormal dimension is abnormal data.
  • Combining the two judgment methods takes into account both the probability that the abnormal dimension is determined to be an abnormal dimension and the historical data of the abnormal dimension, which increases the accuracy of determining abnormal data.
  • the anomaly score of the data to be detected corresponding to the abnormal dimension is determined, including:
  • the target model corresponding to any target dimension is determined in the following manner, including:
  • for a target dimension among the M dimensions, K independent variable dimensions that are correlated with the target dimension are selected from the M dimensions according to the correlation coefficients of the initial n groups of sample data; the target dimension is any of the M dimensions;
  • the target model is determined based on the target n groups of sample data distributed in the target dimension and the K independent variable dimensions; the target model is used to characterize the relationship satisfied between the target dimension and the K independent variable dimensions.
  • the initial n sets of sample data are distributed in M dimensions, but not all of these M dimensions are necessarily related. Therefore, it is necessary to select the dimensions with relevant relationships. For any target dimension, selection is made based on the correlation coefficients of the initial n groups of sample data to obtain K independent variable dimensions with correlations corresponding to the target dimension. In this way, each dimension can be used as a target dimension, and each target dimension and its independent variable dimensions can correspondingly determine a target model. Taking into account richer scenarios and situations, the accuracy of the determined target model is increased.
  • determining the target model based on target n groups of sample data distributed in the target dimension and the K independent variable dimensions includes:
  • the influence of the candidate group of sample data on the accuracy of the target model is determined according to the target n groups of sample data and the n-1 groups of sample data other than the candidate group of sample data;
  • the candidate group of sample data is any group of sample data among the target n groups of sample data;
  • w groups of strong influence point data are removed from the target n groups of sample data; 1 ≤ w < n;
  • the target model is determined based on the retained sample data.
  • w groups of strong influence point data are removed from the target n groups of sample data, and the target model is determined based on the retained sample data. In this way, the sample data that most strongly affects the accuracy of the target model is eliminated, its effect on the final target model is minimized, and the accuracy of the target model is improved.
  • Abnormal data detection is performed on the data to be detected based on a more accurate target model, which improves the accuracy of detecting abnormal data.
  • determining the impact of the candidate group sample data on the accuracy of the target model based on the target n group of sample data and n-1 groups of sample data except the candidate group sample data includes:
  • the degree of influence is determined based on the first fitting coefficient, the second fitting coefficient, the number of independent variable dimensions included in the target model, and the mean square error of the first fitting model.
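  • Determining the influence of the candidate group of sample data from the first fitting coefficient, the second fitting coefficient, the number of independent variable dimensions and the mean square error of the first fitting model improves the accuracy of the determined influence. As a result, a more accurate target model can be obtained.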
  • removing w groups of strong influence point data from the target n groups of sample data based on the influence of n candidate groups of sample data includes:
  • the method further includes:
  • The target models corresponding to the target dimensions are not used directly to detect abnormal data; instead, the target models are screened once more.
  • The average absolute error rate of each target model on the test data is obtained. If the fitting parameters of the target model and the average absolute error rate each meet their preset thresholds, the fitting accuracy of the target model is high and the model can be used for subsequent abnormal data detection, which improves the accuracy of abnormal data detection.
  • embodiments of the present invention also provide an abnormal data detection device, including:
  • Processing unit for:
  • W data to be detected distributed in W dimensions are input into each target model corresponding to each target dimension; the target model corresponding to any target dimension is obtained from sample data from which the strong influence point data has been removed;
  • the strong influence point data refers to sample data whose influence on the accuracy of the target model does not meet the preset conditions;
  • each dimension included in the target model is determined as an abnormal dimension
  • the processing unit is also used to:
  • the processing unit is specifically used for:
  • the processing unit is specifically used to:
  • the processing unit is specifically used to:
  • for a target dimension among the M dimensions, K independent variable dimensions that are correlated with the target dimension are selected from the M dimensions according to the correlation coefficients of the initial n groups of sample data; the target dimension is any of the M dimensions;
  • the target model is determined based on the target n groups of sample data distributed in the target dimension and the K independent variable dimensions; the target model is used to characterize the relationship satisfied between the target dimension and the K independent variable dimensions.
  • the processing unit is specifically used to:
  • the influence of the candidate group of sample data on the accuracy of the target model is determined according to the target n groups of sample data and the n-1 groups of sample data other than the candidate group of sample data;
  • the candidate group of sample data is any group of sample data among the target n groups of sample data;
  • w groups of strong influence point data are removed from the target n groups of sample data; 1 ≤ w < n;
  • the target model is determined based on the retained sample data.
  • the processing unit is specifically used to:
  • the degree of influence is determined based on the first fitting coefficient, the second fitting coefficient, the number of independent variable dimensions included in the target model, and the mean square error of the first fitting model.
  • the processing unit is specifically used to:
  • the processing unit is also used to:
  • an embodiment of the present invention further provides a computing device, including:
  • Memory used to store computer programs
  • a processor configured to call the computer program stored in the memory, and execute the abnormal data detection method listed in any of the above methods according to the obtained program.
  • embodiments of the present invention also provide a computer-readable storage medium that stores a computer-executable program, and the computer-executable program is used to cause a computer to execute the abnormal data detection method listed in any of the above methods.
  • Figure 1 is a schematic diagram of a method for determining a target model based on n sets of sample data provided by an embodiment of the present invention
  • Figure 2 is a schematic flowchart of a method for determining a target model provided by an embodiment of the present invention
  • Figure 3 is a schematic diagram of a fitted straight line obtained by fitting using the least squares method according to an embodiment of the present invention
  • Figure 4 is a schematic diagram of a detailed method for determining a target model provided by an embodiment of the present invention.
  • Figure 5 is a schematic diagram of a possible abnormal data detection method provided by an embodiment of the present invention.
  • Figure 6 is a schematic diagram of a constructed isolated binary tree provided by an embodiment of the present invention.
  • Figure 7 is a schematic structural diagram of an abnormal data detection device provided by an embodiment of the present invention.
  • Figure 8 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
  • Multiple linear regression: In regression analysis, a regression with two or more independent variables is called multiple regression. In practice, a phenomenon is often associated with multiple factors. Predicting or estimating the dependent variable from the optimal combination of multiple independent variables is more effective and more realistic than using only one independent variable. Therefore, multiple linear regression has greater practical significance than simple linear regression.
  • OLS: Ordinary Least Squares.
  • Degree of freedom: the number of independent or freely varying data in the sample when sample statistics are used to estimate population parameters. This is called the degree of freedom of the statistic.
  • The degrees of freedom equal the number of independent variables minus the number of quantities derived from them. For example, variance is defined from the sum of squared deviations of the samples from the mean (a quantity derived from the samples), so for N random samples its degree of freedom is N-1.
  • Decision tree: a prediction model that represents a mapping between object attributes and object values. Each node in the tree represents an object, each branch represents a possible attribute value, and each leaf node corresponds to the value of the object represented by the path from the root node to that leaf node.
  • A decision tree has only a single output. To obtain multiple outputs, independent decision trees can be built to handle the different outputs. Decision trees are frequently used in data mining, both to analyze data and to make predictions.
  • Strong influence points: data points that strongly influence the parameter estimation of the multiple linear regression model. Since multiple linear regression uses the least squares method for parameter estimation, all records are treated equally. Records that lie far from the main body of the multidimensional data pull the fitted model toward them. Identifying strong influence points is therefore an important issue in multiple linear regression. Strong influence points are data that have a great impact on the stability and authenticity of parameter estimates; for a regression data set, they are the points that have a very large influence on the values of statistics.
  • A method can be designed that automatically mines the models consistently satisfied between data of different dimensions by analyzing sample data, and then uses these models to detect abnormal data in the data to be detected, thereby improving overall data quality.
  • embodiments of the present invention provide the following method for determining a target model based on n sets of sample data, as shown in Figure 1, including:
  • Step 101: Obtain initial n groups of sample data distributed in M dimensions, wherein each group of sample data has M dimensions.
  • Step 102: For a target dimension among the M dimensions, select K independent variable dimensions that are correlated with the target dimension from the M dimensions according to the correlation coefficients of the initial n groups of sample data; the target dimension is any of the M dimensions.
  • Step 103: Determine the target model based on the target n groups of sample data distributed in the target dimension and the K independent variable dimensions; the target model is used to characterize the relationship satisfied between the target dimension and the K independent variable dimensions.
  • In step 101, initial n groups of sample data are obtained, and each group of sample data is distributed in M dimensions.
  • the embodiments of the present invention do not limit the ways and methods for obtaining sample data.
  • The sample data can be read automatically from a database or imported manually, for example by reading a training data file in Excel format with the pandas library, or by reading the database in other ways.
  • The sample data that has been read can be preliminarily preprocessed; in particular, missing values are filled by default. Depending on the data characteristics, missing values can be filled with 0 or with the median. The data can also be cleaned into a format that meets the algorithm's training requirements.
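  • The missing-value filling described above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions: the function name `fill_missing`, the column names and the file name in the comment are hypothetical and do not come from the original.

```python
import pandas as pd

def fill_missing(df: pd.DataFrame, strategy: str = "zero") -> pd.DataFrame:
    """Fill missing values in numeric columns with 0 or with each column's median."""
    numeric_cols = df.select_dtypes(include="number").columns
    filled = df.copy()
    if strategy == "zero":
        filled[numeric_cols] = filled[numeric_cols].fillna(0)
    elif strategy == "median":
        medians = filled[numeric_cols].median()
        filled[numeric_cols] = filled[numeric_cols].fillna(medians)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return filled

# In practice the frame might come from pd.read_excel("training_data.xlsx")
# (hypothetical file name); a small in-memory frame stands in for it here.
raw = pd.DataFrame({
    "date": ["2016-01", "2016-02", "2016-03"],
    "A": [10.0, None, 14.0],
    "B": [1.0, 2.0, None],
})
zero_filled = fill_missing(raw, "zero")
median_filled = fill_missing(raw, "median")
```

  • Which filling strategy is appropriate depends on the data; the median is less sensitive to extreme values than a fixed 0.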
  • Table 1 shows possible read sample data.
  • Table 1 contains a date column and five dimension columns. The five dimensions are Dimension A: deposits with interbank funds; Dimension B: domestic commercial banks; Dimension C: other domestic banking financial institutions; Dimension D: other domestic financial institutions; Dimension E: interest receivable. Table 1 contains 61 sets of sample data, from January 2016 to January 2021.
  • the goal of the embodiments of the present invention is to automatically discover the equal or approximately equal relationships between various dimensions in numerous sample data, and to try to improve the accuracy of the relationship equations through subsequent algorithms.
  • a corresponding target model is obtained to achieve the purpose of abnormal data detection based on the target model.
  • the obtained sample data are used as initial n sets of sample data for subsequent determination of the target model.
  • Another possible implementation is to divide the obtained sample data into a training set and a test set.
  • the training set is used as the initial n sets of sample data for subsequent determination of the target model, and the test set is used to test and verify the obtained target model to evaluate the accuracy of the target model.
  • the sample data can be divided according to a certain proportion.
  • the embodiment of the present invention does not limit the division ratio, such as 9:1, 8:2, etc.
  • The sample data can be sorted according to certain rules before being divided, or divided without sorting. There is no restriction on this, because whether the sample data is sorted does not affect the accuracy of the determined target model.
  • the following uses a detailed example to introduce the division of the obtained sample data into a training set and a test set.
  • For example, training set : test set = 9:1.
  • the first 90% of the sample data is used as the training set, and the last 10% of the sample data is used as the test set.
  • A total of 61 sets of sample data were collected. The first 55 sets (i.e., January 2016 to July 2020) are used as the training set to determine the target model; the last 6 sets (i.e., August 2020 to January 2021) are used as the test set to verify and test the target model and evaluate its accuracy.
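  • A minimal sketch of this 9:1 chronological split is shown below. The function name is hypothetical, and rounding the training size to the nearest integer is an assumption chosen because it reproduces the 55/6 split described here.

```python
def chronological_split(samples, train_ratio=0.9):
    """Split time-ordered samples into a leading training set and a trailing test set."""
    # Rounding (rather than truncating) is an assumption: 61 * 0.9 = 54.9 -> 55.
    n_train = round(len(samples) * train_ratio)
    return samples[:n_train], samples[n_train:]

months = list(range(61))  # stand-in for the 61 monthly groups of sample data
train, test = chronological_split(months)
```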
  • K-fold cross-validation can also be used to split the sample data. For example, suppose only 10 groups of sample data are obtained. If they are divided evenly into K parts (such as 5 parts) of 2 groups each, then during the target model determination stage 4 parts (8 groups) can be randomly selected as the training set and the remaining part (2 groups) used as the test set, with the training set used to obtain the regression coefficients of the target model. The random extraction is repeated multiple times to generate multiple sets of regression coefficients, which are weighted and averaged to obtain the final regression coefficients. This compensates for insufficient training caused by a small amount of sample data.
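  • The coefficient averaging described above might look like the following sketch. It is an illustration only: it holds out each fold once instead of drawing random splits repeatedly, it uses equal weights for the average, and the function names and synthetic data are assumptions.

```python
import numpy as np

def fit_ols(X, y):
    """Return least-squares coefficients [intercept, b1, ..., bk]."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def kfold_average_coefficients(X, y, k=5, seed=0):
    """Fit on each (k-1)-fold training split and average the coefficients."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    betas = []
    for held_out in range(k):
        train_idx = np.concatenate(
            [fold for j, fold in enumerate(folds) if j != held_out]
        )
        betas.append(fit_ols(X[train_idx], y[train_idx]))
    return np.mean(betas, axis=0)  # equal weights: an assumption

# Ten synthetic groups that follow y = 1 + 2*x1 + 3*x2 exactly
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]
beta_avg = kfold_average_coefficients(X, y)
```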
  • the following uses the first 55 sets of sample data (i.e., January 2016 to July 2020) as the training set to determine the target model as an example to introduce the method of determining the target model.
  • the first 55 sets of sample data are used as the training set, that is, as the initial n sets of sample data distributed in M dimensions.
  • the initial n sets of sample data are distributed in 5 dimensions.
  • In step 102, the sample data of the five dimensions may not all have a linear regression relationship; only some of the dimensions may. It is therefore necessary to determine, for each target dimension among the M dimensions, the independent variable dimensions that are related to it.
  • For example, for the five dimensions in Table 1, each dimension is taken in turn as a target dimension, and the corresponding independent variable dimensions are selected for it. Each data in the target dimension and each data in the independent variable dimensions are then substituted into the linear regression equation.
  • y is the data corresponding to the target dimension
  • x1, x2... are the data corresponding to the respective independent variable dimensions.
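  • The linear regression equation itself is not reproduced in this text. A common form consistent with the definitions of y and x1, x2, ... given here would be the following, where the intercept $\beta_0$, the regression coefficients $\beta_1, \dots, \beta_K$ and the error term $\varepsilon$ are assumed symbols that are not taken from the original:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + \varepsilon
```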
  • For example, when dimension A is the target dimension, dimension B, dimension C, dimension D and dimension E are the candidate independent variable dimensions.
  • the independent variable dimension corresponding to dimension A must be selected from these candidate independent variable dimensions.
  • A matrix of the initial n groups of sample data (55 × 5) is constructed, containing 55 groups of data in 5 dimensions. The column of the target dimension (dimension A) is moved to the last column of the matrix, and the correlation coefficient matrix r is calculated from this matrix.
  • the correlation coefficient matrix r is calculated through the covariance formula.
  • $X_i$ is the monthly data value of any candidate independent variable dimension, and $\bar{X}$ is the average of the 55 months of data in that candidate independent variable dimension
  • $Y_i$ is the monthly data value in the target dimension, and $\bar{Y}$ is the average of the 55 months of data in the target dimension.
  • the correlation coefficient between the target dimension and any candidate independent variable dimension can be obtained.
  • $Y_i$ is the data value of dimension A in each of the 55 months, and $\bar{Y}$ is the average of dimension A's data over the 55 months
  • $X_i$ is the data value of dimension B in each of the 55 months, and $\bar{X}$ is the average of dimension B's data over the 55 months.
  • the correlation coefficient of dimension A and dimension B can be obtained.
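  • The correlation formula is not reproduced in this text. The sketch below assumes it is the usual Pearson correlation coefficient (covariance divided by the product of the standard deviations) and checks a hand-written version against `numpy.corrcoef`. The data is synthetic and stands in for 55 months of two dimensions; it is not the data of Table 1.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Synthetic stand-in for 55 months of a candidate dimension and the target dimension
rng = np.random.default_rng(0)
dim_b = rng.normal(size=55)
dim_a = 2.0 * dim_b + rng.normal(scale=0.1, size=55)

r_ab = pearson(dim_b, dim_a)
# Full correlation matrix with the target dimension placed in the last column
r = np.corrcoef(np.column_stack([dim_b, dim_a]), rowvar=False)
```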
  • each correlation coefficient forms the following correlation coefficient matrix r.
  • the last column is the target dimension column, which is dimension A.
  • the correlation coefficient between dimension A and dimension B is 0.9976391; the correlation coefficient between dimension A and dimension C is -0.07923952, and the correlation coefficient between dimension A and dimension D is 0.63029953.
  • the correlation coefficient between dimension A and dimension E is 0.46870661. The closer the absolute value of the correlation coefficient is to 1, the more relevant the two are.
  • variance contribution value of each candidate independent variable dimension is calculated based on the correlation coefficient matrix r.
  • the formula for variance contribution is as follows.
  • columns is the total number of columns of matrix r.
  • In this example, columns = 5.
  • r(i,i) represents the value of the i-th row and i-th column in the correlation coefficient matrix.
  • the finally obtained matrix of variance contribution values of dimension B, dimension C, dimension D and dimension E to the target model with dimension A as the target dimension is [0.99528377 0.0062789 0.3972775 0.21968589].
  • The larger the variance contribution value of a dimension, the more beneficial that dimension is to the target model obtained for the target dimension.
  • the maximum variance contribution value is the variance contribution value corresponding to dimension B.
  • nos is n, and in is the number of candidate independent variable dimensions.
  • the F value of dimension B is 11184.801222455637.
  • the F value is converted into a distribution probability p value of 2.449050249153728e-63 according to the F distribution table.
  • Generally, a p value < 0.05 indicates that the independent variable is significant and can be introduced into the regression equation. Therefore, dimension B is first used as an independent variable dimension of the target model.
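  • The variance contribution and F value formulas are not reproduced in this text. The sketch below assumes the standard stepwise-regression forms, V_i = r(i, columns)² / r(i, i) and F = V (n - l - 2) / (r(y, y) - V) with l the number of variables already introduced. These assumed forms reproduce the variance contributions and the F value of dimension B reported above, but they are not confirmed by the original.

```python
import numpy as np

def variance_contributions(r):
    """V_i = r[i, last]^2 / r[i, i] for every candidate row i (target is the last column)."""
    last = r.shape[1] - 1
    return np.array([r[i, last] ** 2 / r[i, i] for i in range(last)])

def f_value(v, r_yy, n, entered=0):
    """F statistic for introducing one more variable (assumed stepwise-regression form)."""
    return v * (n - entered - 2) / (r_yy - v)

# Last column of r as reported for target dimension A (candidates B, C, D, E).
# Only that column and the unit diagonal matter for the first selection step,
# so the other off-diagonal entries are left at 0 as placeholders.
r = np.eye(5)
r[:4, 4] = r[4, :4] = [0.9976391, -0.07923952, 0.63029953, 0.46870661]

v = variance_contributions(r)
best = int(np.argmax(v))                 # index 0, i.e. dimension B
f_b = f_value(v[best], r[4, 4], n=55)    # close to the reported 11184.80
```

  • The p value would then follow from the survival function of the F distribution, for example `scipy.stats.f.sf(f_b, 1, 53)`, where the degrees of freedom (1, 53) are likewise an assumption.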
  • the transformed matrix r is:
  • the obtained independent variable dimensions are dimension B, dimension C and dimension D.
  • the number of independent variable dimensions corresponding to different target dimensions may be the same or different.
  • For example, when dimension A is the target dimension, there are 3 corresponding independent variable dimensions, namely dimension B, dimension C and dimension D; when dimension B is the target dimension, there are 2, namely dimension C and dimension D; when dimension C is the target dimension, there is 1, namely dimension D.
  • In step 103, the process of determining the target model corresponding to any target dimension is introduced.
  • In the following, the target dimension is dimension A, and the corresponding independent variable dimensions are dimension B, dimension C and dimension D.
  • Step 201: Obtain target n groups of sample data distributed in the target dimension and the K independent variable dimensions.
  • Step 202: Determine the influence of the candidate group of sample data on the accuracy of the target model based on the target n groups of sample data and the n-1 groups of sample data other than the candidate group of sample data; the candidate group of sample data is any group of sample data among the target n groups of sample data.
  • Step 203: Remove w groups of strong influence point data from the target n groups of sample data according to the influence degrees of the n candidate groups of sample data; 1 ≤ w < n.
  • Step 204: Determine the target model based on the retained sample data.
  • target n sets of sample data are determined among the initial n sets of sample data.
  • In the above example, the target dimension is dimension A and the corresponding independent variable dimensions are dimension B, dimension C and dimension D, so K = 3. Therefore, the determined target n groups of sample data are 55 groups of sample data distributed in four dimensions: dimension A, dimension B, dimension C and dimension D.
  • Figure 3 shows a schematic diagram of a possible fitting situation.
  • each point is evenly distributed around the fitted straight line, and the distance between the actual value of each point and the corresponding predicted value on the straight line is the smallest.
  • y is dimension A
  • x1 is dimension B
  • x2 is dimension C
  • x3 is dimension D.
  • The regression coefficients determine the slope of the straight line, so that the line fits the 55 sets of sample data as closely as possible, that is, minimizes the sum of the distances between all points and the line.
  • The sum of the distances can be defined by the residual sum of squares (RSS).
  • The regression coefficient β is also a matrix.
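  • A minimal sketch of such a least-squares fit is given below. It solves the normal equations for the coefficient vector and reports the RSS. The data is synthetic, the three columns merely stand in for dimensions B, C and D, and the intercept column is an assumption.

```python
import numpy as np

def ols_fit(X, y):
    """Solve the normal equations (X^T X) beta = X^T y, with an intercept column."""
    design = np.column_stack([np.ones(len(X)), X])
    beta = np.linalg.solve(design.T @ design, design.T @ y)
    residuals = y - design @ beta
    rss = float(residuals @ residuals)   # residual sum of squares
    return beta, rss

rng = np.random.default_rng(0)
X = rng.normal(size=(55, 3))             # stand-ins for dimensions B, C, D
y = 0.5 + X @ np.array([1.0, -2.0, 0.3]) + rng.normal(scale=0.01, size=55)
beta, rss = ols_fit(X, y)                # beta = [intercept, b1, b2, b3]
```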
  • The goodness of fit R² of this first fitting model is 0.999, and the p-value of each parameter is low, which means that the first fitting model obtained from these 55 sets of sample data fits well and reflects the regularities among these 55 sets of sample data well.
  • abnormal data may have appeared in the 55 sets of sample data.
  • the existence of abnormal data caused the obtained first fitting model to be inconsistent with business experience and historical data. There are many reasons for the occurrence of abnormal data. For example, errors occur during the collection or entry process, or there are errors and abnormalities in the sample data itself.
  • the leverage ratio of each group of sample data is analyzed.
  • The leverage ratio reflects the degree of influence of each group of sample data on the regression coefficients of the first fitting model. For multiple linear regression solved by the ordinary least squares (OLS) method, the leverage matrix follows from the normal equations for the coefficients: $H = X (X^{T} X)^{-1} X^{T}$.
  • the H matrix reflects the projection of the actual observed values of each set of sample data onto the predicted values, which is equivalent to converting the actual observed values into predicted values through the H matrix.
  • The leverage ratio of the i-th group of sample data is the i-th element on the diagonal of the H matrix. For the above example, the leverage statistics calculated for the 55 groups of sample data are shown in Table 2.
  • The leverage ratio values of the first two sets of sample data are 0.365953 and 0.375185 respectively, far greater than 2 times the average leverage statistic. The first two sets of sample data can therefore be judged to be relatively extreme. With such extreme data present, the first fitting model is highly likely not to conform to business experience and historical data.
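  • The leverage check described above can be sketched as follows, using the standard hat matrix of ordinary least squares. The data is synthetic with one deliberately extreme group, and the intercept column and function name are assumptions.

```python
import numpy as np

def leverage(X):
    """Diagonal of the hat matrix H = X (X^T X)^-1 X^T, with an intercept column."""
    design = np.column_stack([np.ones(len(X)), X])
    hat = design @ np.linalg.inv(design.T @ design) @ design.T
    return np.diag(hat)

rng = np.random.default_rng(0)
X = rng.normal(size=(55, 3))
X[0] = [8.0, -8.0, 8.0]                   # one deliberately extreme group
h = leverage(X)
flagged = np.where(h > 2 * h.mean())[0]   # groups above 2 times the average leverage
```

  • The leverage values sum to the number of fitted coefficients, so the average depends only on the model size and the number of groups.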
  • A method is provided that determines the strong influence point data in the target n groups of sample data and determines the relationship satisfied between the target dimension and the independent variable dimensions from the sample data remaining after the strong influence point data is removed, which is more accurate. See steps 202-204 for details.
  • In step 202, the influence of the candidate group of sample data on the accuracy of the target model is determined based on the target n groups of sample data and the n-1 groups of sample data other than the candidate group; the candidate group of sample data is any group among the target n groups of sample data.
  • For example, the 1st group of sample data is taken as the candidate group, and its influence on the accuracy of the target model is determined from the first fitting model, fitted to all 55 groups of sample data, and the second fitting model, fitted to the 54 groups other than the 1st group. The 2nd group is then taken as the candidate group, and its influence is determined from the first fitting model and the second fitting model fitted to the 54 groups other than the 2nd group. The 3rd group is handled in the same way, and so on, until the influence of each of the 55 groups of sample data on the accuracy of the target model has been obtained.
  • The influence degree of any candidate group of sample data is calculated as follows: fit the target n groups of sample data to obtain the first fitting coefficients of the first fitting model; fit the n-1 groups of sample data to obtain the second fitting coefficients of the second fitting model; determine the influence degree according to the first fitting coefficients, the second fitting coefficients, the number of independent variable dimensions included in the target model, and the mean square error of the first fitting model.
  • p is the number of independent variable dimensions included in the model
  • s is the mean square error of the first fitting model
  • It is the predicted value obtained by fitting the target n group of sample data
  • the i-th group of sample data here is the candidate group of sample data.
  • in the above example, p = 3.
  • s is calculated by the following formula:
  • n is the number of groups of sample data
  • The influence degree reflects the influence of each group of sample data on the accuracy of the target model. In principle, for a normal model, each group of sample data influences the model to a similar degree. The greater the influence degree, the greater the probability that the group of sample data is abnormal. Table 3 shows a possible influence degree for each group of sample data.
  • Table 3 shows the influence degree corresponding to the sample data of each candidate group obtained after eliminating the sample data of each candidate group.
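  • By comparing the model fitted to the target n groups of sample data with the model fitted to the n-1 groups that exclude the candidate group, the influence of the candidate group of sample data on the accuracy of the target model is determined more accurately. As a result, a more accurate target model can be obtained.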
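The leave-one-out comparison described in step 202 resembles Cook's distance from regression diagnostics. The following is a minimal sketch under that assumption, not the patent's own formula: `fit_ols`, `predict` and `influence_degrees` are hypothetical names, and the mean square error is assumed to be the residual sum of squares divided by n - p - 1.

```python
def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ..., bp] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    M = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
         + [sum(r[a] * t for r, t in zip(rows, y))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


def predict(beta, X):
    """Predicted values for every row of X."""
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in X]


def influence_degrees(X, y):
    """Cook's-distance-style influence of each group of sample data."""
    n, p = len(X), len(X[0])
    beta_full = fit_ols(X, y)              # first fitting model (n groups)
    yhat = predict(beta_full, X)
    # Assumption: mean square error s^2 = RSS / (n - p - 1)
    s2 = sum((a - b) ** 2 for a, b in zip(y, yhat)) / (n - p - 1)
    degrees = []
    for i in range(n):
        # second fitting model (n-1 groups, candidate group i left out)
        beta_i = fit_ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        yhat_i = predict(beta_i, X)
        shift = sum((a - b) ** 2 for a, b in zip(yhat, yhat_i))
        degrees.append(shift / (p * s2))
    return degrees
```

A group whose removal shifts the fitted values strongly receives a large influence degree, which is the property the text relies on when it removes strong influence point data.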
  • In step 203, w groups of strong influence point data are removed from the target n groups of sample data according to the influence degrees of the n candidate groups of sample data; 1 ≤ w < n.
  • Strong influence point data has a greater impact on the accuracy of the target model, so it should be eliminated.
  • the embodiment of the present invention does not limit the method of determining strong influence point data.
  • the threshold value of the strong influence point data is set to a larger value; if the accuracy requirements of the target model are not too high, the threshold of strong influence point data is set slightly lower.
  • Another possible way is to use the F distribution to determine strong influence point data. Specifically, for any candidate group of sample data, if its influence on the accuracy of the target model is greater than the first quartile of the F distribution with (p, n-p-1) degrees of freedom, the candidate group is determined to be strong influence point data, where p is the number of independent variable dimensions included in the target model; the strong influence point data is then removed from the target n groups of sample data.
  • the strong influence point data is determined through the F distribution with (3, 51) degrees of freedom.
  • The influence degree of each candidate group of sample data in Table 3 is compared with the first quartile of the F distribution with (3, 51) degrees of freedom. If it is greater than this value, the group is determined to be strong influence point data.
  • After the strong influence point data is determined, it is removed.
  • the first method provided by the embodiment of the present invention is used to determine strong influence point data, and the first group of sample data, the second group of sample data, and the 54th group of sample data are finally removed.
  • the sample data of the 1st and 2nd groups were indeed abnormal samples, but the 54th group of sample data was a sample that actually conformed to the model but the data fluctuated greatly.
  • Although the above method cannot eliminate only the abnormal sample data and may also eliminate a small number of non-abnormal samples such as the 54th group of sample data, eliminating a small number of non-abnormal samples has no substantial impact on the target model.
  • In step 204, the target model is determined based on the retained sample data.
  • the target model is determined based on the remaining 52 sets of sample data.
  • the obtained target model is consistent with business experience and historical data, and has business interpretability.
  • The above describes the process of determining the target model when the target dimension is dimension A.
  • When the target dimension is dimension B, dimension C, dimension D or dimension E, the corresponding target model can be determined according to the process of steps 201-204.
  • The dimensions included in different target models may be different. In this way, 5 corresponding target models are determined for the 5 dimensions.
  • the initial n sets of sample data are distributed in M dimensions, but not all of these M dimensions are necessarily related. Therefore, it is necessary to select the dimensions with relevant relationships. For any target dimension, selection is made based on the correlation coefficients of the initial n groups of sample data to obtain K independent variable dimensions with correlations corresponding to the target dimension. In this way, each dimension can be used as a target dimension, and each target dimension and its independent variable dimensions can correspondingly determine a target model. Taking into account richer scenarios and situations, the accuracy of the determined target model is increased.
  • w groups of strong influence point data are removed from the target n groups of sample data, and the target model is determined based on the retained sample data. In this way, the sample data that has a greater impact on the accuracy of the target model is eliminated, its impact on the final target model is minimized, and the accuracy of the target model is improved.
  • Abnormal data detection is performed on the data to be detected based on a more accurate target model, which improves the accuracy of detecting abnormal data.
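The selection of K independent variable dimensions by correlation coefficient, summarised above, can be sketched as follows. This is only an illustration: the function names are hypothetical, the Pearson coefficient is assumed as the correlation measure, and the cut-off `min_abs_corr=0.8` is an assumed value that the text does not specify.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant dimension carries no correlation information
    return cov / math.sqrt(vx * vy)


def select_independent_dimensions(samples, target, min_abs_corr=0.8):
    """Return the dimensions whose |correlation| with the target is high enough.

    samples: dict mapping dimension name -> list of values over the n groups.
    min_abs_corr: assumed cut-off, not taken from the source text.
    """
    return [
        dim for dim in samples
        if dim != target
        and abs(pearson(samples[target], samples[dim])) >= min_abs_corr
    ]
```

Calling the function once per dimension yields one set of independent variable dimensions for each target dimension, matching the idea that every dimension can serve as a target dimension.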
  • sample data in the test set are also used to test and verify each target model obtained above.
  • n groups of sample data were divided into training sets and test sets, and the training set was used as the initial n groups of sample data for subsequent determination of the target model.
  • the test set is used to test and validate the target model.
  • the test set can also be obtained through other means. For example, very accurate sample data provided by operation and maintenance personnel that have been determined to have no abnormal data can also be used as a test set.
  • the method further includes: inputting test data into the target model for testing to obtain the average absolute error rate of the target model; and determining that the goodness-of-fit parameter of the target model fitted to the retained sample data and the average absolute error rate each meet preset thresholds.
  • For example, after step 204, five target models corresponding to the five dimensions are obtained, and the goodness-of-fit parameter of each target model can also be obtained to represent how well the target model fits.
  • y is the actual value of the 6 test data
  • ŷ is the predicted value obtained based on the target model. In this way, the average absolute error rate of the first target model is obtained.
  • One possible way is to screen each target model based on its average absolute error rate and fitness parameters. For example, three of the target models are selected for subsequent detection of abnormal data on the data to be detected.
  • Another possible way is to score each target model according to the average absolute error rate and fitting degree parameters of the target model, and determine the absolute equation that meets the first preset condition and the approximate equation that meets the second preset condition.
  • different weights can be given to the scores of absolute equations and the scores of approximate equations. For example, a target model with an average absolute error rate less than 0.01 and a fit parameter greater than 0.999 is determined as an absolute equation; a target model with an average absolute error rate greater than or equal to 0.01 and less than 0.1 and a fit parameter greater than 0.9 is determined as an approximate equation.
  • The above is only one possible way.
  • The target models corresponding to the target dimensions are not used directly to detect abnormal data; they are first screened.
  • Test data is input into the target model to obtain its average absolute error rate. If the goodness-of-fit parameter and the average absolute error rate of the target model each meet the preset thresholds, the fitting accuracy of the target model is high and the model can be used for subsequent abnormal data detection, which improves the accuracy of abnormal data detection.
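The screening of target models can be sketched as below. The thresholds for absolute and approximate equations are the example values given in the text; everything else is an assumption: the average absolute error rate is taken to be the mean of |y - ŷ| / |y| over the test data, the fitting parameter is treated as a single number such as R², and the function names and the "rejected" label are hypothetical.

```python
def mean_absolute_error_rate(actual, predicted):
    """Mean of |y - y_hat| / |y| over the test data (actual values must be non-zero)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def classify_model(error_rate, fit_parameter):
    """Label a target model using the example thresholds from the text."""
    if error_rate < 0.01 and fit_parameter > 0.999:
        return "absolute"       # absolute equation
    if 0.01 <= error_rate < 0.1 and fit_parameter > 0.9:
        return "approximate"    # approximate equation
    return "rejected"           # hypothetical label: not used for detection
```

Only models labelled as absolute or approximate equations would then be kept for detecting abnormal data in the data to be detected.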
  • Figure 4 shows a detailed flow chart for determining the target model.
  • Step 401: Read sample data.
  • Step 402: Preprocess the sample data.
  • Step 403: Divide the sample data into a training set and a test set.
  • The training set is used as the initial n groups of sample data for determining the target model.
  • Step 404: For any target dimension, select K independent variable dimensions that are correlated with the target dimension from the M dimensions based on the correlation coefficients of the initial n groups of sample data.
  • Step 405: Obtain the target n groups of sample data distributed in the target dimension and the K independent variable dimensions.
  • Step 406: For any candidate group of sample data, determine the influence degree of the candidate group of sample data on the accuracy of the target model based on the target n groups of sample data and the n-1 groups of sample data other than the candidate group.
  • Step 407: Determine whether the influence degree is greater than 4/n. If it is greater, go to step 408; otherwise, go to step 409.
  • Step 408: Eliminate the candidate group of sample data.
  • Step 409: Retain the candidate group of sample data.
  • Step 410: Determine the target model corresponding to the target dimension based on the retained sample data.
  • Step 411: Use the test set data to evaluate each target model and obtain the absolute equations and approximate equations.
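Steps 407 to 409 of the flow above amount to a simple filter on the influence degrees. A minimal sketch, assuming the influence degrees have already been computed for each group of sample data; the function name `split_by_influence` is hypothetical.

```python
def split_by_influence(samples, influence_degrees):
    """Split sample groups into (retained, removed) using the 4/n rule of step 407."""
    n = len(samples)
    threshold = 4 / n
    retained, removed = [], []
    for sample, degree in zip(samples, influence_degrees):
        if degree > threshold:
            removed.append(sample)    # step 408: eliminate the candidate group
        else:
            retained.append(sample)   # step 409: retain the candidate group
    return retained, removed
```

The retained groups are what step 410 would use to fit the target model.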
  • Figure 5 shows a possible abnormal data detection method, including:
  • Step 501: Input W data to be detected distributed in W dimensions into each target model corresponding to each target dimension.
  • Step 502: For any target model, if it is determined that the W data to be detected do not satisfy the target model, determine each dimension included in the target model as an abnormal dimension.
  • Step 503: For any abnormal dimension, determine the abnormality probability with which the abnormal dimension is determined to be an abnormal dimension in each target model.
  • Step 504: Determine whether the data to be detected corresponding to the abnormal dimension is abnormal data according to the abnormality probability.
  • In step 501, W data to be detected distributed in W dimensions are input into each target model corresponding to each target dimension.
  • the embodiment of the present invention does not limit W dimensions.
  • the W dimensions include at least M dimensions that determine the target model.
  • the W dimensions are exactly the same as the M dimensions of the determined target model, and they are also distributed in the five dimensions of dimension A, dimension B, dimension C, dimension D and dimension E.
  • the W dimensions also include other dimensions.
  • the W dimensions may include some of the M dimensions.
  • If the remaining target models do not contain a certain dimension, the data to be detected does not need to include that dimension.
  • For example, among the five target models corresponding to the five dimensions determined previously, only three pass verification on the test set, so only these three target models are used to detect the data to be detected.
  • These three target models only include 4 dimensions: dimension A, dimension B, dimension C and dimension D.
  • The W dimensions in which the data to be detected is distributed can then be only dimension A, dimension B, dimension C and dimension D, that is, dimension E is not included.
  • the W data to be detected are input into each target model respectively, for example, into three target models.
  • the three target models are:
  • In step 502, for any target model, if it is determined that the W data to be detected do not satisfy the target model, each dimension included in the target model is determined as an abnormal dimension.
  • For target model 1, the data to be detected shown in Table 4 is input into target model 1. If it is determined that the average absolute error rate does not meet the preset threshold, then dimension A, dimension B, dimension C and dimension D included in target model 1 are determined to be abnormal dimensions.
  • For target model 2, the data to be detected shown in Table 4 is input into target model 2. If it is determined that the average absolute error rate meets the preset threshold, no operation is performed.
  • For target model 3, the data to be detected shown in Table 4 is input into target model 3. If it is determined that the average absolute error rate does not meet the preset threshold, then dimension A, dimension B and dimension C included in target model 3 are determined to be abnormal dimensions.
  • the dimensions identified as abnormal dimensions include dimension A, dimension B, dimension C and dimension D.
  • In step 503, for any abnormal dimension, the abnormality probability with which the abnormal dimension is determined to be an abnormal dimension in each target model is determined.
  • One possible way is to determine the ratio of the number of target models in which the abnormal dimension is determined to be abnormal to the number of target models in which the dimension appears, and to determine the abnormality probability of the abnormal dimension from this ratio.
  • Dimension B appears in three target models and is determined to be an abnormal dimension in two of them (target model 1 and target model 3). Therefore, the abnormality probability of dimension B is 2/3.
  • Dimension C appears in three target models and is determined to be an abnormal dimension in two of them (target model 1 and target model 3). Therefore, the abnormality probability of dimension C is 2/3.
  • Dimension D appears in two target models and is determined to be an abnormal dimension in one of them (target model 1). Therefore, the abnormality probability of dimension D is 1/2.
  • When absolute equations and approximate equations are given different weights, the counts are weighted accordingly. For example, target model 1 is an absolute equation, so when a dimension is determined to be an abnormal dimension in target model 1, that occurrence is multiplied by 1; target model 3 is an approximate equation, so when a dimension is determined to be an abnormal dimension in target model 3, that occurrence is multiplied by 0.8.
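Steps 502 and 503 can be sketched together as follows. This is an illustrative reading, not the patent's definitive procedure: the function name and the dictionary layout are hypothetical, and the weight is assumed to apply only to the count of models in which the dimension is flagged, not to the count of models in which it appears.

```python
def abnormality_probabilities(models):
    """Abnormality probability of each dimension flagged by at least one model.

    models: list of dicts with keys
      'dims'   - dimensions included in the target model
      'failed' - True if the data to be detected does not satisfy the model
      'weight' - e.g. 1 for an absolute equation, 0.8 for an approximate one
    """
    appearances, flagged = {}, {}
    for model in models:
        for dim in model["dims"]:
            appearances[dim] = appearances.get(dim, 0) + 1
            if model["failed"]:
                # Step 502: every dimension of a failing model is abnormal
                flagged[dim] = flagged.get(dim, 0.0) + model["weight"]
    # Step 503: (weighted) flagged count divided by number of appearances
    return {dim: flagged[dim] / appearances[dim] for dim in flagged}
```

With all weights set to 1, and assuming for illustration that target model 2 contains dimensions B, C and D, the function gives 2/3 for dimensions B and C and 1/2 for dimension D, as in the example above.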
  • In step 504, whether the data to be detected corresponding to the abnormal dimension is abnormal data is determined according to the abnormality probability.
  • A preset threshold is set, and the abnormality probability of any abnormal dimension is compared with it. If the abnormality probability is greater than the preset threshold, the data to be detected corresponding to the abnormal dimension is determined to be abnormal data; otherwise, it is determined not to be abnormal data.
  • The preset threshold can be set based on the experience and needs of those skilled in the art, and is not limited here. Alternatively, the data to be detected corresponding to the abnormal dimensions with the top N abnormality probabilities is determined to be abnormal data.
  • the data to be detected 100.5 corresponding to dimension A is determined as abnormal data.
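The two decision rules of step 504 can be sketched as below. The function names are hypothetical, and the probability shown for dimension A in the usage note is an assumed value, since the text does not state it.

```python
def flag_by_threshold(probabilities, threshold):
    """Dimensions whose abnormality probability is greater than the preset threshold."""
    return sorted(dim for dim, prob in probabilities.items() if prob > threshold)


def flag_top_n(probabilities, n):
    """Dimensions with the top N abnormality probabilities (ties keep input order)."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [dim for dim, _ in ranked[:n]]
```

For instance, with the assumed probabilities `{"A": 1.0, "B": 2/3, "C": 2/3, "D": 0.5}`, selecting the top 1 returns only dimension A, which is consistent with the example in which the data to be detected of dimension A is judged abnormal.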
  • In this way, the abnormal dimensions are determined first. The abnormality probability with which each abnormal dimension is determined to be abnormal across the target models is then determined, and whether the data to be detected corresponding to the abnormal dimension is abnormal data is decided from that probability. This not only detects the presence of abnormal data among the W data to be detected, but also accurately locates which dimension of the W data to be detected is abnormal. Automatic and accurate locating of abnormal data is thus achieved, eliminating the need for manual review.
  • The embodiment of the present invention also provides another abnormal data detection method. After each dimension included in the target model is determined to be an abnormal dimension, the method further includes: for any abnormal dimension, obtaining the historical data corresponding to the abnormal dimension; and determining the anomaly score of the data to be detected corresponding to the abnormal dimension by clustering that data together with the historical data. Determining, according to the abnormality probability, whether the data to be detected corresponding to the abnormal dimension is abnormal data then includes: determining whether the data to be detected corresponding to the abnormal dimension is abnormal data based on the abnormality probability and the anomaly score.
  • the above method is for any anomaly dimension.
  • the historical data of dimension A is obtained at the same time.
  • the data to be detected and the historical data of dimension A are clustered to obtain the anomaly score of dimension A.
  • By obtaining the historical data of the abnormal dimension and clustering the data to be detected corresponding to the abnormal dimension together with the historical data, the anomaly score of the data to be detected is obtained. The abnormality probability and the anomaly score are then combined to determine whether the data to be detected corresponding to the abnormal dimension is abnormal data.
  • the combination of the two judgment methods not only takes into account the probability that the abnormal dimension is determined to be an abnormal dimension, but also takes into account the historical data of the abnormal dimension, increasing the accuracy of determining abnormal data.
  • the embodiment of the present invention does not specifically limit the method of clustering to obtain anomaly scores.
  • One possible way is to use k-means for clustering, and to take the distance between the data to be detected in any abnormal dimension and the historical data as the anomaly score. For example, if the distance between the data to be detected in dimension A and the historical data is relatively long, the similarity is small and the anomaly score is large.
  • the method includes: constructing an isolated binary tree for the data to be detected corresponding to the abnormal dimension and each historical data; and calculating the anomaly score of the data to be detected corresponding to the abnormal dimension in the isolated binary tree.
  • the historical data obtained for dimension A are: 19.49, 20.23, 25.34, 49.12, 36.66.
  • The data to be detected and the historical data are randomly cut, and this is repeated N times (for example, 100). Each cut can produce an independent leaf node. New leaf nodes are cut off in this way until the tree reaches the specified height or cannot be cut any further, at which point the algorithm ends.
  • the specific steps to construct an isolated binary tree are as follows: (1) First, randomly select a split point between the minimum and maximum values (19.49 and 100.5) of all sample data, assuming that the random value is 60.2. (2) Put the data nodes in the sample that are greater than the split point value 60.2 on the right branch of the tree, and the data nodes that are less than or equal to 60.2 are placed on the left branch of the tree. (3) Repeat steps (1) and (2) on the basis of branches until all data nodes are randomly divided to form isolated leaf nodes or the tree reaches the specified height.
  • the first isolated tree is randomly constructed according to the above steps, as shown in Figure 6, and the five random split points are (60.2, 34, 42.2, 22.5, 20).
  • T.size represents the number of samples at the same leaf node as the sample x
  • C(T.size) can be regarded as a correction value, indicating the average path length of a binary tree constructed from T.size samples.
  • n is the number of samples
  • E(h(x)) is the average PathLength of the sample over the 100 isolated trees
  • c(n) is the average path length of a tree built from n samples.
  • in the above example, the number of samples n is 6
  • the c(6) result is calculated according to the formula for c(n) above.
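The path length and anomaly score can be illustrated with the first isolated tree of the example. The sketch below assumes the standard isolation forest definitions, which may differ in detail from the formulas of this document: c(n) = 2H(n-1) - 2(n-1)/n with H(i) ≈ ln(i) + 0.5772156649, and score = 2^(-E(h(x))/c(n)). The function names are hypothetical, and the fixed split points replace the random choice so that the example tree is reproduced.

```python
import math

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of a binary tree built from n samples (standard form)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def path_length(x, samples, splits):
    """Path length of sample x in one isolated tree defined by fixed split points.

    At each node the first split point lying inside the node's value range is
    used; values greater than it go right, the others go left.
    """
    node, depth = list(samples), 0
    while len(node) > 1:
        low, high = min(node), max(node)
        split = next((s for s in splits if low <= s < high), None)
        if split is None:
            break  # no usable split left: stop and apply the correction term
        node = [v for v in node if (v > split) == (x > split)]
        depth += 1
    return depth + c(len(node))


def anomaly_score(mean_path_length, n):
    """Score close to 1 suggests an anomaly; well below 0.5 suggests normal data."""
    return 2.0 ** (-mean_path_length / c(n))


history = [19.49, 20.23, 25.34, 49.12, 36.66]
data = history + [100.5]                      # 100.5 is the data to be detected
first_tree_splits = [60.2, 34, 42.2, 22.5, 20]
h = path_length(100.5, data, first_tree_splits)
score = anomaly_score(h, len(data))           # single-tree illustration only
```

In this tree the value 100.5 is isolated by the very first split, so its path length is much shorter than that of the historical values; in a full implementation the path length would be averaged over all N trees before the score is computed.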
  • Whether the data to be detected corresponding to the abnormal dimension is abnormal data is determined based on the abnormality probability and the anomaly score corresponding to the abnormal dimension.
  • FIG. 7 exemplarily shows the structure of an abnormal data detection device provided by an embodiment of the present invention, which can perform the process of abnormal data detection.
  • the device specifically includes:
  • Processing unit 701 used for:
  • W data to be detected distributed in W dimensions are input into each target model corresponding to each target dimension; the target model corresponding to any target dimension is obtained from sample data from which the strong influence point data has been removed; the strong influence point data refers to sample data whose impact on the accuracy of the target model does not meet the preset conditions;
  • for any target model, if it is determined that the W data to be detected do not satisfy the target model, each dimension included in the target model is determined as an abnormal dimension;
  • processing unit 701 is also used to:
  • the processing unit 701 is specifically used for:
  • processing unit 701 is specifically used to:
  • processing unit 701 is specifically used to:
  • for any target dimension among the M dimensions, select K independent variable dimensions that are correlated with the target dimension from the M dimensions according to the correlation coefficients of the initial n groups of sample data; the target dimension is any of the M dimensions;
  • the target model is determined based on the target n groups of sample data distributed in the target dimension and the K independent variable dimensions; the target model is used to characterize the relationship satisfied between the target dimension and the K independent variable dimensions.
  • processing unit 701 is specifically used to:
  • for any candidate group of sample data, the influence degree of the candidate group of sample data on the accuracy of the target model is determined according to the target n groups of sample data and the n-1 groups of sample data other than the candidate group; the candidate group of sample data is any group among the target n groups of sample data;
  • w groups of strong influence point data are removed from the target n groups of sample data; 1 ≤ w < n;
  • the target model is determined based on the retained sample data.
  • processing unit 701 is specifically used to:
  • the degree of influence is determined based on the first fitting coefficient, the second fitting coefficient, the number of independent variable dimensions included in the target model, and the mean square error of the first fitting model.
  • processing unit 701 is specifically used to:
  • processing unit 701 is also used to:
  • the embodiment of the present application provides a computer device, as shown in Figure 8, including at least one processor 801, and a memory 802 connected to the at least one processor.
  • The embodiment of the present application does not limit the specific connection medium between the processor 801 and the memory 802; in Figure 8, the connection between the processor 801 and the memory 802 through a bus is taken as an example.
  • the bus can be divided into address bus, data bus, control bus, etc.
  • the memory 802 stores instructions that can be executed by at least one processor 801. At least one processor 801 can execute the steps of the above abnormal data detection method by executing the instructions stored in the memory 802.
  • The processor 801 is the control center of the computer device. It can use various interfaces and lines to connect the various parts of the computer device, and performs abnormal data detection by running or executing the instructions stored in the memory 802 and calling the data stored in the memory 802.
  • the processor 801 may include one or more processing units, and the processor 801 may integrate an application processor and a modem processor, where the application processor mainly processes the operating system, user interface, application programs, etc., The modem processor primarily handles wireless communications. It can be understood that the above modem processor may not be integrated into the processor 801.
  • the processor 801 and the memory 802 can be implemented on the same chip, and in some embodiments, they can also be implemented on separate chips.
  • The processor 801 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of this application.
  • a general-purpose processor may be a microprocessor or any conventional processor, etc. The steps of the methods disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware processor for execution, or can be executed by a combination of hardware and software modules in the processor.
  • the memory 802 can be used to store non-volatile software programs, non-volatile computer executable programs and modules.
  • The memory 802 may include at least one type of storage medium, for example, flash memory, hard disk, multimedia card, card-type memory, random access memory (Random Access Memory, RAM), static random access memory (Static Random Access Memory, SRAM), programmable read-only memory (Programmable Read Only Memory, PROM), read-only memory (Read Only Memory, ROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), magnetic memory, magnetic disk, optical disc, etc.
  • Memory 802 is, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • the memory 802 in the embodiment of the present application can also be a circuit or any other device capable of realizing a storage function, used to store program instructions and/or data.
  • embodiments of the present invention also provide a computer-readable storage medium.
  • The computer-readable storage medium stores a computer-executable program, which is used to cause a computer to execute the abnormal data detection method listed in any of the above manners.
  • embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory that causes a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Abstract

Embodiments of the present invention relate to an anomalous data detection method and apparatus. The method comprises: inputting W pieces of data to be detected that are distributed in W dimensions into target models corresponding to target dimensions, wherein a target model corresponding to any target dimension is obtained by means of sample data from which strong influence point data is removed; for any target model, if it is determined that the W pieces of data to be detected do not meet the target model, determining all dimensions included in the target model as anomalous dimensions; for any anomalous dimension, determining an anomalous probability that the anomalous dimension is determined as an anomalous dimension in each target model; and according to the anomalous probability, determining whether the data to be detected corresponding to the anomalous dimension is anomalous data. The presence of anomalous data among the W pieces of data to be detected can be detected, and which data of which dimension is anomalous data among the W pieces of data to be detected can also be accurately positioned. Therefore, automatic and accurate positioning of anomalous data is achieved, without the need of manual re-check.

Description

An abnormal data detection method and device
Cross-reference to related applications
This application claims priority to the Chinese patent application filed with the China Patent Office on August 18, 2022, with application number 202210992301.1 and entitled "An abnormal data detection method and device", the entire content of which is incorporated herein by reference.
Technical field
Embodiments of the present invention relate to the field of computer technology, and in particular to an abnormal data detection method and device, a computing device, and a computer-readable storage medium.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually transforming into financial technology (Fintech). However, the security and real-time requirements of the financial industry also place higher demands on these technologies.
With the development of the Internet finance industry and the continuing improvement of computer technology, the volume of data of different dimensions generated by financial systems per unit time keeps growing, and the number of dimensions can reach hundreds or thousands. Such data inevitably contains abnormal data, which can arise for many reasons, such as manual entry mistakes or errors in computer processing and calculation. Abnormal data has a significant impact on subsequent steps such as statistical processing, so it needs to be detected.
Abnormal data detection methods currently in use have low detection accuracy. After abnormal data is detected, it still has to be manually re-checked and confirmed, which incurs high labor and time costs.
In summary, an abnormal data detection method is needed to improve the accuracy of abnormal data detection.
Summary of the invention
Embodiments of the present invention provide an abnormal data detection method to improve the accuracy of abnormal data detection.
In a first aspect, an embodiment of the present invention provides an abnormal data detection method, including:
Input W pieces of data to be detected, distributed over W dimensions, into the target models corresponding to the respective target dimensions; the target model corresponding to any target dimension is obtained from sample data from which strong influence point data has been removed; the strong influence point data refers to sample data whose degree of influence on the accuracy of the target model does not satisfy a preset condition;
For any target model, if it is determined that the W pieces of data to be detected do not satisfy the target model, determine each dimension included in the target model as an abnormal dimension;
For any abnormal dimension, determine the abnormality probability with which the abnormal dimension is determined to be an abnormal dimension across the target models;
Determine, according to the abnormality probability, whether the data to be detected corresponding to the abnormal dimension is abnormal data.
By inputting the W pieces of data to be detected, distributed over W dimensions, into the target models corresponding to the respective target dimensions, the abnormal dimensions are determined. The abnormality probability with which each abnormal dimension is determined to be abnormal across the target models is then determined, and whether the data to be detected corresponding to that dimension is abnormal data is decided according to this probability. This not only detects that abnormal data exists among the W pieces of data to be detected, but also accurately locates which dimension's data is abnormal. Abnormal data is thus located automatically and accurately, without manual re-checking.
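The localization logic described above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiments: it assumes that the abnormality probability of a dimension is the fraction of the target models containing that dimension that the data failed to satisfy, and the model names and dimension sets are hypothetical.

```python
from collections import Counter

def abnormality_probabilities(models, violated):
    """Return, per flagged dimension, the share of its models that were violated.

    models:   dict mapping a model name to the set of dimensions it contains
    violated: set of model names that the data to be detected did not satisfy
    """
    contains = Counter()  # number of models that include each dimension
    flagged = Counter()   # number of those models that were violated
    for name, dims in models.items():
        for dim in dims:
            contains[dim] += 1
            if name in violated:
                flagged[dim] += 1
    # Assumed definition: violated models containing the dimension / all models containing it
    return {dim: flagged[dim] / contains[dim] for dim in flagged}

# Hypothetical target models; suppose the value in dimension "A" is wrong,
# so every model that contains "A" is violated.
models = {
    "model_A": {"A", "B", "C"},
    "model_B": {"B", "D"},
    "model_C": {"C", "D"},
    "model_D": {"A", "D"},
}
violated = {"model_A", "model_D"}
probs = abnormality_probabilities(models, violated)
most_suspect = max(probs, key=probs.get)
```

Under this assumed definition, dimensions that are flagged only because they share a model with the faulty dimension receive a lower probability than the faulty dimension itself.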
In some embodiments, after each dimension included in the target model is determined as an abnormal dimension, the method further includes:
For any abnormal dimension, obtain the historical data corresponding to the abnormal dimension;
Determine an anomaly score of the data to be detected corresponding to the abnormal dimension by clustering that data together with the historical data;
Determining, according to the abnormality probability, whether the data to be detected corresponding to the abnormal dimension is abnormal data includes:
Determine, according to the abnormality probability and the anomaly score, whether the data to be detected corresponding to the abnormal dimension is abnormal data.
The historical data of the abnormal dimension is obtained and clustered together with the data to be detected corresponding to that dimension, which yields the anomaly score of the data to be detected. The abnormality probability and the anomaly score are then combined to decide whether the data to be detected corresponding to the abnormal dimension is abnormal data. Combining the two criteria takes into account both the probability that the dimension is determined to be abnormal and the historical data of that dimension, which increases the accuracy of determining abnormal data.
In some embodiments, determining the anomaly score of the data to be detected corresponding to the abnormal dimension by clustering that data together with the historical data includes:
Construct an isolation binary tree from the data to be detected corresponding to the abnormal dimension and the historical data;
Calculate the anomaly score, in the isolation binary tree, of the data to be detected corresponding to the abnormal dimension.
Constructing an isolation binary tree from the data to be detected corresponding to the abnormal dimension and the historical data improves the accuracy of determining abnormal data.
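The isolation-tree scoring step can be illustrated with a small one-dimensional sketch. It is an assumption-laden illustration, not the scoring used by the embodiments: it averages the isolation depth over several random trees and uses the conventional isolation forest normalization, and the function names, tree count, depth limit, and data values are all hypothetical.

```python
import math
import random

def average_path_length(n):
    """Expected path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def path_length(value, data, rng, depth=0, max_depth=12):
    """Depth at which `value` is isolated by random splits of the 1-D `data`."""
    if len(data) <= 1 or depth >= max_depth or min(data) == max(data):
        # Unsplit points are credited with the average depth of a subtree of that size
        return depth + average_path_length(len(data))
    split = rng.uniform(min(data), max(data))
    # Keep only the points that fall on the same side of the split as `value`
    same_side = [x for x in data if (x < split) == (value < split)]
    return path_length(value, same_side, rng, depth + 1, max_depth)

def anomaly_score(value, history, n_trees=200, seed=0):
    """Score in (0, 1]; values near 1 are isolated quickly and are more suspicious."""
    rng = random.Random(seed)
    data = list(history) + [value]
    mean_depth = sum(path_length(value, data, rng) for _ in range(n_trees)) / n_trees
    return 2.0 ** (-mean_depth / average_path_length(len(data)))

# Hypothetical historical values of one abnormal dimension
history = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99, 102, 98]
score_outlier = anomaly_score(250, history)
score_typical = anomaly_score(100, history)
```

A value far from the historical range is separated after very few random splits, so its score is higher than that of a value lying inside the historical range.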
In some embodiments, the target model corresponding to any target dimension is determined as follows:
Obtain initial n groups of sample data distributed over M dimensions, where each group of sample data has M dimensions;
For a target dimension among the M dimensions, select from the M dimensions, according to the correlation coefficients of the initial n groups of sample data, K independent variable dimensions that are correlated with the target dimension; the target dimension is any one of the M dimensions;
Determine the target model according to target n groups of sample data distributed over the target dimension and the K independent variable dimensions; the target model characterizes the relationship that holds between the target dimension and the K independent variable dimensions.
The initial n groups of sample data are distributed over M dimensions, but these M dimensions are not necessarily all correlated with one another, so the correlated dimensions need to be selected. For any target dimension, a selection is made according to the correlation coefficients of the initial n groups of sample data to obtain the K independent variable dimensions correlated with that target dimension. In this way, every dimension can serve as a target dimension, and each target dimension together with its independent variable dimensions determines a corresponding target model. This covers a richer range of scenarios and situations and increases the accuracy of the determined target models.
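The selection of independent variable dimensions can be sketched as below. The sketch assumes the Pearson correlation coefficient and a hypothetical cut-off of 0.8 on its absolute value; the text above states only that correlation coefficients are used, and the sample values are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant column has no defined correlation; treat it as uncorrelated
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0

def select_independent_dimensions(samples, target, threshold=0.8):
    """Return the dimensions whose |correlation| with `target` reaches the threshold."""
    return [
        dim for dim in samples
        if dim != target and abs(pearson(samples[dim], samples[target])) >= threshold
    ]

# Hypothetical sample data: one list of n values per dimension
samples = {
    "A": [10, 12, 14, 16, 18, 20],
    "B": [5, 6, 7, 8, 9, 10],        # increases linearly with A
    "C": [3, 1, 4, 1, 5, 9],         # only weakly related to A
    "D": [20, 18, 15, 13, 12, 9],    # decreases as A increases
}
selected = select_independent_dimensions(samples, "A")
```

Using the absolute value keeps strongly negative relationships, which are as useful for a linear model as positive ones.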
In some embodiments, determining the target model according to the target n groups of sample data distributed over the target dimension and the K independent variable dimensions includes:
Obtain the target n groups of sample data distributed over the target dimension and the K independent variable dimensions;
Determine the degree of influence of a candidate group of sample data on the accuracy of the target model according to the target n groups of sample data and the n-1 groups of sample data other than the candidate group; the candidate group of sample data is any one of the target n groups of sample data;
According to the degrees of influence of the n candidate groups of sample data, remove w groups of strong influence point data from the target n groups of sample data, where 1≤w<n;
Determine the target model according to the retained sample data.
The degree of influence of each candidate group of sample data on the accuracy of the target model is determined from the target n groups of sample data and the n-1 groups other than the candidate group; w groups of strong influence point data are then removed from the target n groups according to these degrees of influence, and the target model is determined from the retained sample data. Sample data that strongly affects the accuracy of the target model is thereby eliminated, its effect on the final target model is minimized, and the accuracy of the target model is improved. Detecting abnormal data with a more accurate target model in turn improves the accuracy of abnormal data detection.
In some embodiments, determining the degree of influence of the candidate group of sample data on the accuracy of the target model according to the target n groups of sample data and the n-1 groups of sample data other than the candidate group includes:
Fit the target n groups of sample data to obtain the first fitting coefficients of a first fitted model;
Fit the n-1 groups of sample data other than the candidate group to obtain the second fitting coefficients of a second fitted model;
Determine the degree of influence according to the first fitting coefficients, the second fitting coefficients, the number of independent variable dimensions included in the target model, and the mean squared error of the first fitted model.
The degree of influence of the candidate group of sample data on the accuracy of the target model is determined from the first fitting coefficients, obtained by fitting with the candidate group included, and the second fitting coefficients, obtained by fitting with the candidate group excluded. This improves the accuracy of the determined degree of influence and thus yields a more accurate target model.
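The four quantities listed above are the ingredients of Cook's distance. Assuming that is the influence measure intended (the text here does not name it), one common form for candidate group i is the following, where the symbols other than p are notation introduced for this illustration:

```latex
D_i = \frac{\left(\hat{\beta} - \hat{\beta}_{(i)}\right)^{\top} X^{\top} X \left(\hat{\beta} - \hat{\beta}_{(i)}\right)}{p \cdot \mathrm{MSE}}
```

Here \hat{\beta} denotes the first fitting coefficients (all n groups), \hat{\beta}_{(i)} the second fitting coefficients (the n-1 groups without candidate group i), X the matrix of independent variable values of the target n groups, p the number of independent variable dimensions, and MSE the mean squared error of the first fitted model. Many textbooks place the number of estimated coefficients, including the intercept, in the denominator instead of p.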
In some embodiments, removing w groups of strong influence point data from the target n groups of sample data according to the degrees of influence of the n candidate groups of sample data includes:
For any candidate group of sample data, if the degree of influence of the candidate group on the accuracy of the target model is greater than the first quartile of the F distribution with (p, n-p-1) degrees of freedom, determine that the candidate group of sample data is strong influence point data, where p is the number of independent variable dimensions included in the target model;
Remove the strong influence point data from the target n groups of sample data.
Using the first quartile of the F distribution with (p, n-p-1) degrees of freedom as the threshold for the degree of influence is more principled and improves the accuracy of identifying strong influence point data, which in turn yields a more accurate target model.
In some embodiments, after the target model is determined according to the retained sample data, the method further includes:
Input test data into the target model for testing to obtain the mean absolute error rate of the target model;
Determine that the goodness-of-fit parameter with which the target model fits the retained sample data and the mean absolute error rate each satisfy a preset threshold.
The target models corresponding to the target dimensions are not all used directly to detect abnormal data; instead, the target models are screened once more. Test data is input into each target model to obtain the mean absolute error rate of that model on the test data. If the goodness-of-fit parameter of the target model and its mean absolute error rate each satisfy the preset thresholds, the target model fits accurately and can be used for subsequent abnormal data detection. This improves the accuracy of abnormal data detection.
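The screening step can be sketched as follows. The sketch assumes that the goodness-of-fit parameter is the coefficient of determination R² and that the mean absolute error rate is the mean absolute percentage error; the threshold values and the sample figures are hypothetical.

```python
def mean_absolute_error_rate(actual, predicted):
    """Mean of |actual - predicted| / |actual| (requires non-zero actual values)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, fitted):
    """Coefficient of determination of fitted values against actual values."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def model_is_usable(r2, error_rate, min_r2=0.9, max_error_rate=0.05):
    """Keep a target model only if both metrics meet their (assumed) thresholds."""
    return r2 >= min_r2 and error_rate <= max_error_rate

# Hypothetical retained sample data (fit quality) and test data (error rate)
train_actual, train_fitted = [100, 110, 120, 130], [101, 109, 121, 129]
test_actual, test_predicted = [140, 150], [138, 153]

r2 = r_squared(train_actual, train_fitted)
error_rate = mean_absolute_error_rate(test_actual, test_predicted)
usable = model_is_usable(r2, error_rate)
```

Measuring the fit on the retained sample data and the error rate on separate test data mirrors the two checks described above.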
In a second aspect, an embodiment of the present invention further provides an abnormal data detection device, including:
A processing unit, configured to:
Input W pieces of data to be detected, distributed over W dimensions, into the target models corresponding to the respective target dimensions; the target model corresponding to any target dimension is obtained from sample data from which strong influence point data has been removed; the strong influence point data refers to sample data whose degree of influence on the accuracy of the target model does not satisfy a preset condition;
For any target model, if it is determined that the W pieces of data to be detected do not satisfy the target model, determine each dimension included in the target model as an abnormal dimension;
For any abnormal dimension, determine the abnormality probability with which the abnormal dimension is determined to be an abnormal dimension across the target models;
Determine, according to the abnormality probability, whether the data to be detected corresponding to the abnormal dimension is abnormal data.
In some embodiments, the processing unit is further configured to:
For any abnormal dimension, obtain the historical data corresponding to the abnormal dimension;
Determine an anomaly score of the data to be detected corresponding to the abnormal dimension by clustering that data together with the historical data;
The processing unit is specifically configured to:
Determine, according to the abnormality probability and the anomaly score, whether the data to be detected corresponding to the abnormal dimension is abnormal data.
In some embodiments, the processing unit is specifically configured to:
Construct an isolation binary tree from the data to be detected corresponding to the abnormal dimension and the historical data;
Calculate the anomaly score, in the isolation binary tree, of the data to be detected corresponding to the abnormal dimension.
In some embodiments, the processing unit is specifically configured to:
Obtain initial n groups of sample data distributed over M dimensions, where each group of sample data has M dimensions;
For a target dimension among the M dimensions, select from the M dimensions, according to the correlation coefficients of the initial n groups of sample data, K independent variable dimensions that are correlated with the target dimension; the target dimension is any one of the M dimensions;
Determine the target model according to target n groups of sample data distributed over the target dimension and the K independent variable dimensions; the target model characterizes the relationship that holds between the target dimension and the K independent variable dimensions.
In some embodiments, the processing unit is specifically configured to:
Obtain the target n groups of sample data distributed over the target dimension and the K independent variable dimensions;
Determine the degree of influence of a candidate group of sample data on the accuracy of the target model according to the target n groups of sample data and the n-1 groups of sample data other than the candidate group; the candidate group of sample data is any one of the target n groups of sample data;
According to the degrees of influence of the n candidate groups of sample data, remove w groups of strong influence point data from the target n groups of sample data, where 1≤w<n;
Determine the target model according to the retained sample data.
In some embodiments, the processing unit is specifically configured to:
Fit the target n groups of sample data to obtain the first fitting coefficients of a first fitted model;
Fit the n-1 groups of sample data other than the candidate group to obtain the second fitting coefficients of a second fitted model;
Determine the degree of influence according to the first fitting coefficients, the second fitting coefficients, the number of independent variable dimensions included in the target model, and the mean squared error of the first fitted model.
In some embodiments, the processing unit is specifically configured to:
For any candidate group of sample data, if the degree of influence of the candidate group on the accuracy of the target model is greater than the first quartile of the F distribution with (p, n-p-1) degrees of freedom, determine that the candidate group of sample data is strong influence point data, where p is the number of independent variable dimensions included in the target model;
Remove the strong influence point data from the target n groups of sample data.
In some embodiments, the processing unit is further configured to:
Input test data into the target model for testing to obtain the mean absolute error rate of the target model;
Determine that the goodness-of-fit parameter with which the target model fits the retained sample data and the mean absolute error rate each satisfy a preset threshold.
In a third aspect, an embodiment of the present invention further provides a computing device, including:
A memory, configured to store a computer program;
A processor, configured to call the computer program stored in the memory and, according to the obtained program, execute the abnormal data detection method set out in any of the above implementations.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer-executable program, the computer-executable program being used to cause a computer to execute the abnormal data detection method set out in any of the above implementations.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic diagram of a method for determining a target model based on n groups of sample data provided by an embodiment of the present invention;
Figure 2 is a schematic flowchart of a method for determining a target model provided by an embodiment of the present invention;
Figure 3 is a schematic diagram of a fitted straight line obtained by least squares fitting provided by an embodiment of the present invention;
Figure 4 is a schematic diagram of a detailed process for determining a target model provided by an embodiment of the present invention;
Figure 5 is a schematic diagram of a possible abnormal data detection method provided by an embodiment of the present invention;
Figure 6 is a schematic diagram of a constructed isolation binary tree provided by an embodiment of the present invention;
Figure 7 is a schematic structural diagram of an abnormal data detection device provided by an embodiment of the present invention;
Figure 8 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
Detailed description
To make the purpose, implementations, and advantages of this application clearer, the exemplary implementations of this application are described clearly and completely below with reference to the drawings of the exemplary embodiments. Obviously, the described exemplary embodiments are only some of the embodiments of this application, not all of them.
All other embodiments obtained by those of ordinary skill in the art without creative effort, based on the exemplary embodiments described in this application, fall within the scope of protection of the claims appended to this application. In addition, although the disclosure in this application is presented in terms of one or more exemplary examples, it should be understood that each aspect of the disclosure may also separately constitute a complete implementation.
It should be noted that the brief explanations of terms in this application are intended only to facilitate understanding of the implementations described below, not to limit the implementations of this application. Unless otherwise stated, these terms should be understood according to their ordinary and usual meaning.
The terms "first", "second", "third", etc. in the description, claims, and above drawings of this application are used to distinguish similar or like objects or entities and do not necessarily imply a specific order or sequence, unless otherwise indicated. It should be understood that terms so used are interchangeable under appropriate circumstances, so that, for example, the embodiments can be implemented in an order other than that illustrated or described herein.
In addition, the terms "including" and "having" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a product or device that includes a series of components is not necessarily limited to the components explicitly listed, but may include other components that are not explicitly listed or that are inherent to the product or device.
To better explain this application, the technologies and terms involved are first explained as follows.
1. Multiple linear regression: In regression analysis, a regression with two or more independent variables is called multiple regression. In practice, a phenomenon is often associated with several factors, and predicting or estimating the dependent variable from the optimal combination of several independent variables is more effective and more realistic than using a single independent variable. Multiple linear regression therefore has greater practical significance than simple linear regression.
2. Least squares (Ordinary Least Squares, OLS): A mathematical optimization modeling method that finds the best-fitting function for the data by minimizing the sum of squared errors. With least squares, unknown values can be obtained easily such that the sum of squared errors between the obtained values and the actual data is minimized.
3. Degrees of freedom: When a sample statistic is used to estimate a population parameter, the number of values in the sample that are independent or free to vary is called the degrees of freedom of that statistic. In general, the degrees of freedom equal the number of independent variables minus the number of quantities derived from them. For example, the variance is defined from the sum of the squares of each sample minus the mean (a derived quantity determined by the samples), so for N random samples it has N-1 degrees of freedom.
4. Decision tree: A predictive model that represents a mapping between object attributes and object values. Each node in the tree represents an object, each branch represents a possible attribute value, and each leaf node corresponds to the value of the object represented by the path from the root node to that leaf. A decision tree has a single output; if multiple outputs are needed, independent decision trees can be built to handle the different outputs. Decision trees are frequently used in data mining, both to analyze data and to make predictions.
5. Strong influence point: A data point that has a strong influence on the parameter estimates of a multiple linear regression model. Because multiple linear regression estimates its parameters by least squares, all records are treated equally. When the data contains records far from the main body of the data in the multidimensional space, they pull the fitted model toward those data points. Identifying strong influence points is therefore another important issue in multiple linear regression. Strong influence points are data that greatly affect the stability and validity of the parameter estimates; in the data set of a regression model, they are the points that have a very large effect on the values of the statistics.
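As an illustration of term 2 above, the following minimal sketch fits a straight line by least squares using the closed-form solution for a single independent variable. The function name and data values are hypothetical; a multiple linear regression would solve the same minimization with several independent variables.

```python
def fit_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared errors of y ~ a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical observations lying close to the line y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = fit_line(x, y)

# Sum of squared errors of the fitted line
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
```

Any other choice of intercept and slope gives a larger sum of squared errors on these points, which is the defining property of the least squares fit.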
Research has found that in most financial scenarios the data of different dimensions are linearly related, and nonlinear relationships are rare. Based on this property, a method can be designed that analyzes sample data to automatically mine the models that hold between data of different dimensions, and then uses these models to detect abnormal data in the data to be detected, thereby improving overall data quality.
To ensure the accuracy of abnormal data detection, it is critical to determine models that reflect the real relationships between data of different dimensions and that can be interpreted in business terms. How to improve the accuracy of the models determined between data of different dimensions is therefore the focus of this work.
Based on this, an embodiment of the present invention provides the following method for determining a target model based on n groups of sample data, as shown in Figure 1, including:
Step 101: obtain initial n groups of sample data distributed over M dimensions, where each group of sample data has M dimensions.
Step 102: for a target dimension among the M dimensions, select from the M dimensions, according to the correlation coefficients of the initial n groups of sample data, K independent variable dimensions that are correlated with the target dimension; the target dimension is any one of the M dimensions.
Step 103: determine the target model according to target n groups of sample data distributed over the target dimension and the K independent variable dimensions; the target model characterizes the relationship that holds between the target dimension and the K independent variable dimensions.
In step 101, the initial n groups of sample data are obtained, each group distributed over M dimensions.
The embodiments of the present invention do not limit how the sample data are obtained: they may be read automatically from a database or imported manually. For example, training data in Excel format may be read with the pandas library, or a database may be read by other means.
Optionally, the sample data that have been read may undergo preliminary preprocessing. In particular, missing values are filled by default, with zero filling, median filling, or another method chosen according to the characteristics of the data. The data may also be cleaned into the format required for training the algorithm.
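The default filling described above could be sketched with pandas as follows. This is a minimal illustration only: the function name `fill_missing`, the column names, and the toy values are hypothetical and do not come from the source, and the small in-memory frame stands in for data that would normally be read from an Excel file or a database.

```python
import pandas as pd

def fill_missing(df: pd.DataFrame, strategy: str = "median") -> pd.DataFrame:
    """Fill missing values in numeric columns with 0 or with the column median."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    if strategy == "zero":
        out[numeric_cols] = out[numeric_cols].fillna(0)
    elif strategy == "median":
        # fillna with a Series fills each column with its own median
        out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return out

# Hypothetical stand-in for sample data read from a file or database
sample = pd.DataFrame({
    "date": ["2016-01", "2016-02", "2016-03"],
    "A": [10.0, None, 14.0],
    "B": [1.0, 2.0, None],
})
filled_median = fill_missing(sample, "median")
filled_zero = fill_missing(sample, "zero")
```

Which strategy is appropriate depends on the data: zero filling suits amounts that are genuinely absent, while median filling is less likely to distort a later regression.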
Table 1 shows an example of sample data that may be read.
Table 1
Figure PCTCN2022121926-appb-000001
Figure PCTCN2022121926-appb-000002
Table 1 contains a date column and five dimension columns. The five dimensions are: dimension A, deposits with other banks; dimension B, domestic commercial banks; dimension C, other domestic banking financial institutions; dimension D, other domestic financial institutions; dimension E, interest receivable. Table 1 contains 61 groups of sample data, covering January 2016 through January 2021.
Certain cross-check relationships exist among the dimensions. The goal of the embodiments of the present invention is to automatically mine, from a large body of sample data, the equality or approximate-equality relationships that exist among the dimensions, and to improve the accuracy of these relationships as far as possible through subsequent algorithms. A corresponding target model is thereby obtained for any target dimension, and abnormal data detection is performed on the basis of that model.
In one possible implementation, all obtained sample data are used as the initial n groups of sample data for the subsequent determination of the target model.
In another possible implementation, the obtained sample data are divided into a training set and a test set. The training set serves as the initial n groups of sample data for determining the target model, and the test set is used to test and validate the resulting target model in order to evaluate its accuracy.
If the sample data are divided into a training set and a test set, they may be divided in a given ratio. The embodiments of the present invention do not limit the ratio; 9:1 and 8:2 are examples. The sample data may be sorted by some rule before division or divided without sorting. This is not limited, because whether the sample data are sorted does not affect the accuracy of the determined target model.
The division of the obtained sample data into a training set and a test set is described below with a detailed example.
The sample data are sorted by date and divided in a training set : test set ratio of 9:1, with the first 90% of the sample data as the training set and the last 10% as the test set. In the example of Table 1, 61 groups of sample data are collected. The first 55 groups (January 2016 to July 2020) form the training set used to determine the target model; the last 6 groups (August 2020 to January 2021) form the test set used to validate and test the target model and evaluate its accuracy.
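The date-ordered 9:1 split can be sketched as follows. The helper `split_by_date` and the placeholder records are hypothetical. Rounding the training-set size to the nearest integer is an assumption, chosen because it reproduces the 55/6 split of the example (61 × 0.9 = 54.9).

```python
def split_by_date(rows, train_ratio=0.9):
    """Sort (date_key, record) pairs by date and split them into train/test."""
    ordered = sorted(rows, key=lambda row: row[0])
    n_train = round(len(ordered) * train_ratio)  # assumption: round to nearest
    return ordered[:n_train], ordered[n_train:]

# 61 monthly records, January 2016 through January 2021 (placeholder payloads)
months = [(2016 + m // 12, m % 12 + 1) for m in range(61)]
rows = [(year_month, {"placeholder": i}) for i, year_month in enumerate(months)]

train, test = split_by_date(rows)
```

Here `train` ends at July 2020 and `test` runs from August 2020 to January 2021, matching the example.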
In one possible case, if the amount of sample data obtained is small, K-fold cross-validation may be used to partition the sample data. For example, suppose only 10 groups of sample data are obtained and they are divided evenly into K parts (for example, 5 parts) of 2 groups each. In the target model determination stage, 4 parts (8 groups) may be drawn at random as the training set and the remaining part (2 groups) used as the test set, with the regression coefficients of the target model obtained from the training set. The random draw is repeated several times to produce several sets of regression coefficients, and their weighted average gives the final regression coefficients. This compensates for insufficient training caused by the small amount of sample data.
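A rough sketch of this idea with NumPy is given below. It is a simplified variant under stated assumptions: each of the K parts is held out exactly once rather than drawn at random repeatedly, the coefficient sets are averaged with equal weights (the source does not specify the weights), and the data are synthetic. The function name is hypothetical.

```python
import numpy as np

def kfold_average_coefficients(X, y, k=5, seed=0):
    """Fit least squares on each K-1 part subset and average the coefficients."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    coefs = []
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)  # indices not in the held-out part
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        coefs.append(beta)
    # Equal weights are an assumption; any weighting scheme could be used here
    return np.mean(coefs, axis=0)

# Synthetic data: 10 groups, 2 independent variable dimensions, no noise
data_rng = np.random.default_rng(1)
X_small = data_rng.uniform(1.0, 10.0, size=(10, 2))
y_small = X_small @ np.array([2.0, -1.0])
avg_beta = kfold_average_coefficients(X_small, y_small, k=5)
```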
The method for determining the target model is described below, taking as an example the first 55 groups of sample data (January 2016 to July 2020) used as the training set.
The first 55 groups of sample data form the training set, that is, the initial n groups of sample data distributed over M dimensions. In the above example, the initial n groups of sample data are distributed over 5 dimensions.
In step 102, the sample data in the 5 dimensions do not necessarily all have a linear regression relationship with one another; only some of the dimensions may. It is therefore necessary to determine, for each target dimension among the M dimensions, the independent variable dimensions that are correlated with that target dimension.
For example, for the 5 dimensions in Table 1, each dimension in turn is taken as the target dimension, and the corresponding independent variable dimensions are selected for it. The data of the target dimension and of the independent variable dimensions are then substituted into the linear regression equation $y=\theta_1 x_1+\theta_2 x_2+\theta_3 x_3+\dots+\theta_n x_n$, where $y$ is the data of the target dimension and $x_1, x_2, \dots$ are the data of the respective independent variable dimensions.
The method of selecting independent variable dimensions for a target dimension is described taking dimension A as the target dimension. In this example, dimension A is the target dimension and dimensions B, C, D and E are the candidate independent variable dimensions, from which the independent variable dimensions corresponding to dimension A are to be selected.
First, the initial n-group sample data matrix (55×5) is constructed, containing 55 groups of data in 5 dimensions. The column of the target dimension (dimension A) is moved to the last column of the matrix, and the correlation coefficient matrix r is calculated from this sample data matrix using the covariance formula.
The specific formula is as follows:
$$r=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}} \qquad \text{(Formula 1)}$$
where $X_i$ is the monthly data value of any candidate independent variable dimension and $\bar{X}$ is the mean of that dimension's 55 months of data; $Y_i$ is the monthly data value of the target dimension and $\bar{Y}$ is the mean of the target dimension's 55 months of data. Substituting these values into Formula 1 gives the correlation coefficient between the target dimension and any candidate independent variable dimension. For example, let $Y_i$ be the value of dimension A in each of the 55 months and $\bar{Y}$ the mean of dimension A over the 55 months, and let $X_i$ be the value of dimension B in each of the 55 months and $\bar{X}$ the mean of dimension B over the 55 months; substituting these into Formula 1 gives the correlation coefficient between dimension A and dimension B. The correlation coefficients between dimensions A and C, A and D, and A and E are obtained in the same way and are not listed one by one here.
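The correlation coefficient can be computed directly from the definitions above. The sketch below assumes Formula 1 is the standard Pearson correlation coefficient; the function name and the five-point toy series are illustrative only and are not data from Table 1.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of the candidate dimension from its mean
    dy = y - y.mean()  # deviations of the target dimension from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Toy series standing in for one candidate dimension and the target dimension
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
y_toy = [2.1, 3.9, 6.2, 8.1, 9.8]
r_xy = pearson(x_toy, y_toy)
```

For a full sample data matrix, `np.corrcoef(data, rowvar=False)` returns all pairwise coefficients at once, which corresponds to the correlation coefficient matrix r.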
For example, the correlation coefficients form the following correlation coefficient matrix r.
Figure PCTCN2022121926-appb-000008
The last column is the target dimension column, dimension A. From the last column, the correlation coefficient between dimension A and dimension B is 0.9976391, between dimension A and dimension C is -0.07923952, between dimension A and dimension D is 0.63029953, and between dimension A and dimension E is 0.46870661. The closer the absolute value of a correlation coefficient is to 1, the more strongly the two dimensions are correlated.
The variance contribution of each candidate independent variable dimension is then calculated from the correlation coefficient matrix r, using the following formula.
$$v_i=\frac{r(i,\mathrm{columns})^2}{r(i,i)} \qquad \text{(Formula 2)}$$
Here columns is the total number of columns of matrix r (columns=5 in this example), and r(i,j) is the value in row i, column j of the correlation coefficient matrix, so r(i,i) is the i-th diagonal element. For example, $r(1,5)^2/r(1,1)=0.9976391^2=0.99528377$. That is, the variance contribution of dimension B to the target model obtained with dimension A as the target dimension is 0.99528377.
The resulting variance contributions of dimensions B, C, D and E to the target model obtained with dimension A as the target dimension are [0.99528377 0.0062789 0.3972775 0.21968589]. The larger the variance contribution, the more the dimension benefits the target model obtained with dimension A as the target dimension.
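The variance contributions quoted above can be reproduced from the last column of r. The sketch assumes the contribution of candidate i is r(i, columns)² / r(i, i), as in the worked example r(1,5)²/r(1,1). Only the last column and the diagonal enter this calculation, so the correlations among the candidate dimensions themselves are left as zero placeholders; they are not values from the source.

```python
import numpy as np

def variance_contributions(r):
    """Variance contribution of each candidate row, using the last column of r."""
    r = np.asarray(r, dtype=float)
    last = r.shape[1] - 1  # index of the target dimension column
    return np.array([r[i, last] ** 2 / r[i, i] for i in range(last)])

# Correlations of dimensions B, C, D, E with target dimension A (from the text)
corr_with_target = np.array([0.9976391, -0.07923952, 0.63029953, 0.46870661])

r_example = np.eye(5)                  # diagonal of a correlation matrix is 1
r_example[:4, 4] = corr_with_target    # last column: correlation with A
r_example[4, :4] = corr_with_target    # symmetric counterpart
# Off-diagonal entries among B..E stay 0 here: placeholders, unused below.

v = variance_contributions(r_example)
best = int(np.argmax(v))  # 0-based index of the largest contribution
```

The largest contribution belongs to the first candidate, dimension B, which is why it is the first dimension tested for entry.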
Formula 3 is used to calculate the F value, under the F distribution, of the largest variance contribution. The largest variance contribution is that of dimension B.
Figure PCTCN2022121926-appb-000010
Here nos is n, and in is the number of candidate independent variable dimensions. In this example, n=55 and in=4.
Substituting into the formula gives an F value of 11184.801222455637 for dimension B. Looking this value up against the F distribution converts it into a p-value of 2.449050249153728e-63. In statistics, a p-value below 0.05 is generally taken to indicate that the independent variable is significant and can be introduced into the regression equation. Dimension B is therefore the first dimension taken as an independent variable dimension of the target model.
The correlation coefficient matrix r is then transformed as follows:
Let i be the current row, j the current column, and k the index of the factor with the largest variance contribution in v (here k=1). The transformation is:
When i=k and j!=k: new value of r[i,j] = r[k,j]/r[k,k];
When i!=k and j!=k: new value of r[i,j] = r[i,j]-r[i,k]*r[k,j]/r[k,k];
When i!=k and j=k: new value of r[i,j] = -r[i,k]/r[k,k];
Otherwise (i=k and j=k): new value of r[i,j] = 1/r[k,k];
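The transformation can be sketched as a single NumPy function. This assumes the first case applies to the pivot row (i = k, j ≠ k), as in the standard sweep transformation used in stepwise regression, and it uses 0-based indices, so the text's k=1 corresponds to `k=0` here. The function name and the 2×2 example matrix are illustrative only.

```python
import numpy as np

def sweep(r, k):
    """Apply the pivot transformation on row/column k (0-based) of matrix r."""
    r = np.asarray(r, dtype=float)
    pivot = r[k, k]
    # i != k and j != k: r[i,j] - r[i,k] * r[k,j] / r[k,k]
    out = r - np.outer(r[:, k], r[k, :]) / pivot
    # i == k and j != k: r[k,j] / r[k,k]
    out[k, :] = r[k, :] / pivot
    # i != k and j == k: -r[i,k] / r[k,k]
    out[:, k] = -r[:, k] / pivot
    # i == k and j == k: 1 / r[k,k]
    out[k, k] = 1.0 / pivot
    return out

example = np.array([[2.0, 1.0],
                    [1.0, 3.0]])
swept = sweep(example, 0)
restored = sweep(swept, 0)
```

With these four rules, applying the transformation twice on the same pivot returns the original matrix, which is a convenient sanity check for an implementation.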
The transformed matrix r is:
Figure PCTCN2022121926-appb-000011
The above steps of calculating the largest variance contribution are then repeated on the transformed correlation coefficient matrix r, iteratively selecting new independent variable dimensions step by step. In the end, when the target dimension is dimension A, the independent variable dimensions obtained are dimensions B, C and D.
In the same way, the independent variable dimensions are obtained when the target dimension is dimension B, dimension C, or dimension D. Details are not repeated here.
Note that different target dimensions may have the same or different numbers of independent variable dimensions. For example, when dimension A is the target dimension, there are 3 independent variable dimensions (B, C and D); when dimension B is the target dimension, there are 2 (C and D); when dimension C is the target dimension, there is 1 (D).
In step 103, the process of determining the target model for any given target dimension is described.
As an example, take dimension A as the target dimension, with dimensions B, C and D as the corresponding independent variable dimensions.
The process of determining the target model is shown in Figure 2 and includes:
Step 201: obtain target n groups of sample data distributed over the target dimension and the K independent variable dimensions.
Step 202: determine, from the target n groups of sample data and the n-1 groups of sample data other than a candidate group, the degree of influence of the candidate group of sample data on the accuracy of the target model; the candidate group is any one of the target n groups of sample data.
Step 203: according to the degrees of influence of the n candidate groups of sample data, remove w groups of strong-influence-point data from the target n groups of sample data, where 1≤w<n.
Step 204: determine the target model from the retained sample data.
In step 201, the target n groups of sample data are determined from the initial n groups of sample data. For example, when the target dimension is dimension A, the corresponding independent variable dimensions are B, C and D, so K=3. The target n groups of sample data are therefore the 55 groups of sample data distributed over the 4 dimensions A, B, C and D.
The following describes the result obtained if these 55 groups of sample data are fitted at this point, taking least squares fitting as an example.
The principle of least squares is to compute regression coefficients that bring the fitted line as close as possible to every data point. Figure 3 is a schematic of one possible fit. In Figure 3, the points are distributed evenly around the fitted line, and the distances between each point's actual value and the corresponding predicted value on the line are minimized.
In this example, the target equation is $y=\theta_1 x_1+\theta_2 x_2+\theta_3 x_3$, where $y$ is dimension A, $x_1$ is dimension B, $x_2$ is dimension C, and $x_3$ is dimension D. The regression coefficients β (the θ values in the equation) are to be calculated. They determine the slope of the line so that it fits the 55 groups of sample data as closely as possible, that is, so that the total distance between all points and the line is minimized. This total distance can be defined by the residual sum of squares (RSS).
$$RSS=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 \qquad \text{(Formula 4)}$$
where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. Subject to minimizing RSS, the regression coefficients are obtained by solving the least squares normal equation, Formula 5.
$$\beta=(X^{T}X)^{-1}X^{T}Y \qquad \text{(Formula 5)}$$
Substituting the 55 groups of sample data distributed over the 4 dimensions into the above formula gives the regression coefficients β, which form a matrix. The first fitted model obtained from the 55 groups of sample data is A=1.0053×B+0.25×C+0.9828×D. Its goodness of fit $R^2$ is 0.999 and the p-values of its parameters are low, which indicates that the first fitted model fits the 55 groups of sample data well and reflects the regularities among them.
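The normal equation β=(XᵀX)⁻¹XᵀY can be evaluated with NumPy as in the sketch below. It uses synthetic data constructed so that the target is exactly the sum of three independent variable dimensions; these are not the values of Table 1, and the function name is illustrative. Solving the linear system is used instead of forming the inverse explicitly, which gives the same result with better numerical behavior.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares coefficients from the normal equation (no intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Solve (X^T X) beta = X^T y rather than inverting X^T X explicitly
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic stand-in: 55 groups, 3 independent variable dimensions (B, C, D)
rng = np.random.default_rng(0)
X_syn = rng.uniform(1.0, 10.0, size=(55, 3))
y_syn = X_syn.sum(axis=1)  # target built as exactly 1*B + 1*C + 1*D

beta_syn = fit_ols(X_syn, y_syn)
```

With clean data the recovered coefficients are all 1; the discussion that follows concerns what happens when a few abnormal groups pull the coefficients away from such values.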
Judged superficially by goodness of fit and significance level, the first fitted model appears reasonable and fairly accurate. However, business experience and historical data show that the relationship that dimensions A, B, C and D should satisfy is A=1×B+1×C+1×D. The first fitted model obtained above is therefore inconsistent with business experience and historical data and is not interpretable in business terms. Using such a model to detect abnormal data in the data to be checked will inevitably reduce detection accuracy.
Further analysis suggests that the 55 groups of sample data may contain abnormal data, and that this caused the first fitted model to be inconsistent with business experience and historical data. Abnormal data can arise for many reasons, such as mistakes during collection or entry, or errors and anomalies in the sample data themselves.
This conjecture is verified below through leverage analysis.
The leverage of each group of sample data is analyzed on the basis of the first fitted model obtained above. Leverage reflects how strongly each group of sample data influences the regression coefficients of the first fitted model. For multiple linear regression, the leverage matrix can be derived from the normal equation for the ordinary least squares (OLS) coefficients:
$$H=X(X^{T}X)^{-1}X^{T} \qquad \text{(Formula 6)}$$
The H matrix projects the actual observed values of the sample data onto the predicted values; in other words, H converts actual observed values into predicted values. The leverage of the i-th group of sample data is the i-th diagonal element of H. The leverage statistics computed for the 55 groups of sample data in the above example are shown in Table 2.
Table 2
Time    Leverage statistic
January 2016    0.365953
February 2016    0.375185
March 2016    0.001001
April 2016    0.000212
May 2016    0.008014
June 2016    0.003456
July 2016    0.000147
August 2016    0.035591
September 2016    0.002877
October 2016    0.000216
November 2016    0.000509
December 2016    0.011052
January 2017    0.007350
February 2017    0.038006
March 2017    0.017786
April 2017    0.013282
May 2017    0.007932
June 2017    0.096534
July 2017    0.002592
August 2017    0.007581
September 2017    0.021632
October 2017    0.002901
November 2017    0.009597
December 2017    0.008300
January 2018    0.002600
February 2018    0.005619
March 2018    0.010188
April 2018    0.033633
May 2018    0.026580
June 2018    0.017939
July 2018    0.025724
August 2018    0.023665
September 2018    0.029455
October 2018    0.090673
November 2018    0.066464
December 2018    0.003673
January 2019    0.286948
February 2019    0.080626
March 2019    0.017527
April 2019    0.014001
May 2019    0.009724
June 2019    0.009663
July 2019    0.012319
August 2019    0.183356
September 2019    0.012275
October 2019    0.011623
November 2019    0.020266
December 2019    0.007863
January 2020    0.423567
February 2020    0.017872
March 2020    0.190481
April 2020    0.022284
May 2020    0.026010
June 2020    0.260477
July 2020    0.019199
The leverage values of the first two groups of sample data are 0.365953 and 0.375185, far greater than twice the mean leverage statistic, so these two groups can be judged to be relatively extreme data. With such extreme data present, the first fitted model is very likely to be inconsistent with business experience and historical data.
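The leverage values and the twice-the-mean rule of thumb can be sketched as follows. The data are synthetic, with one deliberately extreme row, and the function names and the `factor` parameter are hypothetical; the sketch does not reproduce the values of Table 2.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X^T X)^-1 X^T."""
    X = np.asarray(X, dtype=float)
    H = X @ np.linalg.solve(X.T @ X, X.T)
    return np.diag(H)

def flag_high_leverage(h, factor=2.0):
    """Indices whose leverage exceeds `factor` times the mean leverage."""
    h = np.asarray(h, dtype=float)
    return np.flatnonzero(h > factor * h.mean())

# Synthetic design matrix: 20 similar rows, one of them replaced by an extreme row
rng = np.random.default_rng(0)
X_lev = rng.uniform(0.9, 1.1, size=(20, 2))
X_lev[7] = [10.0, -5.0]  # hypothetical extreme record

h = leverages(X_lev)
flagged = flag_high_leverage(h)
```

The leverage values always sum to the number of fitted coefficients, so their mean is that number divided by n; a row far from the bulk of the data takes a disproportionate share of that total.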
However, detecting abnormal data in the sample data by leverage analysis is not accurate enough and is not generally applicable. A method is therefore provided for determining the strong-influence-point data among the target n groups of sample data; determining the relationship between the target dimension and the independent variable dimensions from the sample data that remain after the strong-influence-point data are removed is more accurate. See steps 202-204 for details.
In step 202, the degree of influence of the candidate group of sample data on the accuracy of the target model is determined from the target n groups of sample data and the n-1 groups of sample data other than the candidate group; the candidate group is any one of the target n groups of sample data.
Each group among the target n groups of sample data is traversed, and the degree to which removing that group affects the accuracy of the target model is calculated. For example, with the 1st group as the candidate group, the degree of influence of the 1st group on the accuracy of the target model is determined from the first fitted model, fitted to all 55 groups, and the second fitted model, fitted to the 54 groups other than the 1st group. The same is done with the 2nd group as the candidate group, using the 54 groups other than the 2nd group, and with the 3rd group as the candidate group, using the 54 groups other than the 3rd group. Proceeding in this way gives, for each of the 55 groups of sample data, its degree of influence on the accuracy of the target model.
The degree of influence of any candidate group of sample data is calculated as follows: fit the target n groups of sample data to obtain the first fitting coefficients of the first fitted model; fit the n-1 groups of sample data other than the candidate group to obtain the second fitting coefficients of the second fitted model; and determine the degree of influence from the first fitting coefficients, the second fitting coefficients, the number of independent variable dimensions contained in the target model, and the mean squared error of the first fitted model.
The specific formula is as follows:
$$D_i=\frac{(\hat{\beta}_{(i)}-\hat{\beta})^{T}X^{T}X(\hat{\beta}_{(i)}-\hat{\beta})}{p\,s}=\frac{(\hat{y}_{(i)}-\hat{y})^{T}(\hat{y}_{(i)}-\hat{y})}{p\,s} \qquad \text{(Formula 7)}$$
where $D_i$ is the degree of influence of the i-th group of sample data; $p$ is the number of independent variable dimensions contained in the model; $s$ is the mean squared error of the first fitted model; $\hat{\beta}$ is the regression coefficient matrix obtained by fitting the target n groups of sample data, that is, the first fitting coefficients; $\hat{\beta}_{(i)}$ is the regression coefficient matrix obtained after removing the i-th group of sample data, that is, the second fitting coefficients; $\hat{y}$ is the predicted value obtained by fitting the target n groups of sample data; and $\hat{y}_{(i)}$ is the predicted value obtained after removing the i-th group of sample data. The i-th group of sample data here is the candidate group. In this example, p=3. $s$ is calculated by the following formula:
$$s=\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-p} \qquad \text{(Formula 8)}$$
where n is the number of groups of sample data and n-p is the number of degrees of freedom of the first fitted model. In this example, n=55.
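The leave-one-out calculation described above can be sketched as follows. It assumes the degree of influence is the squared change in the predicted values when group i is removed, divided by p times the mean squared error of the full fit (the form commonly known as Cook's distance). The data are synthetic with one corrupted group, and all names are illustrative.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares coefficients from the normal equation (no intercept)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def influence(X, y):
    """Leave-one-out degree of influence for every group of sample data."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta = fit_ols(X, y)                        # first fitting coefficients
    y_hat = X @ beta
    s = ((y - y_hat) ** 2).sum() / (n - p)      # mean squared error of full fit
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = fit_ols(X[keep], y[keep])      # second fitting coefficients
        diff = X @ beta_i - y_hat               # change in predicted values
        d[i] = diff @ diff / (p * s)
    return d

# Synthetic data: target is B + C + D plus small noise, with one corrupted group
rng = np.random.default_rng(0)
X_inf = rng.uniform(1.0, 10.0, size=(50, 3))
y_inf = X_inf.sum(axis=1) + rng.normal(0.0, 0.01, size=50)
outlier_index = 4
y_inf[outlier_index] += 50.0  # hypothetical recording error

d = influence(X_inf, y_inf)
most_influential = int(np.argmax(d))
```

Refitting n times is affordable at this scale (n=55 in the example); for larger n the same quantity can be obtained from the residuals and leverages of the full fit without refitting.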
The degree of influence reflects how much each group of sample data affects the accuracy of the target model. In principle, in a normal model every group of sample data affects the model to a similar extent; the greater a group's degree of influence, the more likely that group is abnormal. Table 3 shows one possible set of degrees of influence for the groups of sample data.
Table 3
Time              Influence degree
January 2016      1.868793e+00
February 2016     2.350884e+00
March 2016        7.227036e-07
April 2016        1.102575e-06
May 2016          2.469101e-06
June 2016         1.196689e-06
July 2016         1.097542e-09
August 2016       1.162187e-04
September 2016    2.600043e-07
October 2016      7.922831e-08
November 2016     1.176777e-08
December 2016     1.186501e-05
January 2017      5.410101e-06
February 2017     2.026929e-04
March 2017        5.014946e-05
April 2017        3.358984e-05
May 2017          5.519991e-06
June 2017         8.286230e-04
July 2017         1.477823e-06
August 2017       1.028268e-07
September 2017    7.183496e-08
October 2017      5.900770e-07
November 2017     2.246217e-05
December 2017     8.486959e-06
January 2018      6.901573e-07
February 2018     6.816755e-06
March 2018        4.340092e-06
April 2018        2.721236e-05
May 2018          9.605791e-05
June 2018         4.748002e-05
July 2018         3.163452e-06
August 2018       1.223781e-06
September 2018    1.929106e-04
October 2018      1.808463e-03
November 2018     6.676218e-06
December 2018     1.287415e-05
January 2019      2.572175e-02
February 2019     1.828469e-03
March 2019        1.139297e-05
April 2019        2.404209e-05
May 2019          3.648963e-03
June 2019         1.546139e-05
July 2019         8.853532e-06
August 2019       3.971031e-04
September 2019    3.498020e-05
October 2019      2.538927e-05
November 2019     6.540673e-05
December 2019     9.437233e-05
January 2020      5.881022e-02
February 2020     6.277971e-06
March 2020        7.816644e-03
April 2020        5.175514e-04
May 2020          5.400912e-06
June 2020         5.814514e+00
July 2020         5.945416e-05
Table 3 shows, for each candidate group of sample data, the influence degree obtained after removing that candidate group.
The influence degree of a candidate group of sample data on the accuracy of the target model is determined from the first fitting coefficient, obtained by fitting with the candidate group included, and the second fitting coefficient, obtained by fitting with the candidate group excluded. This improves the accuracy of the influence degree, so that a more accurate target model can be obtained.
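The leave-one-out comparison described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the influence degree is the squared difference between the two sets of predicted values divided by p·s², with s² taken as the residual sum of squares over n-p. The helper names (`solve`, `fit`, `predict`, `influence_degrees`) and the toy data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(xs, ys):
    """Least-squares coefficients [intercept, b1, ..., bp] via the normal equations."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta, x):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def influence_degrees(xs, ys):
    """Influence degree of each group, comparing full and leave-one-out fits."""
    n, p = len(xs), len(xs[0])
    beta_full = fit(xs, ys)  # first fitting coefficient (all n groups)
    y_full = [predict(beta_full, x) for x in xs]
    # Assumption: s^2 is the residual sum of squares divided by n - p.
    s2 = sum((y - yh) ** 2 for y, yh in zip(ys, y_full)) / (n - p)
    degrees = []
    for i in range(n):
        # Second fitting coefficient: refit without the i-th (candidate) group.
        beta_i = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        y_i = [predict(beta_i, x) for x in xs]
        diff = sum((a - b) ** 2 for a, b in zip(y_full, y_i))
        degrees.append(diff / (p * s2))
    return degrees


# Toy data (hypothetical): y is roughly 2x + 1, with the last group corrupted.
xs = [[float(i)] for i in range(10)]
ys = [2.0 * i + 1.0 + 0.01 * (-1) ** i for i in range(10)]
ys[9] += 5.0
degrees = influence_degrees(xs, ys)
```

In this toy run the corrupted last group receives the largest influence degree, which is the behaviour the text relies on when it treats large influence degrees as a sign of abnormal sample data.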
In step 203, w groups of strong influence point data are removed from the target n groups of sample data according to the influence degrees of the n candidate groups of sample data, where 1≤w<n.
Strong influence point data has a large effect on the accuracy of the target model and should therefore be removed. The embodiments of the present invention do not limit how strong influence point data is determined.
In one possible approach, the threshold is set according to the experience and needs of the operation and maintenance personnel: if the accuracy requirement for the target model is high, the threshold for strong influence point data is set larger; if the accuracy requirement is relatively low, the threshold is set somewhat lower. For example, the threshold is set empirically to 4/n, where n is the number of groups in the target n groups of sample data. In this example, n=55. If the influence degree of any candidate group of sample data is greater than the threshold, that candidate group is determined to be strong influence point data and is removed.
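As a small illustration of the empirical 4/n rule, the sketch below flags the groups whose influence degree exceeds the threshold. The function name `strong_influence_groups` is hypothetical; the four explicit values are the largest influence degrees listed in Table 3, and the remaining groups are filled with a small placeholder value rather than their actual figures.

```python
def strong_influence_groups(influence, threshold=None):
    """Return the 1-based indices of groups whose influence degree exceeds the threshold."""
    n = len(influence)
    if threshold is None:
        threshold = 4 / n  # empirical default described in the text
    return [i for i, d in enumerate(influence, start=1) if d > threshold]


# 55 groups: placeholder 1e-5 everywhere except the four largest Table 3 values.
influence = [1e-5] * 55
influence[0] = 1.868793e+00   # group 1, January 2016
influence[1] = 2.350884e+00   # group 2, February 2016
influence[48] = 5.881022e-02  # group 49, January 2020
influence[53] = 5.814514e+00  # group 54, June 2020

removed = strong_influence_groups(influence)
```

With n=55 the threshold is 4/55, about 0.0727, so January 2020 (about 0.0588) stays below it and only groups 1, 2 and 54 are flagged.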
In another possible approach, the F distribution is used to determine strong influence point data. Specifically, for any candidate group of sample data, if its influence degree on the accuracy of the target model is greater than the first quartile of the F distribution with (p, n-p-1) degrees of freedom, the candidate group is determined to be strong influence point data, where p is the number of independent variable dimensions included in the target model; the strong influence point data is then removed from the target n groups of sample data.
For example, with p=3 and n=55, strong influence point data is determined using the F distribution with (3, 51) degrees of freedom. The influence degree of each candidate group in Table 3 is compared with the first quartile of that distribution; if the influence degree is greater than this value, the candidate group is determined to be strong influence point data.
Using the first quartile of the F distribution with (p, n-p-1) degrees of freedom as the cut-off for the influence degree is more scientifically grounded and improves the accuracy of determining strong influence point data, so that a more accurate target model can be obtained.
Once the strong influence point data has been determined, it is removed. For example, when the first approach provided by the embodiments of the present invention is used, the 1st, 2nd and 54th groups of sample data are removed. Verification shows that the 1st and 2nd groups are indeed abnormal samples, whereas the 54th group actually conforms to the model but fluctuates strongly. Although this method cannot remove only the abnormal sample data and may also remove a small number of normal samples such as the 54th group, removing a few normal samples does not materially affect the target model.
In step 204, the target model is determined based on the retained sample data.
For example, after the 3 groups of strong influence point data are removed in step 203, the target model is determined from the remaining 52 groups of sample data. The resulting target model is A=1×B+1×C+1×D+1.31e-10, where the intercept constant 1.31e-10 is negligible.
After the strong influence point data is removed, the resulting target model is consistent with business experience and historical data and is interpretable in business terms.
The above describes the process of determining the target model when the target dimension is dimension A. When the target dimension is dimension B, C, D or E, the corresponding target model can be determined by following steps 201-204 in the same way. Different target models may contain different dimensions. In this way, 5 target models are determined for the 5 dimensions.
The initial n groups of sample data are distributed over M dimensions, but these M dimensions are not necessarily all correlated, so the correlated dimensions need to be selected. For any target dimension, the K independent variable dimensions correlated with it are selected according to the correlation coefficients of the initial n groups of sample data. Each dimension can thus serve as a target dimension, and each target dimension together with its independent variable dimensions determines one target model. Covering this wider range of scenarios increases the accuracy of the determined target models.
The influence degree of each candidate group of sample data on the accuracy of the target model is determined from the target n groups of sample data and the n-1 groups that remain after excluding the candidate group; w groups of strong influence point data are removed from the target n groups according to the influence degrees; and the target model is determined from the retained sample data. Sample data that strongly affects the accuracy of the target model is thereby removed, its effect on the final target model is minimized, and the accuracy of the target model is improved. Detecting abnormal data in the data to be detected with a more accurate target model in turn improves detection accuracy.
In some embodiments, the sample data in the test set is also used to test and verify each target model obtained above.
For example, in step 101 the n groups of sample data were divided into a training set and a test set; the training set serves as the initial n groups of sample data for the subsequent determination of the target model, and the test set is used to test and verify the target model. Alternatively, the test set can be obtained in other ways; for example, highly accurate sample data supplied by operation and maintenance personnel and confirmed to contain no abnormal data can also serve as the test set.
Accordingly, after the target model is determined from the retained sample data, the method further includes: inputting test data into the target model for testing; obtaining the average absolute error rate of the target model; and determining that the fitting degree parameter of the target model on the retained sample data and the average absolute error rate each satisfy a preset threshold.
For example, after step 204, 5 target models corresponding to the 5 dimensions are obtained, together with a fitting degree parameter for each target model that characterizes how well that model fits. The 6 pieces of test data in the test set are input into the 1st target model, and the average absolute error rate is calculated according to the following formula:
Figure PCTCN2022121926-appb-000020
Here y is the actual value of each of the 6 pieces of test data, and ŷ is the predicted value obtained from the target model. This gives the average absolute error rate of the 1st target model.
The average absolute error rate of each of the other target models can be obtained in the same way.
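The formula image is not reproduced in this text. The sketch below assumes it is the standard mean absolute percentage error, that is, the mean over the test data of |y - ŷ| / |y|; the function name and the sample numbers are hypothetical.

```python
def average_absolute_error_rate(actual, predicted):
    """Mean of |y - y_hat| / |y| over the test data (assumed form of the formula)."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equally long")
    # Note: an actual value of 0 would cause a division by zero here.
    total = sum(abs(y - y_hat) / abs(y) for y, y_hat in zip(actual, predicted))
    return total / len(actual)


# Hypothetical test data: relative errors of 10%, 5% and 0%.
rate = average_absolute_error_rate([100.0, 200.0, 50.0], [110.0, 190.0, 50.0])
```

Here the three relative errors average to 0.05, which would then be compared against the preset threshold.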
In one possible approach, the target models are screened according to their average absolute error rates and fitting degree parameters. For example, 3 of the target models are selected for the subsequent detection of abnormal data in the data to be detected.
In another possible approach, each target model is scored according to its average absolute error rate and fitting degree parameter, yielding absolute equations, which satisfy a first preset condition, and approximate equations, which satisfy a second preset condition. When the target models are later used for abnormal data detection, the scores from absolute equations and from approximate equations can be given different weights. For example, a target model whose average absolute error rate is less than 0.01 and whose fitting degree parameter is greater than 0.999 is determined to be an absolute equation; a target model whose average absolute error rate is greater than or equal to 0.01 and less than 0.1 and whose fitting degree parameter is greater than 0.9 is determined to be an approximate equation.
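The example thresholds above can be written as a small classifier. This is only a sketch of the stated conditions; the function name `classify_model` and the returned labels are hypothetical.

```python
def classify_model(error_rate, fit_degree):
    """Classify a target model using the example thresholds from the text."""
    if error_rate < 0.01 and fit_degree > 0.999:
        return "absolute"      # absolute equation (first preset condition)
    if 0.01 <= error_rate < 0.1 and fit_degree > 0.9:
        return "approximate"   # approximate equation (second preset condition)
    return None                # satisfies neither condition


labels = [
    classify_model(0.005, 0.9995),
    classify_model(0.05, 0.95),
    classify_model(0.2, 0.95),
    classify_model(0.005, 0.95),
]
```

Note that, read literally, a model with an error rate below 0.01 but a fitting degree parameter of at most 0.999 satisfies neither example condition, which is why the last call returns `None`.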
In this way, the target models corresponding to the target dimensions are not all used directly to detect abnormal data; instead, they are screened once more. Test data is input into each target model to obtain its average absolute error rate on the test set. If the fitting degree parameter and the average absolute error rate of a target model each satisfy the preset thresholds, the model fits accurately and can be used for subsequent abnormal data detection, which improves the accuracy of abnormal data detection.
To better explain the embodiments of the present invention, the above process of determining the target model is described below in a specific implementation scenario. Figure 4 shows a detailed flow chart for determining the target model.
Step 401: Read the sample data.
Step 402: Preprocess the sample data.
Step 403: Divide the sample data into a training set and a test set.
The training set serves as the initial n groups of sample data for determining the target model.
Step 404: For any target dimension, select from the M dimensions, according to the correlation coefficients of the initial n groups of sample data, the K independent variable dimensions that are correlated with the target dimension.
Step 405: Obtain the target n groups of sample data distributed over the target dimension and the K independent variable dimensions.
Step 406: For any candidate group of sample data, determine the influence degree of the candidate group on the accuracy of the target model from the target n groups of sample data and the n-1 groups of sample data other than the candidate group.
Step 407: Determine whether the influence degree is greater than 4/n. If it is, go to step 408; otherwise, go to step 409.
Step 408: Remove the candidate group of sample data.
Step 409: Retain the candidate group of sample data.
Step 410: Determine the target model corresponding to the target dimension from the retained sample data.
Steps 404-410 are repeated to obtain multiple target models corresponding to multiple dimensions.
Step 411: Evaluate each target model with the test set data to obtain the absolute equations and approximate equations.
Next, the use of the obtained target models to detect abnormal data in the data to be detected is described.
Figure 5 shows a possible abnormal data detection method, including:
Step 501: Input W pieces of data to be detected, distributed over W dimensions, into the target models corresponding to the target dimensions.
Step 502: For any target model, if the W pieces of data to be detected do not satisfy the target model, determine each dimension contained in the target model to be an abnormal dimension.
Step 503: For any abnormal dimension, determine the abnormality probability with which the dimension is determined to be abnormal across the target models.
Step 504: Determine, according to the abnormality probability, whether the data to be detected corresponding to the abnormal dimension is abnormal data.
In step 501, W pieces of data to be detected, distributed over W dimensions, are input into the target models corresponding to the target dimensions. The embodiments of the present invention do not limit the W dimensions.
In one possible case, the W dimensions include at least the M dimensions used to determine the target models. For example, the W dimensions may be exactly the same as those M dimensions, namely the 5 dimensions A, B, C, D and E. Alternatively, the W dimensions may include other dimensions in addition to the M dimensions.
In another possible case, the W dimensions include only some of the M dimensions. For example, if testing on the test set eliminates some of the target models so that none of the remaining target models contains a certain dimension, the data to be detected need not include that dimension. For instance, of the 5 target models determined earlier for the 5 dimensions, only 3 pass the test on the test set, so only these 3 target models are used to check the data to be detected. If these 3 target models involve only the 4 dimensions A, B, C and D, the W dimensions of the data to be detected can likewise be only dimensions A, B, C and D, without dimension E.
Table 4 shows an example of possible data to be detected.
Table 4
Figure PCTCN2022121926-appb-000021
The W pieces of data to be detected are input into each target model, for example into 3 target models. The 3 target models are:
Target model 1: A=B+C+D;
Target model 2: B=C+D;
Target model 3: C=A+B.
In step 502, for any target model, if the W pieces of data to be detected do not satisfy the target model, each dimension contained in the target model is determined to be an abnormal dimension.
For example, the data to be detected shown in Table 4 is input into target model 1; the average absolute error rate does not meet the preset threshold, so dimensions A, B, C and D contained in target model 1 are determined to be abnormal dimensions. The same data is input into target model 2; the average absolute error rate meets the preset threshold, so no action is taken. The same data is input into target model 3; the average absolute error rate does not meet the preset threshold, so dimensions A, B and C contained in target model 3 are determined to be abnormal dimensions.
In summary, dimensions A, B, C and D are determined to be abnormal dimensions.
In step 503, for any abnormal dimension, the abnormality probability with which the dimension is determined to be abnormal across the target models is determined.
In one possible approach, the ratio of the number of target models in which the dimension is determined to be abnormal to the number of target models in which the dimension appears is calculated, and the abnormality probability of the abnormal dimension is determined from this ratio.
For example, dimension A appears in 2 of the target models and is determined to be abnormal in both (target model 1 and target model 3), so the abnormality probability of dimension A is 2/2=1. Dimension B appears in 3 of the target models and is determined to be abnormal in 2 of them (target model 1 and target model 3), so the abnormality probability of dimension B is 2/3. Dimension C appears in 3 of the target models and is determined to be abnormal in 2 of them (target model 1 and target model 3), so the abnormality probability of dimension C is 2/3. Dimension D appears in 2 of the target models and is determined to be abnormal in 1 of them (target model 1), so the abnormality probability of dimension D is 1/2.
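The counting in this example can be reproduced with a few lines of code. The sketch below is illustrative only: the function name `abnormality_probabilities` and the model labels are hypothetical, while the dimension sets and the violated models are taken from the example above.

```python
from fractions import Fraction


def abnormality_probabilities(model_dims, violated):
    """Ratio of 'flagged as abnormal' counts to 'appears in a model' counts per dimension."""
    appear, flagged = {}, {}
    for name, dims in model_dims.items():
        for d in dims:
            appear[d] = appear.get(d, 0) + 1
            if name in violated:
                flagged[d] = flagged.get(d, 0) + 1
    return {d: Fraction(flagged[d], appear[d]) for d in flagged}


# Dimensions contained in each target model of the example.
model_dims = {
    "model1": ["A", "B", "C", "D"],  # A = B + C + D
    "model2": ["B", "C", "D"],       # B = C + D
    "model3": ["C", "A", "B"],       # C = A + B
}
# Models whose average absolute error rate failed the preset threshold.
probs = abnormality_probabilities(model_dims, violated={"model1", "model3"})
```

Using `Fraction` keeps the ratios exact, so the result matches the values 1, 2/3, 2/3 and 1/2 stated in the text.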
In one possible approach, the probability values obtained from absolute equations and from approximate equations are given different weights so that abnormal data is located more accurately. For example, if target model 1 is an absolute equation, then each time target model 1 determines a dimension to be abnormal, that occurrence is multiplied by 1; if target model 3 is an approximate equation, then each time target model 3 determines a dimension to be abnormal, that occurrence is multiplied by 0.8.
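The text does not spell out exactly how the weights enter the ratio. The sketch below shows one possible reading, in which only the flagged occurrences (the numerator) are weighted and the number of appearances (the denominator) stays unweighted. That reading, the function name `weighted_abnormality`, and the model labels are all assumptions.

```python
def weighted_abnormality(model_dims, violated, weights):
    """Weighted flagged count divided by the plain appearance count, per dimension."""
    appear, score = {}, {}
    for name, dims in model_dims.items():
        for d in dims:
            appear[d] = appear.get(d, 0) + 1
            if name in violated:
                # Each flagged occurrence counts with the weight of its model.
                score[d] = score.get(d, 0.0) + weights[name]
    return {d: score[d] / appear[d] for d in score}


model_dims = {
    "model1": ["A", "B", "C", "D"],
    "model2": ["B", "C", "D"],
    "model3": ["C", "A", "B"],
}
# Weights from the example: model 1 is an absolute equation, model 3 an approximate one.
weights = {"model1": 1.0, "model3": 0.8}
weighted = weighted_abnormality(model_dims, {"model1", "model3"}, weights)
```

Under this reading dimension A scores (1 + 0.8) / 2 = 0.9, dimensions B and C score 1.8 / 3 = 0.6, and dimension D scores 1 / 2 = 0.5, so A still ranks highest.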
In step 504, whether the data to be detected corresponding to the abnormal dimension is abnormal data is determined according to the abnormality probability.
The abnormality probability of each abnormal dimension is compared with a preset threshold. If it is greater than the preset threshold, the data to be detected corresponding to that dimension is determined to be abnormal data; otherwise, it is determined not to be abnormal data. The preset threshold can be set according to the experience and needs of those skilled in the art and is not limited here. Alternatively, the data to be detected corresponding to the N abnormal dimensions with the highest abnormality probabilities is determined to be abnormal data.
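Both decision rules in step 504 are simple to express in code. In the sketch below the function names and the threshold value 0.9 are hypothetical; the probabilities are those of the earlier example.

```python
def abnormal_by_threshold(probs, threshold):
    """Dimensions whose abnormality probability is greater than the preset threshold."""
    return sorted(d for d, p in probs.items() if p > threshold)


def abnormal_top_n(probs, n):
    """The n dimensions with the highest abnormality probabilities (ties broken by name)."""
    ranked = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [d for d, _ in ranked[:n]]


probs = {"A": 1.0, "B": 2 / 3, "C": 2 / 3, "D": 0.5}
by_threshold = abnormal_by_threshold(probs, threshold=0.9)  # hypothetical threshold
top_one = abnormal_top_n(probs, 1)
```

With either rule only dimension A is selected in this example, so its data to be detected would be reported as abnormal.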
For example, the data to be detected 100.5 corresponding to dimension A is determined to be abnormal data.
The abnormal dimensions are determined by inputting the W pieces of data to be detected, distributed over W dimensions, into the target models corresponding to the target dimensions. The abnormality probability of each abnormal dimension, that is, how often it is determined to be abnormal across the target models, is then determined, and this probability is used to decide whether the data to be detected in that dimension is abnormal data. This not only detects that abnormal data exists among the W pieces of data to be detected, but also pinpoints which dimension's data is abnormal. Abnormal data is thus located automatically and accurately, without further manual inspection.
The embodiments of the present invention further provide another abnormal data detection method. After the dimensions contained in the target model are determined to be abnormal dimensions, the method further includes: for any abnormal dimension, obtaining the historical data corresponding to the abnormal dimension; and determining an anomaly score for the data to be detected corresponding to the abnormal dimension by clustering that data together with the historical data. Determining, according to the abnormality probability, whether the data to be detected corresponding to the abnormal dimension is abnormal data then includes: determining, according to the abnormality probability and the anomaly score, whether the data to be detected corresponding to the abnormal dimension is abnormal data.
The above method applies to any abnormal dimension. For example, for the data to be detected in dimension A, the historical data of dimension A is also obtained, and the data to be detected and the historical data of dimension A are clustered to obtain the anomaly score of dimension A.
By obtaining the historical data of the abnormal dimension and clustering it together with the data to be detected in that dimension, an anomaly score is obtained for the data to be detected, and the abnormality probability and the anomaly score are combined to determine whether that data is abnormal. Combining the two judgments takes into account both the probability with which the dimension is determined to be abnormal and the historical data of that dimension, which increases the accuracy of determining abnormal data.
The embodiments of the present invention do not specifically limit the clustering method used to obtain the anomaly score.
In one possible approach, k-means is used for clustering, and the distance between the data to be detected in any abnormal dimension and the historical data is taken as the anomaly score. For example, if the data to be detected in dimension A is far from all of the historical data, the similarity is low and the anomaly score is small.
In another possible approach, the anomaly score of any abnormal dimension is obtained by constructing isolated binary trees. Specifically, an isolated binary tree is constructed from the data to be detected corresponding to the abnormal dimension and the historical data, and the anomaly score of the data to be detected corresponding to the abnormal dimension is calculated in the isolated binary tree.
For example, the historical data obtained for dimension A is 19.49, 20.23, 25.34, 49.12, 36.66. An ensemble learning method is applied to the data to be detected together with the historical data (19.49, 20.23, 25.34, 49.12, 36.66, 100.5): the procedure is iterated N times (for example, 100), building one isolated binary tree per iteration. Following the decision tree algorithm, the data to be detected and the historical data are split at random; each split can produce an isolated leaf node, and splitting continues until the tree reaches the specified height or no further split is possible, at which point the algorithm ends.
The specific steps for constructing an isolated binary tree are as follows. (1) Choose a random split point between the minimum and maximum of all the sample data (19.49 and 100.5); suppose the random value is 60.2. (2) Place the data nodes greater than the split value 60.2 on the right branch of the tree and the data nodes less than or equal to 60.2 on the left branch. (3) Repeat steps (1) and (2) on each branch until every data node has been split off into an isolated leaf node or the tree reaches the specified height.
The first isolated tree constructed at random by these steps is shown in Figure 6; the 5 random split points are (60.2, 34, 42.2, 22.5, 20).
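Figure 6 is not reproduced in this text, but the tree can be retraced from the six values and the five listed split points. The sketch below assumes that each node is split with the first listed split point that actually separates the values in that node; the function name `path_edges` is hypothetical.

```python
def path_edges(values, splits, x):
    """Number of edges from the root to the leaf that isolates x."""
    node, edges = list(values), 0
    while len(node) > 1:
        # First listed split point that separates the values in this node.
        split = next(s for s in splits if min(node) <= s < max(node))
        # Values greater than the split go right, the rest go left; follow x's side.
        node = [v for v in node if (v > split) == (x > split)]
        edges += 1
    return edges


values = [19.49, 20.23, 25.34, 49.12, 36.66, 100.5]
splits = [60.2, 34, 42.2, 22.5, 20]
depths = {v: path_edges(values, splits, v) for v in values}
```

In this tree 100.5 is isolated after a single split at 60.2, whereas every historical value needs three or four splits, which is why a short path indicates a likely anomaly.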
The PathLength of each leaf node, h(x), is calculated as:
h(x) = e + c(T.size)    (Formula 10)
Here e is the number of edges traversed from the root node of the tree to the leaf node, that is, the number of splits; T.size is the number of samples in the same leaf node as sample x; and c(T.size) can be regarded as a correction term, the average path length of a binary tree built from T.size samples.
Figure PCTCN2022121926-appb-000022
where o is Euler's constant, 0.5772156649. Taking the PathLength of the sample 100.5 as an example: e, the number of edges from the root node to the node 100.5, is 1, and T.size = 1, so the PathLength of the node 100.5 is 1 + c(1), where c(1) is obtained by substituting into the formula for c(n) above.
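The formula image for c(n) is not reproduced in this text. Assuming it is the average path length used in the standard isolation forest algorithm (an assumption; the text only confirms that Euler's constant o appears in it), it would take the form:

```latex
c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad H(i) \approx \ln(i) + o, \qquad o = 0.5772156649
```

Under that assumption c(6) = 2(ln 5 + 0.5772156649) - 10/6 ≈ 2.7066. The expression cannot be evaluated directly at n = 1 because ln(0) is undefined; the usual convention sets c(1) = 0, which would give h(100.5) = 1 + 0 = 1 in the first tree.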
To ensure randomness, and thereby the accuracy of the anomaly score, the above procedure is repeated at random N times (100 by default) to construct 100 random isolation binary trees, and the PathLength h(x) of the node 100.5 is calculated on each tree. The isolation forest anomaly score is then calculated according to the following formula:
Figure PCTCN2022121926-appb-000023
where n is the number of samples, E(h(x)) is the mean PathLength of the sample over the 100 isolation trees, and c(n) is the average path length of a tree built from n samples. In the example above, n is the sample count 6, so c(6) is computed with the formula for c(n) given earlier. This yields the anomaly score of the data to be detected for dimension A, that is, of the node for the sample value 100.5.
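The scoring procedure can be sketched end to end as follows. The score formula image is not reproduced in this text, so the sketch assumes the standard isolation forest score s(x, n) = 2^(-E(h(x)) / c(n)) and the standard form of c(n), with c(1) = 0 and c(2) = 1 by convention. The function names, the fixed random seed and the `max_depth` default are illustrative assumptions.

```python
import math
import random

EULER = 0.5772156649  # the constant written as "o" in the text

def c(n):
    """Average path length of a binary tree built from n samples (assumed standard form)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER) - 2.0 * (n - 1) / n

def path_length(x, values, rng, depth=0, max_depth=8):
    """h(x) = e + c(T.size): follow x down one randomly built tree."""
    lo, hi = min(values), max(values)
    if len(values) <= 1 or lo == hi or depth >= max_depth:
        return depth + c(len(values))
    split = rng.uniform(lo, hi)
    # Keep only the branch that contains x; cuts in other branches do not change h(x).
    side = [v for v in values if (v <= split) == (x <= split)]
    return path_length(x, side, rng, depth + 1, max_depth)

def anomaly_score(x, values, n_trees=100, seed=0):
    """Average h(x) over n_trees random trees and normalise by c(n)."""
    rng = random.Random(seed)
    mean_h = sum(path_length(x, values, rng) for _ in range(n_trees)) / n_trees
    return 2.0 ** (-mean_h / c(len(values)))

samples = [19.49, 20.23, 25.34, 49.12, 36.66, 100.5]
score_outlier = anomaly_score(100.5, samples)   # the data to be detected
score_inlier = anomaly_score(25.34, samples)    # a historical value, for comparison
print(score_outlier, score_inlier)
```

A score closer to 1 means the value is isolated after few cuts. The value 100.5 lies far from the historical data, so it scores higher than a value inside the cluster such as 25.34.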
Afterwards, whether the data to be detected corresponding to the abnormal dimension is abnormal data is determined according to the anomaly probability of that abnormal dimension and the anomaly score.
Constructing isolation binary trees from the data to be detected corresponding to the abnormal dimension and the historical data improves the accuracy with which abnormal data is identified.
Based on the same technical concept, Figure 7 shows by way of example the structure of an abnormal data detection apparatus provided by an embodiment of the present invention; the apparatus can carry out the abnormal data detection procedure.
As shown in Figure 7, the apparatus specifically includes:
a processing unit 701, configured to:
input W data items to be detected, distributed over W dimensions, into the target models corresponding to the respective target dimensions; the target model corresponding to any target dimension is obtained from the sample data that remains after strong influence point data has been removed; the strong influence point data is sample data whose degree of influence on the accuracy of the target model does not satisfy a preset condition;
for any target model, if it is determined that the W data items to be detected do not satisfy the target model, determine each dimension contained in the target model to be an abnormal dimension;
for any abnormal dimension, determine the anomaly probability with which the abnormal dimension is determined to be an abnormal dimension across the target models;
determine, according to the anomaly probability, whether the data to be detected corresponding to the abnormal dimension is abnormal data.
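As a rough illustration of the anomaly probability step, the sketch below takes one plausible reading: the probability for a dimension is the share of the target models containing that dimension which the data to be detected failed to satisfy. This definition, the function name `anomaly_probabilities` and the toy model layout are assumptions made for illustration; this section does not spell out the exact formula.

```python
from collections import Counter

def anomaly_probabilities(models, failed):
    """models: {model name: dimensions it contains}; failed: models the data did not satisfy."""
    appears = Counter(d for dims in models.values() for d in dims)
    flagged = Counter(d for name in failed for d in models[name])
    # Only dimensions flagged at least once count as abnormal dimensions.
    return {d: flagged[d] / appears[d] for d in flagged}

# Made-up layout: the model for target dimension A uses B and C as independent variables, etc.
models = {"A": ["A", "B", "C"], "B": ["B", "A"], "C": ["C", "A", "D"]}
probs = anomaly_probabilities(models, failed=["A", "B"])
print(probs)
```

In this toy layout, dimension B is flagged by every model that contains it, so it receives the highest probability, while dimension D is never flagged and does not appear in the result.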
In some embodiments, the processing unit 701 is further configured to:
for any abnormal dimension, obtain the historical data corresponding to the abnormal dimension;
determine the anomaly score of the data to be detected corresponding to the abnormal dimension by clustering that data together with the historical data;
the processing unit 701 is specifically configured to:
determine, according to the anomaly probability and the anomaly score, whether the data to be detected corresponding to the abnormal dimension is abnormal data.
In some embodiments, the processing unit 701 is specifically configured to:
construct an isolation binary tree from the data to be detected corresponding to the abnormal dimension and the historical data;
calculate the anomaly score of the data to be detected corresponding to the abnormal dimension in the isolation binary tree.
In some embodiments, the processing unit 701 is specifically configured to:
obtain initial n groups of sample data distributed over M dimensions, where each group of sample data has M dimensions;
for a target dimension among the M dimensions, select from the M dimensions, according to the correlation coefficients of the initial n groups of sample data, K independent-variable dimensions that are correlated with the target dimension; the target dimension is any one of the M dimensions;
determine the target model according to target n groups of sample data distributed over the target dimension and the K independent-variable dimensions; the target model characterizes the relationship satisfied between the target dimension and the K independent-variable dimensions.
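The selection of independent-variable dimensions can be sketched as below. The sketch assumes the Pearson correlation coefficient and a cut-off of 0.8 on its absolute value; both are illustrative choices, since this section only says that correlation coefficients are used. The sample values are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def select_dimensions(data, target, threshold=0.8):
    """data: {dimension: n sample values}. Return the dimensions correlated with target."""
    return [d for d in data
            if d != target and abs(pearson(data[d], data[target])) >= threshold]

data = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.1, 3.9, 6.2, 8.0, 9.9],   # roughly 2 * A
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],   # no clear relation to A
}
selected = select_dimensions(data, "A")
print(selected)
```

Here only dimension B would be kept as an independent variable for target dimension A, so K = 1.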
In some embodiments, the processing unit 701 is specifically configured to:
obtain the target n groups of sample data distributed over the target dimension and the K independent-variable dimensions;
determine, according to the target n groups of sample data and the n-1 groups of sample data other than a candidate group of sample data, the degree of influence of the candidate group of sample data on the accuracy of the target model; the candidate group of sample data is any one group among the target n groups of sample data;
according to the degrees of influence of the n candidate groups of sample data, remove w groups of strong influence point data from the target n groups of sample data, where 1 ≤ w < n;
determine the target model according to the retained sample data.
In some embodiments, the processing unit 701 is specifically configured to:
fit the target n groups of sample data to obtain first fitting coefficients of a first fitted model;
fit the n-1 groups of sample data other than the candidate group of sample data to obtain second fitting coefficients of a second fitted model;
determine the degree of influence according to the first fitting coefficients, the second fitting coefficients, the number of independent-variable dimensions contained in the target model, and the mean square error of the first fitted model.
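The four inputs listed above match those of Cook's distance, so the sketch below assumes that measure. It is restricted to a single independent variable (K = 1) to stay short, and it divides by (p + 1) times the mean square error, counting the intercept as a parameter; whether the application divides by p or by p + 1 is not stated here, so that choice is an assumption. The sample values are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    return my - b1 * mx, b1

def influence(xs, ys, i, p=1):
    """Cook's-distance-style influence of sample group i (p = number of independent variables)."""
    n = len(xs)
    b0, b1 = fit_line(xs, ys)                                     # first fit: all n groups
    c0, c1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])   # second fit: n-1 groups
    mse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys)) / (n - p - 1)
    d0, d1 = c0 - b0, c1 - b1
    sx, sxx = sum(xs), sum(x * x for x in xs)
    # Quadratic form (b' - b)^T X^T X (b' - b) for the design matrix with rows [1, x].
    quad = n * d0 * d0 + 2 * sx * d0 * d1 + sxx * d1 * d1
    return quad / ((p + 1) * mse)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.0, 10.1, 30.0]   # the last group deviates strongly from y = 2x
scores = [influence(xs, ys, i) for i in range(len(xs))]
print(scores)
```

The last group pulls the fitted line far from the other five, so its influence is the largest; it would be the first candidate for removal as strong influence point data.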
In some embodiments, the processing unit 701 is specifically configured to:
for any candidate group of sample data, if the degree of influence of the candidate group of sample data on the accuracy of the target model is greater than the first quartile of the F distribution with (p, n-p-1) degrees of freedom, determine the candidate group of sample data to be strong influence point data, where p is the number of independent-variable dimensions contained in the target model;
remove the strong influence point data from the target n groups of sample data.
In some embodiments, the processing unit 701 is further configured to:
input test data into the target model for testing to obtain the mean absolute error rate of the target model;
determine that the goodness-of-fit parameter with which the target model fits the retained sample data and the mean absolute error rate each satisfy a preset threshold.
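The acceptance check on the target model can be sketched as follows, assuming the goodness-of-fit parameter is the coefficient of determination R² and the mean absolute error rate is the mean of |actual - predicted| / |actual|. The threshold values 0.9 and 0.1 and the function names are hypothetical; the text only says that preset thresholds exist.

```python
def r_squared(actual, predicted):
    """Coefficient of determination of the predictions against the actual values."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def mean_abs_error_rate(actual, predicted):
    """Mean of |actual - predicted| / |actual|; actual values must be non-zero."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def model_accepted(fit_actual, fit_pred, test_actual, test_pred,
                   min_r2=0.9, max_error_rate=0.1):
    """Both the fit on retained samples and the error on test data must pass."""
    return (r_squared(fit_actual, fit_pred) >= min_r2
            and mean_abs_error_rate(test_actual, test_pred) <= max_error_rate)

accepted = model_accepted([1.0, 2.0, 3.0], [1.0, 2.0, 3.0],
                          [100.0, 200.0], [105.0, 190.0])
print(accepted)
```

The fit is measured on the retained sample data and the error rate on separate test data, mirroring the two conditions listed above.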
Based on the same technical concept, an embodiment of the present application provides a computer device which, as shown in Figure 8, includes at least one processor 801 and a memory 802 connected to the at least one processor. The embodiments of the present application do not limit the specific connection medium between the processor 801 and the memory 802; in Figure 8 the processor 801 and the memory 802 are connected by a bus, as an example. Buses can be divided into address buses, data buses, control buses and so on.
In the embodiments of the present application, the memory 802 stores instructions executable by the at least one processor 801, and by executing the instructions stored in the memory 802 the at least one processor 801 can perform the steps of the abnormal data detection method described above.
The processor 801 is the control center of the computer device. It can connect the various parts of the computer device through various interfaces and lines, and performs abnormal data detection by running or executing the instructions stored in the memory 802 and calling the data stored in the memory 802. In some embodiments, the processor 801 may include one or more processing units and may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor need not be integrated into the processor 801. In some embodiments, the processor 801 and the memory 802 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.
The processor 801 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be carried out directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory 802, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules. The memory 802 may include at least one type of storage medium, for example flash memory, a hard disk, a multimedia card, card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, an optical disc and so on. The memory 802 may be, without limitation, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 802 in the embodiments of the present application may also be a circuit or any other apparatus capable of providing a storage function, used to store program instructions and/or data.
Based on the same technical concept, an embodiment of the present invention further provides a computer-readable storage medium that stores a computer-executable program, the computer-executable program being used to cause a computer to execute the abnormal data detection method set out in any of the above implementations.
Those skilled in the art will understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the present application without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to cover them.

Claims (11)

  1. An abnormal data detection method, characterized by comprising:
    inputting W data items to be detected, distributed over W dimensions, into the target models corresponding to the respective target dimensions; the target model corresponding to any target dimension is obtained from the sample data that remains after strong influence point data has been removed; the strong influence point data is sample data whose degree of influence on the accuracy of the target model does not satisfy a preset condition;
    for any target model, if it is determined that the W data items to be detected do not satisfy the target model, determining each dimension contained in the target model to be an abnormal dimension;
    for any abnormal dimension, determining the anomaly probability with which the abnormal dimension is determined to be an abnormal dimension across the target models;
    determining, according to the anomaly probability, whether the data to be detected corresponding to the abnormal dimension is abnormal data.
  2. The method according to claim 1, characterized in that, after each dimension contained in the target model is determined to be an abnormal dimension, the method further comprises:
    for any abnormal dimension, obtaining the historical data corresponding to the abnormal dimension;
    determining the anomaly score of the data to be detected corresponding to the abnormal dimension by clustering that data together with the historical data;
    wherein determining, according to the anomaly probability, whether the data to be detected corresponding to the abnormal dimension is abnormal data comprises:
    determining, according to the anomaly probability and the anomaly score, whether the data to be detected corresponding to the abnormal dimension is abnormal data.
  3. The method according to claim 2, characterized in that determining the anomaly score of the data to be detected corresponding to the abnormal dimension by clustering that data together with the historical data comprises:
    constructing an isolation binary tree from the data to be detected corresponding to the abnormal dimension and the historical data;
    calculating the anomaly score of the data to be detected corresponding to the abnormal dimension in the isolation binary tree.
  4. The method according to claim 1, characterized in that the target model corresponding to any target dimension is determined as follows:
    obtaining initial n groups of sample data distributed over M dimensions, where each group of sample data has M dimensions;
    for a target dimension among the M dimensions, selecting from the M dimensions, according to the correlation coefficients of the initial n groups of sample data, K independent-variable dimensions that are correlated with the target dimension; the target dimension is any one of the M dimensions;
    determining the target model according to target n groups of sample data distributed over the target dimension and the K independent-variable dimensions; the target model characterizes the relationship satisfied between the target dimension and the K independent-variable dimensions.
  5. The method according to claim 4, characterized in that determining the target model according to the target n groups of sample data distributed over the target dimension and the K independent-variable dimensions comprises:
    obtaining the target n groups of sample data distributed over the target dimension and the K independent-variable dimensions;
    determining, according to the target n groups of sample data and the n-1 groups of sample data other than a candidate group of sample data, the degree of influence of the candidate group of sample data on the accuracy of the target model; the candidate group of sample data is any one group among the target n groups of sample data;
    according to the degrees of influence of the n candidate groups of sample data, removing w groups of strong influence point data from the target n groups of sample data, where 1 ≤ w < n;
    determining the target model according to the retained sample data.
  6. The method according to claim 5, characterized in that determining, according to the target n groups of sample data and the n-1 groups of sample data other than the candidate group of sample data, the degree of influence of the candidate group of sample data on the accuracy of the target model comprises:
    fitting the target n groups of sample data to obtain first fitting coefficients of a first fitted model;
    fitting the n-1 groups of sample data other than the candidate group of sample data to obtain second fitting coefficients of a second fitted model;
    determining the degree of influence according to the first fitting coefficients, the second fitting coefficients, the number of independent-variable dimensions contained in the target model, and the mean square error of the first fitted model.
  7. The method according to claim 5, characterized in that removing w groups of strong influence point data from the target n groups of sample data according to the degrees of influence of the n candidate groups of sample data comprises:
    for any candidate group of sample data, if the degree of influence of the candidate group of sample data on the accuracy of the target model is greater than the first quartile of the F distribution with (p, n-p-1) degrees of freedom, determining the candidate group of sample data to be strong influence point data, where p is the number of independent-variable dimensions contained in the target model;
    removing the strong influence point data from the target n groups of sample data.
  8. The method according to claim 5, characterized in that, after the target model is determined according to the retained sample data, the method further comprises:
    inputting test data into the target model for testing to obtain the mean absolute error rate of the target model;
    determining that the goodness-of-fit parameter with which the target model fits the retained sample data and the mean absolute error rate each satisfy a preset threshold.
  9. An abnormal data detection apparatus, characterized by comprising:
    a processing unit, configured to:
    input W data items to be detected, distributed over W dimensions, into the target models corresponding to the respective target dimensions; the target model corresponding to any target dimension is obtained from the sample data that remains after strong influence point data has been removed; the strong influence point data is sample data whose degree of influence on the accuracy of the target model does not satisfy a preset condition;
    for any target model, if it is determined that the W data items to be detected do not satisfy the target model, determine each dimension contained in the target model to be an abnormal dimension;
    for any abnormal dimension, determine the anomaly probability with which the abnormal dimension is determined to be an abnormal dimension across the target models;
    determine, according to the anomaly probability, whether the data to be detected corresponding to the abnormal dimension is abnormal data.
  10. A computing device, characterized by comprising:
    a memory, configured to store a computer program;
    a processor, configured to call the computer program stored in the memory and to execute, according to the obtained program, the method according to any one of claims 1 to 8.
  11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer-executable program, and the computer-executable program is used to cause a computer to execute the method according to any one of claims 1 to 8.
PCT/CN2022/121926 2022-08-18 2022-09-27 Anomalous data detection method and apparatus WO2024036709A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210992301.1A CN115357764A (en) 2022-08-18 2022-08-18 Abnormal data detection method and device
CN202210992301.1 2022-08-18

Publications (1)

Publication Number Publication Date
WO2024036709A1 true WO2024036709A1 (en) 2024-02-22

Family

ID=84003477

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/121926 WO2024036709A1 (en) 2022-08-18 2022-09-27 Anomalous data detection method and apparatus

Country Status (2)

Country Link
CN (1) CN115357764A (en)
WO (1) WO2024036709A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117786281A (en) * 2024-02-23 2024-03-29 中国海洋大学 Optimization calculation method for deposition rate and error of deposit columnar sample

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN117235548B (en) * 2023-11-15 2024-02-27 山东济宁运河煤矿有限责任公司 Coal quality data processing method and intelligent system based on laser firing
CN117648657B (en) * 2023-12-13 2024-05-14 青岛市建筑设计研究院集团股份有限公司 Urban planning multi-source data optimization processing method

Citations (5)

Publication number Priority date Publication date Assignee Title
CN109948669A (en) * 2019-03-04 2019-06-28 腾讯科技(深圳)有限公司 A kind of abnormal deviation data examination method and device
CN112733897A (en) * 2020-12-30 2021-04-30 胜斗士(上海)科技技术发展有限公司 Method and equipment for determining abnormal reason of multi-dimensional sample data
WO2021109314A1 (en) * 2019-12-06 2021-06-10 网宿科技股份有限公司 Method, system and device for detecting abnormal data
CN114297936A (en) * 2021-12-31 2022-04-08 深圳前海微众银行股份有限公司 Data anomaly detection method and device
CN114595210A (en) * 2020-11-20 2022-06-07 中国移动通信集团广东有限公司 Multi-dimensional data anomaly detection method and device and electronic equipment

Also Published As

Publication number Publication date
CN115357764A (en) 2022-11-18

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22955484; Country of ref document: EP; Kind code of ref document: A1)