CN105824785A - Rapid abnormal point detection method based on penalized regression - Google Patents

Rapid abnormal point detection method based on penalized regression

Info

Publication number
CN105824785A
Authority
CN
China
Prior art keywords
component
parameter vector
eta
point
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610141620.6A
Other languages
Chinese (zh)
Inventor
宋允全
张青华
渐令
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN201610141620.6A priority Critical patent/CN105824785A/en
Publication of CN105824785A publication Critical patent/CN105824785A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Abstract

The invention relates to a rapid abnormal point detection method based on penalized regression. The method first judges whether an endogenous explanatory variable exists in a linear regression model. When no endogenous explanatory variable exists, a penalized weighted least squares objective function in the standard deviations is constructed according to the variance pattern of the data points, the standard deviations are selected and estimated, and heteroscedastic variances are identified from the selection and estimation result so as to detect abnormal points. When an endogenous explanatory variable exists, a mean-shift model is constructed according to the mean pattern of the data points, a penalized fused generalized-moment objective function is constructed from the mean-shift model, the mean-shift parameters are selected and estimated, and abnormal points are detected from the estimated mean-shift parameters. The method does not require constructing test statistics or computing their distributions, avoids complicated operations such as maximum likelihood estimation, and reports the abnormal-point status of all data in a single step. It addresses the problem that conventional methods may fail under the masking and swamping effects when multiple abnormal points exist, shortens the detection run time, and improves data processing efficiency.

Description

Rapid abnormal point detecting method based on penalized regression
Technical field
The invention belongs to the field of data mining and machine learning and relates to methods for data mining and data processing; specifically, it relates to a rapid abnormal point detection method based on penalized regression.
Background art
Abnormal data are frequently encountered when data are analyzed and processed, and they are a very common problem in statistical data analysis. In theory, abnormal values are an important factor affecting the quality of statistical data, and they can seriously affect estimation, inference, and model selection. In applications, handling abnormal data is valuable in several fields. In network security, for example, abnormal data mining can be used to analyze anomalous behavior in a network; in finance, outlier mining can identify fraudulent credit card transactions, stock market manipulation, false accounting reports, loan fraud, and the like. Theoretical research on abnormal values has therefore been an active topic in recent years.
For the ordinary linear regression model, traditional abnormal point detection methods are based on the classical diagnostic statistics of the data detection model and the mean-shift model. They are simple and effective when there is only one abnormal point, and in some special cases they have also produced convincing empirical results. However, they have several shortcomings: (1) when there are multiple abnormal points, traditional methods test the data points one by one, so the amount of computation becomes very large when the number of data points is large; (2) when there are multiple abnormal points, the masking and swamping effects can cause traditional methods to fail in some cases; (3) the number of unknown parameters in the model exceeds the sample size, which makes parameter estimation and hypothesis testing complicated, or even makes the model unidentifiable; (4) most traditional methods require constructing a test statistic and computing its distribution function, which is difficult to obtain and in some cases cannot be obtained at all.
In the context of big data, improving the quality of statistical data requires an effective method for rejecting spurious data during processing, so that the false is eliminated and the true retained. Variable selection is one such class of methods. Variable selection is a technique for picking out the relevant features or variables from a large number of features or variables in order to build a robust model. Among the many variable selection methods, those based on penalization have received particular attention, such as Lasso, SCAD, elastic net, adaptive Lasso, and the Dantzig selector. These methods generally assume that the model is sparse and that the explanatory variables are exogenous. In high-dimensional regression models, however, some of the many explanatory variables are almost inevitably endogenous. The presence of endogenous explanatory variables makes ordinary penalized least squares inconsistent, which in turn leads to erroneous decisions.
Given the shortcomings of traditional abnormal point detection methods and the advantages of variable selection in data processing, there is a clear need for a penalization-based detection method that reports the abnormal-point status of all data at once without constructing a test statistic. Abnormal point detection based on penalization is a new research field of significant practical value, but a mature technical solution is still lacking. It is therefore necessary to provide a rapid abnormal point detection method that is practical both with and without endogenous explanatory variables, and that can process the large volumes of data in big data systems by approximation while maintaining the accuracy of the detection result.
Summary of the invention
In view of the deficiencies of existing traditional abnormal point detection methods, which must construct test statistics, can only test data points one by one, and are computationally expensive, the object of the invention is to provide a rapid abnormal point detection method based on penalized regression. The method combines high-dimensional data analysis with penalized regression to reduce the amount of computation and the run time, and thereby greatly improve the efficiency of abnormal point detection.
According to one embodiment of the invention, a rapid abnormal point detection method based on penalized regression is provided, comprising the following steps:
(1) A data acquisition tool is used to collect the data points to be tested, and a scatter plot of the data points is drawn. In the scatter plot, the 90%-95% of data points that lie near a common straight line are represented by the linear regression model Y = Xβ + ε, where Y is the vector of response variables, X is the matrix of explanatory variables, and ε is the random error satisfying E(ε) = 0. It is then judged whether the linear regression model Y = Xβ + ε contains an endogenous explanatory variable.
(2) When the linear regression model contains no endogenous explanatory variable, a sparse parameter vector γ = I - σ^{-1} is constructed according to the variance pattern of the collected data points. A weighted least squares loss function is constructed and combined with a penalty function on the components of γ to form a penalized weighted least squares objective function. This objective function is optimized with respect to γ, which selects and estimates γ. The variance components corresponding to the nonzero components of the estimate of γ are heteroscedastic variances, and the data points corresponding to the heteroscedastic variances are abnormal points; detection is thus completed by identifying the heteroscedastic variances. Because heteroscedastic variances are the exception, 90%-95% of the components of the variance vector σ^2 are identical and 5%-10% differ. Since the data to be tested are standardized, 90%-95% of the components of σ^2 equal 1 and only 5%-10% do not. Hence 90%-95% of the components of the standard deviation vector σ = (σ_1, …, σ_n)^T equal 1, and 90%-95% of the components of the sparse parameter vector γ = I - σ^{-1} equal 0, with only 5%-10% nonzero.
(3) When the linear regression model contains an endogenous explanatory variable, a mean-shift model Y = Xβ + η + ε is constructed according to the mean pattern of the collected data points, where the error term ε ~ N(0, σ^2 I) and the mean-shift parameter vector is η = (η_1, …, η_n)^T. A fused generalized-moment (FGMM) loss function is constructed from η and combined with a penalty function on the components of η to form a penalized FGMM objective function. This objective function is optimized with respect to η, which selects and estimates η. The data points corresponding to the nonzero components of the estimate of η are abnormal points, so detection is completed by inspecting the nonzero components of the estimate. If the i-th component η_i of η differs significantly from zero, the mean of the i-th data point has indeed shifted, so the data point (x_i, y_i) does not satisfy the assumed linear regression equation and the i-th point is an abnormal point. Because abnormal points are the exception, only 5%-10% of the data points are abnormal points; accordingly only 5%-10% of the components of η are nonzero and 90%-95% are zero, which means that η is sparse.
In the detection method according to an embodiment of the invention, the specific steps in step one for judging whether the linear regression model contains an endogenous explanatory variable are:
(1) given the explanatory variable X, compute the conditional expectation E(ε | X) from the linear regression model;
(2) judge whether the conditional expectation E(ε | X) is zero: if E(ε | X) = 0, the linear regression model contains no endogenous explanatory variable; if E(ε | X) ≠ 0, the linear regression model contains an endogenous explanatory variable.
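The error ε is not observed in practice, so the condition E(ε | X) = 0 cannot be evaluated directly from data. In a simulation where the errors are known, the sample moments (1/n) Σ_i ε_i x_ij give a quick diagnostic: values far from zero indicate that column j of X is correlated with the error. The Python sketch below illustrates only that idea; the function names and the cutoff `tol` are hypothetical and are not part of the method described above. A zero sample moment is a consequence of exogeneity, not a proof of it.

```python
import numpy as np

def sample_error_moments(eps, X):
    """Return (1/n) * sum_i eps_i * x_ij for every column j of X."""
    eps = np.asarray(eps, dtype=float)
    X = np.asarray(X, dtype=float)
    return X.T @ eps / len(eps)

def looks_endogenous(eps, X, tol=0.1):
    """Flag columns whose sample moment with the error exceeds tol.

    tol is a hypothetical cutoff chosen for illustration only.
    """
    return np.abs(sample_error_moments(eps, X)) > tol
```

Here `looks_endogenous` returns one Boolean per explanatory variable, so a model with any `True` entry would be routed to the mean-shift branch of the method.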
In the detection method according to an embodiment of the invention, the specific steps in step two for detecting abnormal points when no endogenous explanatory variable exists are:
(1) Define the standard deviation vector σ = (σ_1, …, σ_n)^T; 90%-95% of its components equal 1 and only 5%-10% do not.
(2) Let I = (1, …, 1)^T and σ^{-1} = (1/σ_1, …, 1/σ_n)^T. Using the transformation γ_i = 1 - 1/σ_i, construct the sparse parameter vector γ = I - σ^{-1}; 90%-95% of its components equal 0 and only 5%-10% do not.
(3) Construct the weighted least squares loss function, which is the first term of formula (1).
(4) Introduce the penalty function P_λ(|γ_j|) = P_λ(|1 - 1/σ_j|) on the components of the sparse parameter vector γ.
(5) Combine the weighted least squares loss function with the penalty function on the components of γ to construct the penalized weighted least squares objective function Q(β, σ; λ):
$$Q(\beta,\sigma;\lambda)=\frac{1}{2n}\sum_{i=1}^{n}\left(\frac{y_i-x_i^{T}\beta}{\sigma_i}\right)^{2}+\sum_{j=1}^{n}P_{\lambda}\left(\left|1-\frac{1}{\sigma_j}\right|\right) \tag{1}$$
where β is a nuisance parameter that is replaced by its weighted least squares estimate β̂, and λ is the tuning parameter.
(6) Apply the transformation from step (2) and introduce the notation
$$\hat{Y}^{*}=(y_1-x_1^{T}\hat\beta,\ \ldots,\ y_n-x_n^{T}\hat\beta)^{T},\quad \hat{X}_1^{*}=(y_1-x_1^{T}\hat\beta,\ 0,\ \ldots,\ 0)^{T},\ \ldots,\ \hat{X}_n^{*}=(0,\ \ldots,\ 0,\ y_n-x_n^{T}\hat\beta)^{T},$$
with X̂* = (X̂*_1, …, X̂*_n). The penalized weighted least squares objective function Q(β, σ; λ) then reduces to:
$$Q(\hat\beta,\gamma;\lambda)=\frac{1}{2n}\left\|\hat{Y}^{*}-\hat{X}^{*}\gamma\right\|^{2}+\sum_{j=1}^{n}P_{\lambda}(|\gamma_j|) \tag{2}$$
(7) Use the BIC information criterion to select the optimal tuning parameter λ in the penalized weighted least squares objective function Q(β, σ; λ).
(8) Use the KKT conditions to convert the optimization of the penalized weighted least squares objective function into a saddle-point system, and solve it with the conjugate gradient algorithm, thereby selecting and estimating the sparse parameter vector γ.
(9) From the dual relation σ_i = 1/(1 - γ_i) between σ_i and γ_i, obtain the selection and estimate of the standard deviation vector σ. The components of σ that correspond to nonzero components of the estimate of γ, that is, the components of the estimate of σ that differ from 1, are heteroscedastic. The data points corresponding to them are abnormal points, and identifying them completes the abnormal point detection.
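Because the matrix X̂* in formula (2) is diagonal, the objective separates into n one-dimensional problems, one per data point. The Python sketch below illustrates this under two simplifying assumptions that are not part of the method above: an L1 penalty P_λ(|γ|) = λ|γ| is swapped in for the penalty function, and ordinary least squares stands in for the weighted least squares estimate of β. Under those assumptions each component has the closed form γ_i = max(0, 1 - nλ / r_i^2), where r_i is the i-th residual, so no saddle-point solver is needed. The function name is hypothetical.

```python
import numpy as np

def detect_variance_outliers(X, y, lam):
    """Flag abnormal points via a separable, L1-penalized form of objective (2).

    Assumptions (illustrative only): L1 penalty instead of SCAD, OLS instead
    of the weighted least squares estimate of beta, and lam > 0.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)

    # Nuisance parameter beta replaced by a plug-in estimate.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta_hat  # residuals; these are the diagonal entries of X*

    # Minimize (r_i^2 / (2n)) * (1 - g)^2 + lam * |g| for each i:
    # the minimizer is 0 unless r_i^2 > n * lam, else 1 - n * lam / r_i^2.
    r2 = r ** 2
    gamma = np.zeros(n)
    active = r2 > n * lam
    gamma[active] = 1.0 - n * lam / r2[active]

    # Dual relation sigma_i = 1 / (1 - gamma_i); gamma_i < 1 by construction.
    sigma_hat = 1.0 / (1.0 - gamma)
    return gamma, sigma_hat, np.flatnonzero(gamma)
```

The returned index array lists the points whose estimated standard deviation differs from 1. In this simplified form the tuning parameter acts as a threshold on squared residuals, which is why its selection (for example by BIC, as in step (7)) matters.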
In the detection method according to an embodiment of the invention, the specific steps in step three for detecting abnormal points when an endogenous explanatory variable exists are:
(1) Introduce the mean-shift parameter vector η into the linear regression model of step one to construct the mean-shift model, which is expressed as:
$$Y = X\beta + \eta + \varepsilon \tag{3}$$
where the error term ε ~ N(0, σ^2 I) and the mean-shift parameter vector is η = (η_1, …, η_n)^T.
(2) Obtain the instrumental variable vector W and derive the corresponding conditional moment restriction from the mean-shift model:
$$E[\,g(Y, X\beta+\eta)\mid W\,]=0 \tag{4}$$
where g(·,·) is a known bivariate function, taken as g(t_1, t_2) = t_1 - t_2.
(3) Construct two different sets of transformations of the instrumental variable vector W using B-splines or Fourier series:
$$F=(f_1(W),\ldots,f_p(W))^{T} \tag{5}$$
$$H=(h_1(W),\ldots,h_p(W))^{T} \tag{6}$$
(4) From the moment restriction and the two sets of transformations of W, construct the over-identification conditions:
$$E[\,g(Y, X\beta+\eta)\,F\,]=0 \tag{7}$$
$$E[\,g(Y, X\beta+\eta)\,H\,]=0 \tag{8}$$
(5) Introduce the indicator function I(η_j ≠ 0) for each component of the mean-shift parameter vector η, and construct the FGMM loss function L_FGMM(η) from the over-identification conditions and these indicator functions:
$$L_{FGMM}(\eta)=\sum_{j=1}^{n} I(\eta_j\neq 0)\left\{\omega_{j1}\left[\frac{1}{n}\sum_{i=1}^{n} g(Y_i, X_i\beta+\eta)\,f_j(W_i)\right]^{2}+\omega_{j2}\left[\frac{1}{n}\sum_{i=1}^{n} g(Y_i, X_i\beta+\eta)\,h_j(W_i)\right]^{2}\right\} \tag{9}$$
where ω_{j1} and ω_{j2} are given weights.
For convenience of expression, let V_i(η) = (F_i(η)^T, H_i(η)^T)^T; the matrix form of the FGMM loss function L_FGMM(η) is then:
$$L_{FGMM}(\eta)=\left[\frac{1}{n}\sum_{i=1}^{n} g(Y_i, X_i\beta+\eta)\,V_i(\eta)\right]^{T} J(\eta)\left[\frac{1}{n}\sum_{i=1}^{n} g(Y_i, X_i\beta+\eta)\,V_i(\eta)\right] \tag{10}$$
where J(η) is the weight matrix of the quadratic form and (l_1, …, l_r) are the indices of the nonzero components of the mean-shift parameter vector η.
(6) Introduce the penalty function P_λ(|η_j|) on the components of the mean-shift parameter vector η.
(7) From the FGMM loss function L_FGMM(η) and the penalty function P_λ(|η_j|) on each component of η, construct the penalized FGMM objective function Q_FGMM(η):
$$Q_{FGMM}(\eta)=L_{FGMM}(\eta)+\sum_{j=1}^{n}P_{\lambda}(|\eta_j|) \tag{11}$$
where P_λ(·) is the penalty function and λ is the tuning parameter.
(8) Use the BIC information criterion to select the optimal tuning parameter λ in the penalized FGMM objective function Q_FGMM(η).
(9) Introduce a smoothing kernel function constructed from F(t), a twice-differentiable cumulative distribution function.
(10) As h_n → 0+, the smoothing kernel function converges to the indicator function I(η_j ≠ 0). The indicator function in the FGMM loss function L_FGMM(η) is therefore replaced by the smoothing kernel function, which gives the smooth FGMM loss function L_K. Combining L_K with the penalty function on the mean-shift parameter η gives the smooth penalized FGMM objective function Q_K:
$$Q_{K}(\eta)=L_{K}(\eta)+\sum_{j=1}^{n}P_{\lambda}(|\eta_j|) \tag{12}$$
(11) Use the iterative coordinate descent method to optimize the smooth penalized FGMM objective function Q_K, thereby selecting and estimating the mean-shift parameter vector η. The data points corresponding to the nonzero components of the estimate of η are abnormal points; inspecting the nonzero components of the estimate completes the abnormal point detection.
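The smoothing in steps (9) and (10) replaces a discontinuous indicator with a differentiable surrogate so that gradient-based or coordinate-wise optimizers can be applied. The Python sketch below shows one common construction from the FGMM literature, K(η_j^2 / h_n) = 2F(η_j^2 / h_n) - 1 with F the logistic distribution function. Both the form of K and the choice of F are assumptions made here for illustration; the text above only requires F to be a twice-differentiable cumulative distribution function.

```python
import numpy as np

def logistic_cdf(t):
    """Logistic distribution function, one possible choice of F (assumption)."""
    return 1.0 / (1.0 + np.exp(-t))

def smooth_indicator(eta, h):
    """Smooth surrogate for I(eta != 0): 2 * F(eta^2 / h) - 1.

    The argument eta^2 / h is nonnegative, so the value lies in [0, 1).
    As h -> 0+ it tends to 1 for eta != 0 and stays 0 for eta == 0.
    """
    t = np.asarray(eta, dtype=float) ** 2 / h
    return 2.0 * logistic_cdf(t) - 1.0
```

A smaller bandwidth `h` gives a sharper approximation to the indicator but a steeper, harder-to-optimize surface, so in practice `h` would be decreased gradually.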
The rapid abnormal point detection method based on penalized regression proposed by the invention first judges whether the linear regression model contains an endogenous explanatory variable. When no endogenous explanatory variable exists, a penalized weighted least squares objective function in the standard deviations is built according to the variance pattern of the data points, the standard deviations are selected and estimated, and heteroscedastic variances are identified from the result so as to detect abnormal points. When an endogenous explanatory variable exists, a mean-shift model is constructed according to the mean pattern of the data points, a penalized FGMM objective function is built from the mean-shift model, the mean-shift parameters are selected and estimated, and abnormal points are detected from the estimated mean-shift parameters. The method does not need to construct test statistics or derive their distributions, avoids complicated computations such as maximum likelihood estimation, and reports the abnormal-point status of all data in one step. It is applicable to both low-dimensional and high-dimensional data, which widens its range of application. Compared with the prior art, the method addresses the problem that traditional methods may fail under the masking and swamping effects when there are multiple abnormal points, saves detection run time, and improves data processing efficiency. The objective functions can also be optimized conveniently with existing optimization algorithms and corresponding software, so the method is simple to carry out and easy to operate.
Brief description of the drawings
Fig. 1 is a schematic diagram of the rapid abnormal point detection method based on penalized regression of the invention.
Fig. 2 is a flow chart of the rapid abnormal point detection method based on penalized regression when no endogenous explanatory variable exists.
Fig. 3 is a flow chart of the rapid abnormal point detection method based on penalized regression when an endogenous explanatory variable exists.
Fig. 4 shows the abnormal point detection results of the method of the invention and of the traditional method when the proportion of abnormal points is 5%.
Fig. 5 shows the abnormal point detection results of the method of the invention and of the traditional method when the proportion of abnormal points is 10%.
Detailed description of the invention
To make the purpose, technical solution, and advantages of the invention clearer, embodiments of the invention are described further below with reference to the drawings.
Fig. 1 is a schematic diagram of a rapid abnormal point detection method based on penalized regression according to an embodiment of the invention. The detection method comprises the following steps:
(1) A data acquisition tool is used to collect the data points to be tested, and a scatter plot of the data points is drawn. In the scatter plot, the 90%-95% of data points that lie near a common straight line are represented by the linear regression model Y = Xβ + ε, where Y is the vector of response variables, X is the matrix of explanatory variables, and ε is the random error satisfying E(ε) = 0. It is then judged whether the linear regression model Y = Xβ + ε contains an endogenous explanatory variable.
(2) When the linear regression model contains no endogenous explanatory variable, a sparse parameter vector γ = I - σ^{-1} is constructed according to the variance pattern of the collected data points. Because heteroscedastic variances are the exception, 90%-95% of the components of the variance vector σ^2 are identical and 5%-10% differ. Since the data to be tested are standardized, 90%-95% of the components of σ^2 equal 1 and only 5%-10% do not. Hence 90%-95% of the components of the standard deviation vector σ = (σ_1, …, σ_n)^T equal 1, and 90%-95% of the components of γ = I - σ^{-1} equal 0, with only 5%-10% nonzero. A weighted least squares loss function is constructed and combined with a penalty function on the components of γ to form a penalized weighted least squares objective function. This objective function is optimized with respect to γ, which selects and estimates γ. The variance components corresponding to the nonzero components of the estimate of γ are heteroscedastic variances, and the data points corresponding to the heteroscedastic variances are abnormal points; detection is completed by identifying the heteroscedastic variances.
(3) When the linear regression model contains an endogenous explanatory variable, a mean-shift model Y = Xβ + η + ε is constructed according to the mean pattern of the collected data points, where the error term ε ~ N(0, σ^2 I) and the mean-shift parameter vector is η = (η_1, …, η_n)^T. If the i-th component η_i of η differs significantly from zero, the mean of the i-th data point has indeed shifted, so the data point (x_i, y_i) does not satisfy the assumed linear regression equation and the i-th point is an abnormal point. Because abnormal points are the exception, only 5%-10% of the data points are abnormal points; accordingly only 5%-10% of the components of η are nonzero and 90%-95% are zero, which means that η is sparse. A fused generalized-moment (FGMM) loss function is constructed from η and combined with a penalty function on the components of η to form a penalized FGMM objective function. This objective function is optimized with respect to η, which selects and estimates η. The data points corresponding to the nonzero components of the estimate of η are abnormal points, so detection is completed by inspecting the nonzero components of the estimate.
Embodiment one: as shown in Fig. 2, a rapid abnormal point detection method based on penalized regression comprises the following steps.
Step one: a data acquisition tool, such as a data collector, is used to generate n = 100 data points to be tested. The data points are generated as follows. Let p_0 be the proportion of abnormal points among the data points to be tested. To obtain 100p_0 abnormal points, 100p_0 components are drawn at random from an initial vector, and each is multiplied by a random standard deviation factor ω ~ Unif([1.5, 3.5]). These 100p_0 components and the remaining n - 100p_0 components form the parameter vector σ = (σ_1, …, σ_n), which in turn gives the covariance matrix Σ of the regression error ε.
After the data points are obtained, a scatter plot of the data points is drawn. In the scatter plot, the 90%-95% of data points that lie near a common straight line are represented by the linear regression model Y = Xβ + ε, where ε ~ N(0, Σ). Whether the linear regression model for these 100 generated data points contains an endogenous explanatory variable is judged as follows:
(1) Given the explanatory variable X, the conditional expectation E(ε | X) is computed from the linear regression model. The explanatory variable X is generated with ρ = 0.5 from a matrix U = (U_ij)_{100×10}.
(2) From the data generation process, the explanatory variable X and the regression error ε are independent of each other; therefore the conditional expectation E(ε | X) = 0, and the linear regression model contains no endogenous explanatory variable.
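The data generation of step one can be sketched in Python as follows. Several details are assumptions introduced here because they are not recoverable from the text: the initial vector is taken to be all ones, the covariance matrix of ε is taken to be diagonal with entries σ_i^2, X is drawn as independent standard normal columns rather than with the ρ = 0.5 construction, and the coefficient vector `beta` is hypothetical. The function name is also hypothetical.

```python
import numpy as np

def simulate_embodiment_one(n=100, p=10, p0=0.05, seed=0):
    """Generate regression data with a fraction p0 of inflated error variances."""
    rng = np.random.default_rng(seed)
    k = int(round(n * p0))  # number of abnormal points, 100 * p0

    # Assumption: start from a vector of ones and inflate k random components.
    sigma = np.ones(n)
    idx = rng.choice(n, size=k, replace=False)
    sigma[idx] *= rng.uniform(1.5, 3.5, size=k)

    # Assumption: independent standard normal design and a hypothetical beta.
    X = rng.standard_normal((n, p))
    beta = np.ones(p)

    # Errors are drawn independently of X, so E(eps | X) = 0 by construction.
    eps = rng.standard_normal(n) * sigma
    y = X @ beta + eps
    return X, y, sigma, np.sort(idx)
```

The returned index array records which points were given inflated variances, which makes it possible to compare a detector's output with the ground truth.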
Step two: detect abnormal points, with the following specific steps:
(1) Define the standard deviation vector σ = (σ_1, …, σ_n)^T; 90%-95% of its components equal 1 and only 5%-10% do not.
(2) Let I = (1, …, 1)^T and σ^{-1} = (1/σ_1, …, 1/σ_n)^T. Using the transformation γ_i = 1 - 1/σ_i, construct the sparse parameter vector γ = I - σ^{-1}; 90%-95% of its components equal 0 and only 5%-10% do not.
(3) Construct the weighted least squares loss function, which is the first term of formula (1).
(4) Introduce the penalty function P_λ(|γ_j|) = P_λ(|1 - 1/σ_j|) on the components of the sparse parameter vector γ. Many choices of the penalty function P_λ(·) are available. Because the SCAD penalty satisfies the oracle property for variable selection, the SCAD penalty is used in this embodiment, with the parameter a set to 3.7 in practice on the basis of a Bayesian argument and practical experience.
(5) Combine the weighted least squares loss function with the penalty function on the components of γ to construct the penalized weighted least squares objective function Q(β, σ; λ):
$$Q(\beta,\sigma;\lambda)=\frac{1}{2n}\sum_{i=1}^{n}\left(\frac{y_i-x_i^{T}\beta}{\sigma_i}\right)^{2}+\sum_{j=1}^{n}P_{\lambda}\left(\left|1-\frac{1}{\sigma_j}\right|\right) \tag{1}$$
where β is a nuisance parameter that is replaced by its weighted least squares estimate β̂, σ is the standard deviation vector, and λ is the tuning parameter in the penalty function.
(6) Apply the transformation from step (2) and introduce the notation
$$\hat{Y}^{*}=(y_1-x_1^{T}\hat\beta,\ \ldots,\ y_n-x_n^{T}\hat\beta)^{T},\quad \hat{X}_1^{*}=(y_1-x_1^{T}\hat\beta,\ 0,\ \ldots,\ 0)^{T},\ \ldots,\ \hat{X}_n^{*}=(0,\ \ldots,\ 0,\ y_n-x_n^{T}\hat\beta)^{T},$$
with X̂* = (X̂*_1, …, X̂*_n). The penalized weighted least squares objective function Q(β, σ; λ) then reduces to:
$$Q(\hat\beta,\gamma;\lambda)=\frac{1}{2n}\left\|\hat{Y}^{*}-\hat{X}^{*}\gamma\right\|^{2}+\sum_{j=1}^{n}P_{\lambda}(|\gamma_j|) \tag{2}$$
(7) Use the BIC information criterion to select the optimal tuning parameter λ in the penalized weighted least squares objective function Q(β, σ; λ).
(8) Use the KKT conditions to convert the optimization of the penalized weighted least squares objective function into a saddle-point system, and solve it with the conjugate gradient algorithm, thereby selecting and estimating the sparse parameter vector γ.
(9) From the dual relation σ_i = 1/(1 - γ_i) between σ_i and γ_i, obtain the selection and estimate of the standard deviation vector σ. The components of σ that correspond to nonzero components of the estimate of γ, that is, the components of the estimate of σ that differ from 1, are heteroscedastic. The data points corresponding to them are abnormal points, and identifying them completes the abnormal point detection.
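For reference, the SCAD penalty named in step (4) is commonly written in the piecewise form introduced by Fan and Li: linear near zero, quadratic in a transition region, and constant beyond aλ. The Python sketch below assumes that standard form, with the default a = 3.7 taken from the text above; the function name is hypothetical.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """Standard SCAD penalty P_lam(|theta|), assuming the usual piecewise form.

    |theta| <= lam          : lam * |theta|
    lam < |theta| <= a*lam  : (2*a*lam*|theta| - theta^2 - lam^2) / (2*(a - 1))
    |theta| > a*lam         : lam^2 * (a + 1) / 2
    """
    t = np.abs(np.asarray(theta, dtype=float))
    small = lam * t
    mid = (2.0 * a * lam * t - t ** 2 - lam ** 2) / (2.0 * (a - 1.0))
    large = lam ** 2 * (a + 1.0) / 2.0
    return np.where(t <= lam, small, np.where(t <= a * lam, mid, large))
```

Because the penalty is flat beyond aλ, large components of γ are not shrunk further, which is the behavior associated with the oracle property mentioned in step (4).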
Embodiment two: as it is shown on figure 3, a kind of rapid abnormal point detecting method based on penalized regression, this detection method contains following steps:
Step one: utilizing data acquisition unit to obtain 100 data to be tested points, the concrete acquisition mode of data to be tested point is: utilize Fourier basic function as corresponding instrumental variable:WithObtain 200 explanatory variables, wherein require first five explanatory variable (X1,X2,X3,X4,X5) it is important.Then explanatory variable is divided into two classes: endogenous explanatory variable and external explanatory variable.If XjThen it is designated as endogenous explanatory variableIf XjThen it is designated as external explanatory variableAssumeWithMeet following two formulas respectivelyWithWherein { ε, u1,…,upIn }, each variable is N (0,1), average drifting parameter vector η=(η1,…,η56,…,η1516,…,η100)=(0 ..., 0,10 ..., 10,0 ..., 0), F=(F1,…,Fp)TWith H=(H1,…,HP)TIt is three dimensional tool variable W=(W1,W2,W3)T~N3(0,I3) a conversion.Obtain 52 endogenous explanatory variable (X in a manner described1,X2,X3,X6,…,X52), so (X in important explanatory variable1,X2,X3) it is endogenous explanatory variable, and (X4,X5) it is external explanatory variable.After obtaining data to be tested point, drawing the scatterplot of data to be tested point, in scatterplot, the data point of 90%-95% data point linear regression model (LRM) Y=X β+ε near same straight line represents, in figure parameters vector, component meets β=(β1,…,β5)=(5 ,-4,7 ,-2,1.5), βj=0,6≤j≤200.Judging whether 100 random samples generated in linear regression model (LRM) exist endogenous explanatory variable, it concretely comprises the following steps:
(1) Given the explanatory variables X, compute the conditional expectation E(ε | X) from the linear regression model;
(2) From the way the data were obtained, the explanatory variables X and the regression error are known not to be independent; therefore the conditional expectation E(ε | X) ≠ 0, and endogenous explanatory variables exist in the linear regression model.
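As a rough illustration of the data described in step one, the following Python sketch simulates the mean-shift structure Y = Xβ + η + ε with the stated β and η. It is a simplification under stated assumptions: all 200 explanatory variables are drawn as independent N(0, 1) noise, so the instrument-based construction of the endogenous variables is not reproduced, and the seed and variable names are arbitrary.

```python
import random

random.seed(0)           # arbitrary seed, only for reproducibility
n, p = 100, 200          # 100 data points, 200 explanatory variables

# Coefficients from the embodiment: five important variables, the rest zero.
beta = [5.0, -4.0, 7.0, -2.0, 1.5] + [0.0] * (p - 5)

# Mean-shift vector: observations 6..15 (1-based) are shifted by 10.
eta = [10.0 if 5 <= i < 15 else 0.0 for i in range(n)]

# Simplifying assumption: every column of X is exogenous N(0, 1) noise.
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
eps = [random.gauss(0.0, 1.0) for _ in range(n)]

# Mean-shift model: Y = X beta + eta + eps.
Y = [sum(x * b for x, b in zip(row, beta)) + eta[i] + eps[i]
     for i, row in enumerate(X)]

# 0-based indices of the true abnormal points.
true_abnormal = [i for i, e in enumerate(eta) if e != 0.0]
```

With this η, 10 of the 100 points are abnormal, which matches the 5%-10% contamination range discussed in the text.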
Step two: Detect the abnormal points. The specific steps are:
(1) Introduce the mean-shift parameter vector η = (η_1, …, η_5, η_6, …, η_15, η_16, …, η_100) into the linear regression model of step one to construct the mean-shift model, which is expressed as:
Y = Xβ + η + ε (3).
(2) Obtain the instrumental variable vector W, and derive the corresponding conditional moment restriction from the mean-shift model:
E[g(Y, Xβ + η) | W] = 0 (4)
where g(·,·) is a known bivariate function; this embodiment takes g(t_1, t_2) = t_1 - t_2.
(3) Construct two different sets of transformations of the instrumental variable vector W using B-splines or Fourier series:
F = (f_1(W), …, f_p(W))^T (5)
H = (h_1(W), …, h_p(W))^T (6).
(4) Construct the over-identification conditions from the conditional moment restriction and the two sets of transformations F and H of the instrumental variable vector W:
E[g(Y, Xβ + η) F] = 0 (7)
E[g(Y, Xβ + η) H] = 0 (8).
(5) Introduce the indicator function I(η_j ≠ 0) of each component of the mean-shift parameter vector η, and construct the fused generalized-moment (FGMM) loss function L_FGMM(η) from the over-identification conditions and these indicator functions:
$$L_{FGMM}(\eta) = \sum_{j=1}^{n} I(\eta_j \neq 0)\left\{ \omega_{j1}\left[\frac{1}{n}\sum_{i=1}^{n} g(Y_i, X_i\beta+\eta)\, f_j(W_i)\right]^2 + \omega_{j2}\left[\frac{1}{n}\sum_{i=1}^{n} g(Y_i, X_i\beta+\eta)\, h_j(W_i)\right]^2 \right\} \qquad (9)$$
where ω_j1 and ω_j2 are given weights.
For convenience of notation, let V_i(η) = (F_i(η)^T, H_i(η)^T)^T; the matrix form of the FGMM loss function L_FGMM(η) is then:
$$L_{FGMM}(\eta) = \left[\frac{1}{n}\sum_{i=1}^{n} g(Y_i, X_i\beta+\eta)\, V_i(\eta)\right]^{T} J(\eta) \left[\frac{1}{n}\sum_{i=1}^{n} g(Y_i, X_i\beta+\eta)\, V_i(\eta)\right] \qquad (10)$$
where […], and (l_1, …, l_r) are the indices of the non-zero components of the mean-shift parameter vector η.
(6) Introduce the penalty function P_λ(|η_j|) of the components of the mean-shift parameter vector η. Many choices of P_λ(·) are available; because the SCAD penalty has the oracle property for variable selection, this embodiment uses the SCAD penalty, whose expression is: […]
where, from a Bayesian viewpoint combined with practical experience, the parameter a is taken as 3.7 in practice.
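The SCAD penalty of step (6) can be sketched as follows. The piecewise form below is the standard definition from the penalized-regression literature (Fan and Li), taken here as an assumption about the expression intended in this embodiment; the function name `scad_penalty` and the sample arguments are hypothetical.

```python
def scad_penalty(t, lam, a=3.7):
    """SCAD penalty P_lambda(|t|) for a scalar t, with lam > 0 and a > 2."""
    t = abs(t)
    if t <= lam:
        # Linear part: identical to the L1 penalty for small values.
        return lam * t
    if t <= a * lam:
        # Quadratic transition between the linear and the constant part.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Constant part: large values receive no additional penalty.
    return lam * lam * (a + 1) / 2


penalties = [scad_penalty(t, lam=1.0) for t in (0.0, 0.5, 2.0, 10.0)]
```

The penalty is continuous at both breakpoints, |t| = λ and |t| = aλ, and its flat tail is what keeps large mean shifts from being shrunk toward zero.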
(7) Construct the penalized FGMM objective function Q_FGMM(η) from the FGMM loss function L_FGMM(η) and the penalty function P_λ(|η_j|) of the components of the mean-shift parameter vector η:
$$Q_{FGMM}(\eta) = L_{FGMM}(\eta) + \sum_{j=1}^{n} P_{\lambda}(|\eta_j|) \qquad (11)$$
where P_λ(·) is the penalty function and λ is the tuning parameter.
(8) Use the BIC information criterion to select the optimal tuning parameter λ in the penalized FGMM objective function Q_FGMM(η).
(9) Let […] denote a smoothing kernel function, where F(t) is the logistic cumulative distribution function, whose expression is F(t) = exp(t)/(1 + exp(t)).
(10) When the smoothing technique is used to approximate the indicator function, the smoothing parameter h is set to 0.1. For every component of the mean-shift parameter, the smoothing kernel function […] then replaces the indicator function I(η_j ≠ 0) in the FGMM loss function L_FGMM(η), which gives the smoothed FGMM loss function L_K. Combining L_K with the penalty function on the mean-shift parameter η gives the smoothed penalized FGMM objective function Q_K:
$$Q_{K}(\eta) = L_{K}(\eta) + \sum_{j=1}^{n} P_{\lambda}(|\eta_j|) \qquad (12).$$
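The smoothing of the indicator function in steps (9) and (10) can be illustrated with a small sketch. It assumes the kernel K(t) = 2F(t) - 1 evaluated at t = η_j^2 / h, with F the logistic cumulative distribution function; this form is common in the FGMM literature but is an assumption here, since the kernel's expression is not given in the text. The function names are hypothetical; h = 0.1 is the value stated in the embodiment.

```python
import math


def logistic_cdf(t):
    """Logistic CDF F(t) = 1 / (1 + exp(-t)); called here only with t >= 0."""
    return 1.0 / (1.0 + math.exp(-t))


def smooth_indicator(eta_j, h=0.1):
    """Smooth surrogate for I(eta_j != 0).

    Assumed form: K(eta_j**2 / h) with K(t) = 2 * F(t) - 1, so that
    K(0) = 0 and K(t) tends to 1 as t grows.
    """
    return 2.0 * logistic_cdf(eta_j ** 2 / h) - 1.0


approx = [smooth_indicator(e) for e in (0.0, 0.05, 1.0, 10.0)]
```

The surrogate is exactly 0 at η_j = 0 and is numerically indistinguishable from 1 for a shift of size 10, so a differentiable loss can stand in for the discontinuous indicator.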
(11) Use the iterative coordinate descent method to optimize the smoothed penalized FGMM objective function Q_K, thereby selecting and estimating the mean-shift parameter vector η. The data points to be tested that correspond to the non-zero components of the estimate η̂ are abnormal points; checking the non-zero components of η̂ completes the detection of abnormal points. The specific results are […] and […].
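Steps (3) to (5) can be illustrated by evaluating the loss (9) directly. The sketch below is not the patent's implementation: it assumes a scalar instrument, uses sine and cosine terms as hypothetical choices for f_j and h_j, sets all weights ω_j1 = ω_j2 = 1 by default, reads η inside g as the i-th component η_i, and gives F and H one column per component of η so that the sum over j in (9) is defined.

```python
import math
import random


def fourier_features(w, n_terms, kind):
    """Hypothetical transformations of a scalar instrument w:
    sin(j*pi*w) for kind == "sin", otherwise cos(j*pi*w), j = 1..n_terms."""
    fn = math.sin if kind == "sin" else math.cos
    return [fn(j * math.pi * w) for j in range(1, n_terms + 1)]


def fgmm_loss(eta, Y, X, beta, F, H, w1=None, w2=None):
    """Evaluate loss (9) with g(t1, t2) = t1 - t2 and the exact indicator."""
    n = len(Y)
    w1 = w1 or [1.0] * n
    w2 = w2 or [1.0] * n
    # Residuals g(Y_i, X_i beta + eta_i).
    resid = [Y[i] - sum(x * b for x, b in zip(X[i], beta)) - eta[i]
             for i in range(n)]
    loss = 0.0
    for j in range(n):
        if eta[j] == 0.0:
            continue  # indicator I(eta_j != 0) switches this term off
        m_f = sum(resid[i] * F[i][j] for i in range(n)) / n  # moment with f_j
        m_h = sum(resid[i] * H[i][j] for i in range(n)) / n  # moment with h_j
        loss += w1[j] * m_f ** 2 + w2[j] * m_h ** 2
    return loss


random.seed(1)
n = 8
W = [random.gauss(0.0, 1.0) for _ in range(n)]
X = [[random.gauss(0.0, 1.0) for _ in range(2)] for _ in range(n)]
beta = [1.0, -2.0]
eta_true = [0.0] * n
eta_true[2] = 10.0
# Noise-free toy response, so the residuals vanish at the true eta.
Y = [sum(x * b for x, b in zip(X[i], beta)) + eta_true[i] for i in range(n)]
F = [fourier_features(w, n, "sin") for w in W]
H = [fourier_features(w, n, "cos") for w in W]

loss_true = fgmm_loss(eta_true, Y, X, beta, F, H)
eta_wrong = [0.0] * n
eta_wrong[5] = 10.0   # shift assigned to the wrong observation
loss_wrong = fgmm_loss(eta_wrong, Y, X, beta, F, H)
```

In this toy setting the loss is essentially zero at the true mean-shift vector and positive when the shift is placed on the wrong observation, which is the behaviour the optimization in step (11) relies on.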
For comparison: when endogenous explanatory variables exist, both the traditional abnormal point detection methods and the methods based on penalized least squares essentially fail. This embodiment uses the penalized least-squares method as an illustration; the estimate of the mean-shift parameter that it yields is […].
The results of this embodiment show clearly that, when endogenous explanatory variables exist, the usual penalized least-squares method is no longer consistent, so abnormal point detection methods based on penalized least squares, as well as traditional methods that construct test statistics, are no longer valid. By contrast, the penalized FGMM method, which combines the mean rule of the data to be tested with a penalty on the mean-shift parameter, successfully identifies all of the abnormal points. The abnormal point detection method based on penalized FGMM estimation proposed by the invention is therefore a significant improvement and has a wider range of application than existing abnormal point detection methods.
For the two cases in which abnormal points account for 0.05 and 0.10 of all data, the following traditional abnormal point detection methods are run: the residual […], the externally studentized residual (r_i), the F test (F_i), the likelihood ratio test (LR_i), the t test (t_i), and the Score test (SC_i). To compare the method proposed by the invention with the traditional methods in outlier detection, the following three criteria are considered: the mean masking probability (M), i.e., the proportion of true abnormal points that are detected as normal points; the mean swamping probability (S), i.e., the proportion of normal points that are identified as abnormal points; and the joint detection rate (JD), i.e., the proportion of simulations with zero masking. Ideally, M ≈ 0, S ≈ 0 and JD ≈ 1.
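The three criteria can be computed from simulation output as in the sketch below. It assumes the usual reading of the criteria in the masking and swamping literature: M is averaged over runs as the share of true abnormal points that were missed, S as the share of normal points that were flagged, and JD is the share of runs with no missed abnormal point. The function name and the input layout are hypothetical.

```python
def detection_metrics(truth, detected):
    """Compute (M, S, JD) over simulation runs.

    truth, detected: lists of runs; each run is a list of booleans in which
    True marks an abnormal point. Every run is assumed to contain at least
    one abnormal point and at least one normal point.
    """
    m_rates, s_rates, zero_masking = [], [], []
    for t_run, d_run in zip(truth, detected):
        missed = sum(1 for t, d in zip(t_run, d_run) if t and not d)
        false_alarms = sum(1 for t, d in zip(t_run, d_run) if d and not t)
        n_abnormal = sum(t_run)
        n_normal = len(t_run) - n_abnormal
        m_rates.append(missed / n_abnormal)      # masking in this run
        s_rates.append(false_alarms / n_normal)  # swamping in this run
        zero_masking.append(missed == 0)         # run with zero masking
    runs = len(m_rates)
    return (sum(m_rates) / runs,
            sum(s_rates) / runs,
            sum(zero_masking) / runs)


truth = [[True, True, False, False, False]] * 2
detected = [[True, True, False, False, False],   # perfect run
            [True, False, True, False, False]]   # one miss, one false alarm
M, S, JD = detection_metrics(truth, detected)
```

In this two-run example one abnormal point is missed and one normal point is flagged in the second run, so M = 0.25, S = 1/6 and JD = 0.5.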
Under these three criteria, Fig. 4 and Fig. 5 give the results of the method proposed by the invention (HTOD) and of the six traditional methods. The results in these figures show clearly that the traditional methods must construct test statistics and derive their distributions, and can report abnormal data points only in a stepwise manner, so they run longer and are inefficient. More importantly, when multiple abnormal points exist, the detection accuracy of the traditional methods is very low under both masking and swamping. The method proposed by the invention does not need to construct test statistics or derive their distributions, and it reports the abnormal status of all data points to be tested in a single step, which greatly reduces the running time. More importantly, it is essentially unaffected by masking and swamping when multiple abnormal points exist, so the precision of abnormal point detection is greatly improved.
The above embodiments are intended to explain the invention, not to limit it. Any modification or change made to the invention within the spirit of the invention and the scope of protection of the claims falls within the scope of protection of the invention.

Claims (4)

1. A rapid abnormal point detection method based on penalized regression, characterized in that it comprises the following steps:
(1) Use a data acquisition tool to collect the data points to be tested […]; draw a scatter plot of the data points to be tested; represent the 90%-95% of data points in the scatter plot that lie near a common straight line by the linear regression model Y = Xβ + ε, where Y is the vector of response variables, X is the matrix of explanatory variables, and ε is the random error, which satisfies E(ε) = 0 […]; and judge whether endogenous explanatory variables exist in the linear regression model Y = Xβ + ε;
(2) When no endogenous explanatory variable exists in the linear regression model: according to the variance rule of the collected data points, construct the sparse parameter vector γ = I - σ^{-1}; construct the weighted least-squares loss function; construct the penalized weighted least-squares objective function by combining the weighted least-squares loss function with the penalty function of the components of the sparse parameter vector γ; optimize the penalized weighted least-squares objective function with respect to γ to select and estimate γ; the variance components that correspond to the non-zero components of the estimate of γ are singular variances, the data points to be tested that correspond to singular variances are abnormal points, and checking for singular variances completes the detection of abnormal points;
(3) When endogenous explanatory variables exist in the linear regression model: according to the mean rule of the collected data points, construct the mean-shift model y = Xβ + η + ε, where the error term ε ~ N(0, σ^2 I) and the mean-shift parameter vector η = (η_1, …, η_n)^T; construct the fused generalized-moment (FGMM) loss function from the mean-shift parameter vector η; construct the penalized FGMM objective function by combining the FGMM loss function with the penalty function of the components of η; optimize the penalized FGMM objective function with respect to η to select and estimate η; the data points to be tested that correspond to the non-zero components of the estimate η̂ are abnormal points, and checking the non-zero components of η̂ completes the detection of abnormal points.
2. The rapid abnormal point detection method based on penalized regression according to claim 1, characterized in that, in step (1), the specific steps for judging whether endogenous explanatory variables exist in the linear regression model are:
(1) Given the explanatory variables X, compute the conditional expectation E(ε | X) from the linear regression model;
(2) Judge whether the conditional expectation E(ε | X) is zero: if E(ε | X) is zero, no endogenous explanatory variable exists in the linear regression model; if E(ε | X) is not zero, endogenous explanatory variables exist in the linear regression model.
3. The rapid abnormal point detection method based on penalized regression according to claim 1, characterized in that, in step (2), when no endogenous explanatory variable exists, the specific steps for detecting abnormal points are:
(1) Define the standard deviation vector σ = (σ_1, …, σ_n)^T, in which 90%-95% of the components are equal to 1 and only 5%-10% of the components are not equal to 1;
(2) Let I = (1, …, 1)^T and σ^{-1} = (1/σ_1, …, 1/σ_n)^T; use the transformation γ_i = 1 - 1/σ_i, i = 1, …, n, to construct the sparse parameter vector γ = I - σ^{-1}, in which 90%-95% of the components are equal to 0 and only 5%-10% of the components are not equal to 0;
(3) Construct the weighted least-squares loss function (1/(2n)) Σ_{i=1}^{n} ((y_i - x_i^T β)/σ_i)^2;
(4) Introduce the penalty function P_λ(|γ_j|) = P_λ(|1 - 1/σ_j|) of the components of the sparse parameter vector γ;
(5) Combine the weighted least-squares loss function with the penalty function of the components of the sparse parameter vector γ to construct the penalized weighted least-squares objective function Q(β, σ; λ):
$$Q(\beta, \sigma; \lambda) = \frac{1}{2n}\sum_{i=1}^{n}\left(\frac{y_i - x_i^{T}\beta}{\sigma_i}\right)^2 + \sum_{j=1}^{n} P_{\lambda}\left(\left|1 - \frac{1}{\sigma_j}\right|\right) \qquad (1)$$
where β is a nuisance parameter, which is replaced by its weighted least-squares estimate β̂, and λ denotes the tuning parameter;
(6) Introduce the transformation γ_i = 1 - 1/σ_i, i = 1, …, n, γ = (γ_1, …, γ_n)^T, and introduce the notation: […]
The penalized weighted least-squares objective function Q(β, σ; λ) then simplifies to:
$$Q(\hat{\beta}, \gamma; \lambda) = \frac{1}{2n}\left\|\hat{Y}^{*} - \hat{X}^{*}\gamma\right\|^2 + \sum_{j=1}^{n} P_{\lambda}(|\gamma_j|) \qquad (2)$$
(7) Use the BIC information criterion to select the optimal tuning parameter λ in the penalized weighted least-squares objective function Q(β, σ; λ);
(8) Use the KKT conditions to convert the optimization of the penalized weighted least-squares objective function into a saddle-point system, and solve it with the conjugate gradient algorithm, thereby selecting and estimating the sparse parameter vector γ;
(9) From the duality relation between σ_i and γ_i, γ_i = 1 - 1/σ_i, obtain the selection and estimation of the standard deviation vector σ; a component of σ that corresponds to a non-zero component in the estimate of the sparse parameter vector γ is a singular variance, i.e., any component of the estimate of σ that is not equal to 1 is a singular variance; the data points to be tested that correspond to singular variances are abnormal points, and checking for singular variances completes the abnormal point detection.
4. The rapid abnormal point detection method based on penalized regression according to claim 1, characterized in that, in step (3), when endogenous explanatory variables exist, the specific steps for detecting abnormal points are:
(1) Introduce the mean-shift parameter vector η into the linear regression model of step (1) to construct the mean-shift model, which is expressed as:
Y = Xβ + η + ε (3)
where the error term ε ~ N(0, σ^2 I) and the mean-shift parameter vector η = (η_1, …, η_n)^T;
(2) Obtain the instrumental variable vector W, and derive the corresponding conditional moment restriction from the mean-shift model:
E[g(Y, Xβ + η) | W] = 0 (4)
where g(·,·) is a known bivariate function, taken as g(t_1, t_2) = t_1 - t_2;
(3) Construct two different sets of transformations of the instrumental variable vector W using B-splines or Fourier series:
F = (f_1(W), …, f_p(W))^T (5)
H = (h_1(W), …, h_p(W))^T (6);
(4) Construct the over-identification conditions from the conditional moment restriction and the two sets of transformations of the instrumental variable vector W:
E[g(Y, Xβ + η) F] = 0 (7)
E[g(Y, Xβ + η) H] = 0 (8);
(5) Introduce the indicator function I(η_j ≠ 0) of each component of the mean-shift parameter vector η, and construct the fused generalized-moment (FGMM) loss function L_FGMM(η) from the over-identification conditions and these indicator functions:
$$L_{FGMM}(\eta) = \sum_{j=1}^{n} I(\eta_j \neq 0)\left\{ \omega_{j1}\left[\frac{1}{n}\sum_{i=1}^{n} g(Y_i, X_i\beta+\eta)\, f_j(W_i)\right]^2 + \omega_{j2}\left[\frac{1}{n}\sum_{i=1}^{n} g(Y_i, X_i\beta+\eta)\, h_j(W_i)\right]^2 \right\} \qquad (9)$$
where ω_j1 and ω_j2 are given weights; for convenience of notation, let V_i(η) = (F_i(η)^T, H_i(η)^T)^T; the matrix form of the FGMM loss function L_FGMM(η) is then:
$$L_{FGMM}(\eta) = \left[\frac{1}{n}\sum_{i=1}^{n} g(Y_i, X_i\beta+\eta)\, V_i(\eta)\right]^{T} J(\eta) \left[\frac{1}{n}\sum_{i=1}^{n} g(Y_i, X_i\beta+\eta)\, V_i(\eta)\right] \qquad (10)$$
where […], and (l_1, …, l_r) are the indices of the non-zero components of the mean-shift parameter vector η;
(6) Introduce the penalty function P_λ(|η_j|) of each component of the mean-shift parameter vector η;
(7) Construct the penalized FGMM objective function Q_FGMM(η) from the FGMM loss function L_FGMM(η) and the penalty function P_λ(|η_j|) of each component of the mean-shift parameter vector η:
$$Q_{FGMM}(\eta) = L_{FGMM}(\eta) + \sum_{j=1}^{n} P_{\lambda}(|\eta_j|) \qquad (11)$$
where P_λ(·) is the penalty function and λ is the tuning parameter;
(8) Use the BIC information criterion to select the optimal tuning parameter λ in the penalized FGMM objective function Q_FGMM(η);
(9) Let […] denote a smoothing kernel function, where F(t) is a twice-differentiable cumulative distribution function;
(10) As h_n → 0+, the smoothing kernel function […] converges to the indicator function I(η_j ≠ 0); the smoothing technique is therefore used to replace the indicator function I(η_j ≠ 0) in the FGMM loss function L_FGMM(η) with the smoothing kernel function […], which gives the smoothed FGMM loss function L_K; combining L_K with the penalty function on the mean-shift parameter η gives the smoothed penalized FGMM objective function Q_K:
$$Q_{K}(\eta) = L_{K}(\eta) + \sum_{j=1}^{n} P_{\lambda}(|\eta_j|) \qquad (12);$$
(11) Use the iterative coordinate descent method to optimize the smoothed penalized FGMM objective function Q_K, thereby selecting and estimating the mean-shift parameter vector η; the data points to be tested that correspond to the non-zero components of the estimate η̂ are abnormal points, and checking the non-zero components of η̂ completes the detection of abnormal points.
CN201610141620.6A 2016-03-11 2016-03-11 Rapid abnormal point detection method based on penalized regression Pending CN105824785A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610141620.6A CN105824785A (en) 2016-03-11 2016-03-11 Rapid abnormal point detection method based on penalized regression

Publications (1)

Publication Number Publication Date
CN105824785A true CN105824785A (en) 2016-08-03

Family

ID=56987183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610141620.6A Pending CN105824785A (en) 2016-03-11 2016-03-11 Rapid abnormal point detection method based on penalized regression

Country Status (1)

Country Link
CN (1) CN105824785A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101122881A (en) * 2007-09-20 2008-02-13 福建星网锐捷网络有限公司 CPU abnormal point positioning diagnosis method based MIPS structure
CN103150354A (en) * 2013-01-30 2013-06-12 王少夫 Data mining algorithm based on rough set
CN104361611A (en) * 2014-11-18 2015-02-18 南京信息工程大学 Group sparsity robust PCA-based moving object detecting method
JP2015114916A (en) * 2013-12-12 2015-06-22 日本電信電話株式会社 Data analysis device and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘遵雄 等: "SCAR惩罚逻辑回归的财务预警模型", 《统计与信息论坛》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106327340A (en) * 2016-08-04 2017-01-11 中国银联股份有限公司 Method and device for detecting abnormal node set in financial network
CN106327340B (en) * 2016-08-04 2022-01-07 中国银联股份有限公司 Abnormal node set detection method and device for financial network
CN110717543A (en) * 2019-10-14 2020-01-21 北京工业大学 Double-window concept drift detection method based on sample distribution statistical test
CN110717543B (en) * 2019-10-14 2023-09-19 北京工业大学 Double window concept drift detection method based on sample distribution statistical test
CN111696099A (en) * 2020-06-16 2020-09-22 北京大学 General outlier likelihood estimation method based on image edge consistency
CN111696099B (en) * 2020-06-16 2022-09-27 北京大学 General outlier likelihood estimation method based on image edge consistency

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160803
