CN108038211A

CN108038211A - An unsupervised context-based anomaly detection method for relational data

Info

Publication number: CN108038211A
Application number: CN201711379664.3A
Authority: CN
Inventors: 孟凡; 葛笑天; 王皓; 陈烜松; 高阳
Original assignee: Audit Office Of Jiangsu Province; Nanjing University
Current assignee: Audit Office Of Jiangsu Province; Nanjing University
Priority date: 2017-12-13
Filing date: 2017-12-13
Publication date: 2018-05-15

Abstract

The invention discloses an unsupervised, context-based anomaly detection method for relational data, comprising the following steps: merging and preprocessing multi-source relational data tables; computing the Intra dependency values of the attribute values; computing the Inter dependency values of the attribute values; computing the attribute dependency graph structure from the Intra and Inter attribute dependencies; computing the context attribute set with a heuristic recursive backward elimination algorithm; computing a context-attribute-based similarity matrix with an improved categorical attribute distance learning algorithm; selecting a reference sample candidate set with a reference sample selector and computing the reference sample from it; computing the context-based outlier factor sequence of the data; and computing the candidate set of anomalous relational data and determining the anomalous data. The invention can automatically mine and exploit the latent structure and relationships among the attributes of relational data, and can therefore perform unsupervised anomaly detection without the guidance of prior domain knowledge.

Description

An unsupervised context-based anomaly detection method for relational data

Technical field

The present invention relates to an unsupervised anomaly detection method for relational data, and specifically to a context-based method for performing unsupervised anomaly detection on relational data.

Background art

Unsupervised anomaly detection methods are widely used in practical applications such as government data anomaly detection, auditing, commercial fraud detection, and anomaly analysis of medical records. Compared with anomaly detection methods based on supervised learning, unsupervised methods can learn from a given data set in a data-driven manner when domain knowledge is weak or prior guidance is insufficient, and then use the learned knowledge for subsequent anomaly detection.

Current unsupervised anomaly detection methods fall mainly into distance-based, clustering-based, probabilistic-model-based, and information-entropy-based algorithms. They rely chiefly on distance or similarity measures between samples as the main anomaly factor, from which a measure of each original sample's degree of abnormality is derived. In practice, samples and attributes are not simply independent and identically distributed; latent dependencies and correlations exist among them. Mining these latent attribute relationships (the context structure) thoroughly and effectively therefore helps to improve the accuracy of anomaly detection.

Summary of the invention

Object of the invention: In view of the deficiencies of the prior art, the object of the present invention is to provide an unsupervised anomaly detection method based on context structure.

Technical solution: The present invention processes input samples of different relational types and, using an attribute dependency detection method that does not assume independent and identically distributed data, mines the attribute structure and dependencies in a given data set. By integrating several different unsupervised learning submodules, it forms a novel unsupervised, context-based anomaly detection method for relational data. The method comprises the following steps:

(1) merge and preprocess the multi-source relational data tables;

(2) compute the Intra dependency values of the attribute values from the resulting relational attributes;

(3) compute the Inter dependency values of the attribute values from the resulting relational attributes;

(4) compute the attribute dependency graph structure from the Intra and Inter dependency values obtained in steps (2) and (3);

(5) compute the context attribute set with a heuristic Recursive Backward Elimination (RBE) algorithm, which uses a sequential search strategy to iteratively remove redundant attributes and thereby arrive at a near-optimal attribute subset;

(6) using an improved categorical attribute distance learning algorithm, DILCA (DIstance Learning for Categorical Attributes), combine the context attribute values and behavior attribute values to compute a context-attribute-based similarity matrix; the algorithm mainly uses the conditional probabilities of the context attributes and behavior attributes to compute the similarity and distance between two attribute values;

(7) select a reference sample candidate set with a reference sample selector, which mainly comprises two algorithms: random k-sample selection and center k-sample selection;

(8) compute the final reference sample from the selected reference sample candidate set;

(9) compute the context-based outlier factor sequence of the data from the computed context-attribute-based similarity matrix and the reference sample;

(10) compute the candidate set of anomalous relational data and determine the anomalous data.

Beneficial effects: Compared with the prior art, the notable advantage of the present invention is that it proposes a novel method which, without the guidance of prior domain knowledge, can automatically mine and exploit the latent structure and relationships among the attributes of relational data, and can thereby perform unsupervised anomaly detection.

Brief description of the drawings

Fig. 1 is a structural diagram of the components of the method of the present invention.

Fig. 2 is a structural diagram of the key submodules of the method of the present invention.

Fig. 3 is a flow chart of the method of the present invention.

Fig. 4 is a structural diagram of the multi-source merging and preprocessing of the method of the present invention.

Fig. 5 is an explanatory diagram of an example of the method of the present invention.

Embodiment

As shown in Fig. 1, the method of the present invention comprises automatic context attribute detection, a context-attribute-based distance metric algorithm, a similarity matrix based on context attribute values, and a core library module of classical unsupervised anomaly detection algorithms.

As shown in Fig. 2, the key submodules of the method include components such as automatic context attribute detection and the context-attribute-based distance metric algorithm.

The flow of the method is shown in Fig. 3 and is described in detail as follows:

Step 1: Merge and preprocess multiple relational tables from different relational databases. This includes joining the relational data tables through primary-key and foreign-key relationships and removing or backfilling the null values within individual tables. The detailed merging and preprocessing procedure for multi-relational data is shown in Fig. 4;
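
As a minimal sketch of step 1, assuming each table is held in memory as a list of dictionaries and that null values are removed rather than backfilled, the join and null-value handling could look as follows. The function names `merge_tables` and `drop_nulls`, the column names, and the sample rows are illustrative assumptions, not part of the method as disclosed.

```python
from typing import Any

Row = dict[str, Any]

def merge_tables(left: list[Row], right: list[Row], key: str) -> list[Row]:
    """Inner-join two tables on a shared primary/foreign key column."""
    index: dict[Any, list[Row]] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    merged = []
    for row in left:
        for match in index.get(row[key], []):
            merged.append({**row, **match})
    return merged

def drop_nulls(rows: list[Row]) -> list[Row]:
    """Remove every row that contains a null (None) value."""
    return [r for r in rows if all(v is not None for v in r.values())]

# Hypothetical input tables sharing the key attribute A1.
table1 = [{"A1": 1, "A2": "x"}, {"A1": 2, "A2": None}, {"A1": 3, "A2": "y"}]
table2 = [{"A1": 1, "A3": "p"}, {"A1": 2, "A3": "q"}, {"A1": 4, "A3": "r"}]

base_table = drop_nulls(merge_tables(table1, table2, "A1"))
print(base_table)
```

Backfilling, the alternative named in step 1, would replace `drop_nulls` with a function that substitutes a chosen value for each null.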

Step 2: From the merged and preprocessed relational attributes, compute the intra-attribute (Intra) dependency value of each attribute value. The formulas below give the Intra dependency computation for values of the same attribute, where δ(·) denotes the dependency measure for a single attribute value within one attribute, m denotes the mode of the attribute, p(m) denotes the probability that m occurs, p(v_i) denotes the probability of a single attribute value v_i, base(m) denotes the baseline degree of deviation of a single attribute value, and dev(v_i) denotes the degree of deviation of a single attribute value v_i from the mode m. The calculation is as follows:

base(m) = 1 - p(m)
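
The quantities p(v_i), m, and base(m) can be illustrated with a short sketch, assuming probabilities are estimated as relative frequencies within one attribute column. The helper names and the sample column are assumptions for illustration; dev(v_i) and δ(·) are deliberately not implemented here.

```python
from collections import Counter

def value_probabilities(column):
    """Relative frequency p(v) of every value v in one attribute column."""
    counts = Counter(column)
    total = len(column)
    return {value: count / total for value, count in counts.items()}

def base_deviation(column):
    """Return the mode m of the column and base(m) = 1 - p(m)."""
    probs = value_probabilities(column)
    mode = max(probs, key=probs.get)
    return mode, 1.0 - probs[mode]

# Hypothetical categorical attribute with five records.
column = ["a", "a", "a", "b", "c"]
mode, base = base_deviation(column)
print(mode, base)
```

An attribute dominated by a single value gives a base(m) close to 0, while an attribute whose values are spread evenly gives a larger base(m).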

Step 3: From the merged and preprocessed relational attributes, compute the inter-attribute (Inter) dependency values between different attribute values, where v_i and v_j denote two attribute values and the Inter dependency measure quantifies the dependency between them. The specific formula is as follows:

Step 4: Integrate the intra-attribute (Intra) and inter-attribute (Inter) dependencies computed in steps 2 and 3. The result can be expressed as follows, where A(v_i, v_j) denotes the degree of dependency between two attribute values:

The dependency graph structure between attributes can then be computed, where A_n and A_m denote two attributes, δ^*(A_n) denotes the measure of the degree of dependency of the single attribute A_n, the inter-attribute measure denotes the degree of dependency between attributes A_n and A_m, and A^*(A_n, A_m) denotes the overall attribute dependency after the two are combined. The specific formulas are as follows:

Step 5: Compute the context attribute set on the basis of the classical heuristic Recursive Backward Elimination (RBE) algorithm. Using a sequential search strategy, the algorithm iteratively removes attributes to arrive at a near-optimal attribute subset. Here A_n and A_m denote two attributes, J(T_C) denotes the objective function optimized when learning the context attributes, and T_C denotes the learned context attribute set. The specific learning formula is as follows:
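
Independently of the exact form of J(T_C), the elimination loop of step 5 can be sketched generically: start from all attributes and repeatedly drop the attribute whose removal most improves the objective, stopping when no removal helps. In the sketch below the objective is passed in as a function; `toy_objective` and its relevance scores are purely hypothetical stand-ins and not the learning function of the method.

```python
def recursive_backward_elimination(attributes, objective):
    """Greedy backward elimination over attribute subsets.

    `objective` maps a frozenset of attributes to a score (higher is better).
    """
    current = frozenset(attributes)
    best = objective(current)
    while len(current) > 1:
        # Sorted iteration keeps tie-breaking deterministic.
        trials = [(objective(current - {a}), a) for a in sorted(current)]
        score, removed = max(trials, key=lambda t: t[0])
        if score <= best:
            break  # no single removal improves the objective
        current, best = current - {removed}, score
    return current

# Hypothetical objective: total relevance minus a penalty per attribute.
relevance = {"A2": 0.9, "A3": 0.8, "A4": 0.1, "A5": 0.05}

def toy_objective(subset):
    return sum(relevance[a] for a in subset) - 0.3 * len(subset)

context = recursive_backward_elimination(relevance, toy_objective)
print(sorted(context))
```

Because the search is greedy, the result is a near-optimal subset rather than a guaranteed optimum, which matches the description of the algorithm as heuristic.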

Step 6: Improve the conventional distance learning algorithm for discrete attribute values, DILCA (DIstance Learning for Categorical Attributes), and combine the automatically detected context attributes with the behavior attributes to compute the context-attribute-based similarity matrix. The algorithm mainly uses the learned context attributes and behavior attributes, and computes the similarity and distance between two attribute values by way of conditional probabilities. Here y_i and y_j denote the behavior attribute values of two relational data sample points, X denotes the context attribute set, x_k is a context attribute value, P(y_i|x_k) denotes the probability that y_i occurs given the context attribute value x_k, and P(y_j|x_k) denotes the probability that y_j occurs given the context attribute value x_k. The similarity distance between the two attribute values is computed as follows:
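
For illustration only, the sketch below computes a distance between two behavior attribute values from the conditional probabilities P(y_i|x_k) and P(y_j|x_k), using the normalized Euclidean form of the published DILCA algorithm. This form is an assumption and may differ from the improved formula of the present method; the row layout and the attribute names `ctx` and `beh` are likewise hypothetical.

```python
from collections import Counter
from math import sqrt

def conditional_probs(rows, target, context_attr):
    """Estimate P(y | x) from co-occurrence counts, keyed by (x, y)."""
    joint = Counter((r[context_attr], r[target]) for r in rows)
    marginal = Counter(r[context_attr] for r in rows)
    return {(x, y): c / marginal[x] for (x, y), c in joint.items()}

def dilca_distance(rows, target, context, y_i, y_j):
    """Distance between target values y_i and y_j over a non-empty context set."""
    total, n_values = 0.0, 0
    for attr in context:
        probs = conditional_probs(rows, target, attr)
        values = {r[attr] for r in rows}
        n_values += len(values)
        for x in values:
            diff = probs.get((x, y_i), 0.0) - probs.get((x, y_j), 0.0)
            total += diff * diff
    return sqrt(total / n_values)

# Hypothetical records: "a" and "c" share context "u", "b" occurs only under "v".
rows = [
    {"ctx": "u", "beh": "a"}, {"ctx": "u", "beh": "a"}, {"ctx": "u", "beh": "c"},
    {"ctx": "v", "beh": "b"}, {"ctx": "v", "beh": "b"},
]
d_ac = dilca_distance(rows, "beh", ["ctx"], "a", "c")
d_ab = dilca_distance(rows, "beh", ["ctx"], "a", "b")
print(d_ac, d_ab)
```

Values that tend to occur under the same context values come out close together, so here "a" is nearer to "c" than to "b". Evaluating the function for every pair of values yields a distance matrix over attribute values.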

Step 7: Select the reference sample candidate set with the reference sample selector, which mainly comprises two algorithms: random k-sample selection and center k-sample selection. Random k-sample selection draws k samples at random from the data set as the reference candidate set. Center k-sample selection clusters the data into k clusters and selects the k sample points nearest the respective cluster centers as the reference candidate set.
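
The two selectors can be sketched as follows. The clustering algorithm is not named in the description, so this sketch assumes a simple k-modes style clustering with Hamming distance, and reads center k-sample selection as taking the one sample nearest each cluster center. Both choices, and all function names, are assumptions for illustration.

```python
import random
from collections import Counter

def hamming(a, b):
    """Number of attribute positions in which two samples differ."""
    return sum(x != y for x, y in zip(a, b))

def random_k(samples, k, seed=0):
    """Random k-sample selection: draw k samples without replacement."""
    return random.Random(seed).sample(samples, k)

def mode_vector(cluster):
    """Per-attribute most frequent value, used as the cluster center."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def center_k(samples, k, iterations=10, seed=0):
    """Center k-sample selection via a minimal k-modes style clustering."""
    centers = random.Random(seed).sample(samples, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k), key=lambda c: hamming(s, centers[c]))
            clusters[nearest].append(s)
        centers = [mode_vector(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    # One sample per non-empty cluster, so fewer than k may be returned.
    return [min(c, key=lambda s: hamming(s, centers[i]))
            for i, c in enumerate(clusters) if c]

# Hypothetical categorical samples (one tuple per record).
samples = [("a", "x"), ("a", "x"), ("a", "y"),
           ("b", "z"), ("b", "z"), ("b", "w")]
random_refs = random_k(samples, 2)
center_refs = center_k(samples, 2)
print(random_refs, center_refs)
```

Random selection is cheap but may miss sparse regions of the data, whereas the center-based selector spreads the candidates across clusters at the cost of a clustering pass.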

Step 8: Compute the final reference sample from the reference candidate set produced in step 7. A frequency-based method is used: the mean of the attribute value frequencies over the candidate set is computed, and the candidate nearest to this mean attribute frequency is selected as the final reference sample;
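
One possible reading of step 8 is sketched below: each candidate is mapped to the vector of relative frequencies of its attribute values within the candidate set, the mean frequency vector is computed, and the candidate whose vector is nearest to that mean (in Euclidean distance) is returned. The choice of Euclidean distance and this exact construction are assumptions, since the description does not fix them.

```python
from collections import Counter
from math import dist

def final_reference(candidates):
    """Pick the candidate whose value-frequency vector is nearest the mean."""
    n = len(candidates)
    # Frequency table for each attribute (column) of the candidate set.
    freqs = [Counter(col) for col in zip(*candidates)]
    vectors = [[freqs[j][v] / n for j, v in enumerate(c)] for c in candidates]
    mean = [sum(col) / n for col in zip(*vectors)]
    best = min(range(n), key=lambda i: dist(vectors[i], mean))
    return candidates[best]

# Hypothetical candidate set with two categorical attributes.
candidates = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
reference = final_reference(candidates)
print(reference)
```

Under this reading, a candidate made up of rare values, such as ("b", "y") here, lies far from the mean frequency vector and is not chosen as the reference.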

Step 9: From the computed context-attribute-based similarity matrix and the final reference sample, compute the context-based outlier factor sequence of the data, that is, the outlier factors of all samples in the data set. The outlier factor is computed with the function OF(·), where d denotes the sample under test and ref denotes the reference sample. The remaining two terms denote, respectively, the entry of the step 6 distance matrix to which the value of the sample under test in attribute A_i is mapped, and the entry to which the value of the reference sample in attribute A_i is mapped. The specific formula is as follows:

Step 10: Sort the samples by the outlier factors obtained in step 9, determine the candidate set of anomalous relational data, and from it determine the anomalous data.
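
Steps 9 and 10 together can be sketched as scoring every sample against the reference sample and ranking the scores. The sketch assumes that the outlier factor is the sum, over attributes, of the learned distance between the sample's value and the reference sample's value, and that the candidate set is the top `top_n` samples; both the aggregation and the cutoff are assumptions, as is the layout of `distances`.

```python
def outlier_factor(sample, ref, distances):
    """Sum of per-attribute value distances between a sample and the reference.

    `distances[i]` maps frozenset({v1, v2}) to the learned distance between
    two distinct values of attribute i.
    """
    total = 0.0
    for i, (s, r) in enumerate(zip(sample, ref)):
        if s != r:
            total += distances[i][frozenset((s, r))]
    return total

def rank_outliers(samples, ref, distances, top_n):
    """Return (outlier factor, sample index) pairs, largest factor first."""
    scored = [(outlier_factor(s, ref, distances), idx)
              for idx, s in enumerate(samples)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_n]

# Hypothetical samples, reference sample, and per-attribute value distances.
samples = [("a", "x"), ("a", "y"), ("b", "z")]
ref = ("a", "x")
distances = [
    {frozenset(("a", "b")): 0.9},
    {frozenset(("x", "y")): 0.2, frozenset(("x", "z")): 0.8,
     frozenset(("y", "z")): 0.5},
]
top = rank_outliers(samples, ref, distances, top_n=1)
print(top)
```

The sample that differs most from the reference under the learned value distances appears first in the ranking and becomes the leading anomaly candidate.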

As illustrated in Fig. 5, the process is briefly as follows:

(1) the input consists of two relational data tables, where Table 1 has 5 attributes and Table 2 has 7 attributes;

(2) in the preprocessing stage, the two tables are joined and merged through the primary/foreign key (attribute A₁), and preprocessing operations such as removing null values are applied to form the base table;

(3) for the merged base table, the Intra attribute value dependency matrix and the Inter attribute value dependency matrix are computed with the methods of steps 2 and 3;

(4) according to step 4, the computed Intra and Inter attribute value dependency matrices are combined to compute the matrix representing the attribute dependencies of the base table;

(5) according to step 5, the context attributes are computed; in this example they are {A₂, A₃, A₄, A₅};

(6) according to step 6, an attribute value similarity distance matrix of size 32 × 32 is computed, since the base table in this example has 32 categorical attribute values in total;

(7) according to the algorithm of step 7, the reference sample candidate set is selected; 20 candidate reference samples are selected in this example;

(8) according to the algorithm of step 8, the final reference sample is computed;

(9) according to step 9, the outlier factor sequence of the base table in this example is computed: the context-based distance between each sample in the base table and the reference sample determines the outlier factor sequence. The outlier factor sequence in this example is as follows:

(10) according to step 10, the anomalous samples in the base table of this example are determined to be {d₇₆, d₂₄₉}.

Demonstration effect: This example shows that, compared with traditional anomaly detection, the method can use unsupervised learning to automatically mine and exploit the latent structure and relationships among the attributes of relational data (that is, obtain the context attributes) without the guidance of prior domain knowledge, measure the distance and similarity between samples effectively, and achieve good practical results.

Claims

1. An unsupervised context-based anomaly detection method for relational data, characterized in that the method comprises the following steps:

(1) merging and preprocessing multi-source relational data tables;

(4) computing the attribute dependency structure from the Intra dependency values and Inter dependency values obtained in steps (2) and (3);

(5) computing the context attribute set with a heuristic recursive backward elimination algorithm;

(6) using an improved categorical attribute distance learning algorithm, combining the context attribute values and behavior attribute values to compute a context-attribute-based similarity matrix;

(7) selecting a reference sample candidate set with a reference sample selector;

(8) computing the reference sample from the selected reference sample candidate set;

(9) computing the context-based outlier factor sequence of the data from the computed context-attribute-based similarity matrix and the reference sample;

2. The unsupervised context-based anomaly detection method for relational data according to claim 1, characterized in that, in step (5), the heuristic recursive backward elimination algorithm uses a sequential search strategy to iteratively remove redundant attributes, thereby computing a near-optimal attribute subset.

3. The unsupervised context-based anomaly detection method for relational data according to claim 1, characterized in that, in step (6), the improved categorical attribute distance learning algorithm uses the context attribute values and behavior attribute values and, by computing conditional probabilities, computes the similarity and distance between the context attribute values and behavior attribute values.

4. The unsupervised context-based anomaly detection method for relational data according to claim 1, characterized in that step (7) includes two algorithms: a random k-sample selection algorithm and a center k-sample selection algorithm.

5. The unsupervised context-based anomaly detection method for relational data according to claim 4, characterized in that the random k-sample selection algorithm draws k samples at random from the data set as the reference sample candidate set; and the center k-sample selection algorithm clusters the data into k clusters and selects the k sample points nearest the respective cluster centers as the reference sample candidate set.