CN108038211A - A context-based unsupervised anomaly detection method for relational data - Google Patents
- Publication number: CN108038211A
- Application number: CN201711379664.3A
- Authority: CN (China)
- Prior art keywords: context, sample, data, attribute, calculated
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24564—Applying rules; Deductive queries
- G06F16/24566—Recursive queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Abstract
The invention discloses a context-based unsupervised anomaly detection method for relational data, comprising the following steps: merging and preprocessing multi-source relational data tables; computing the intra-attribute (Intra) dependence value of each attribute value; computing the inter-attribute (Inter) dependence value of each attribute value; computing the attribute dependency graph structure from the Intra and Inter attribute dependences; computing the context attribute set with a heuristic recursive backward elimination algorithm; computing a context-attribute-based similarity matrix with an improved categorical attribute distance learning algorithm; selecting a reference sample candidate set with a reference sample selector and then computing the reference sample; computing the context-based outlier factor sequence of the data; and computing the candidate set of anomalous relational data and determining the anomalous data. The invention can automatically mine and exploit the latent structure and relationships among relational data attributes without the guidance of prior domain knowledge, and thereby perform unsupervised anomaly detection.
Description
Technical field
The present invention relates to an unsupervised anomaly detection method for relational data, and specifically to a context-based method for performing unsupervised anomaly detection on relational data.
Background
Unsupervised anomaly detection methods are widely used in practical applications such as government data anomaly detection, auditing, commercial fraud detection, and medical record anomaly analysis. Compared with anomaly detection methods based on supervised learning, unsupervised methods can learn from a given data set on their own, in a data-driven manner, when domain knowledge is weak or prior guidance is insufficient, and then use the learned knowledge for subsequent anomaly detection.
Current unsupervised anomaly detection methods fall mainly into distance-based, clustering-based, probabilistic-model-based, and information-entropy-based algorithms. They rely mainly on a distance or similarity measure between samples as the principal outlier factor, from which a measure of the degree of abnormality of each original sample is derived. In practical applications, however, samples and attributes are not simply independent and identically distributed; latent dependences and correlations exist among them. Effectively mining these latent attribute relationships (the context structure) therefore helps to further improve the precision of anomaly detection.
Summary of the invention
Object of the invention: In view of the deficiencies of the prior art, the object of the present invention is to provide an unsupervised anomaly detection method based on context structure.
Technical solution: The present invention processes input samples of different relational types, mines the attribute structure and dependences in a given data set using an attribute dependence detection method based on dependence and distribution, and integrates several different unsupervised learning submodules to form a novel context-based unsupervised anomaly detection method for relational data. The method comprises the following steps:
(1) merging and preprocessing multi-source relational data tables;
(2) computing the Intra dependence value of each attribute value from the resulting relational attributes;
(3) computing the Inter dependence value of each attribute value from the resulting relational attributes;
(4) computing the attribute dependency graph structure from the Intra and Inter dependence values obtained in steps (2) and (3);
(5) computing the context attribute set with a heuristic recursive backward elimination algorithm, RBE (Recursive Backward Elimination); that is, the algorithm follows a sequential search strategy and iteratively removes redundant attributes, thereby computing an approximately optimal attribute subset;
(6) using an improved categorical attribute distance learning algorithm, DILCA (DIstance Learning for Categorical Attributes), to combine context attribute values and behavior attribute values and compute a similarity matrix based on the context attributes; that is, the algorithm mainly uses the conditional probabilities of the context attributes and behavior attributes to compute the similarity and distance between two attribute values;
(7) selecting a reference sample candidate set with a reference sample selector, which mainly comprises two algorithms: a random k-sample selection algorithm and a center k-sample selection algorithm;
(8) computing the final reference sample from the selected reference sample candidate set;
(9) computing the context-based outlier factor sequence of the data from the computed context-attribute-based similarity matrix and the reference sample;
(10) computing the candidate set of anomalous relational data and determining the anomalous data.
Beneficial effects: Compared with the prior art, the notable advantage of the present invention is that it proposes a novel method that can automatically mine and exploit the latent structure and relationships among relational data attributes without the guidance of prior domain knowledge, and thereby perform unsupervised anomaly detection.
Brief description of the drawings
Fig. 1 is a structural diagram of the components of the method of the present invention.
Fig. 2 is a structural diagram of the key submodules of the method of the present invention.
Fig. 3 is a flow chart of the method of the present invention.
Fig. 4 is a structural diagram of the multi-source merging and preprocessing of the method of the present invention.
Fig. 5 is an illustrative example of the method of the present invention.
Detailed description
As shown in Fig. 1, the method of the present invention comprises automatic context attribute detection, a context-attribute-based distance metric algorithm, a context-attribute-value-based similarity matrix, and a core library module of classical unsupervised anomaly detection algorithms.
As shown in Fig. 2, the key submodules of the method include components such as automatic context attribute detection and the context-attribute-based distance metric algorithm.
The flow of the method is shown in Fig. 3 and is described in detail as follows:
Step 1: merge and preprocess multiple relation tables from different relational databases. This includes joining multiple relational data tables through primary-key and foreign-key relationships and removing or backfilling null values in individual tables. The detailed process of merging and preprocessing multi-relational data is shown in Fig. 4;
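As a rough illustration of step 1, the sketch below joins two small tables on a shared key and then removes or backfills null values. It is a minimal pure-Python sketch under stated assumptions, not the patented implementation: the helper names (merge_tables, drop_nulls, backfill_nulls), the inner-join behaviour, and the sample rows are all invented for the example.

```python
def merge_tables(left, right, key):
    """Inner-join two tables (lists of dicts) on a shared key column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    merged = []
    for row in left:
        for match in index.get(row[key], []):
            merged.append({**row, **match})
    return merged


def drop_nulls(rows):
    """Remove every row that contains a null (None) value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def backfill_nulls(rows, column, fill_value):
    """Replace nulls in one column with a caller-supplied value."""
    return [
        {**r, column: fill_value if r[column] is None else r[column]}
        for r in rows
    ]


# Hypothetical input tables sharing the key attribute A1.
table1 = [{"A1": 1, "A2": "x"}, {"A1": 2, "A2": None}, {"A1": 3, "A2": "y"}]
table2 = [{"A1": 1, "A3": "p"}, {"A1": 2, "A3": "q"}, {"A1": 4, "A3": "r"}]

joined = merge_tables(table1, table2, "A1")
base_table = drop_nulls(joined)                      # removal variant
filled_table = backfill_nulls(joined, "A2", "unknown")  # backfill variant
```

Whether to drop or backfill is a design choice the text leaves open; dropping keeps only fully observed rows, while backfilling keeps more samples at the cost of introducing a placeholder value.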
Step 2: from the merged and preprocessed relational attributes, compute the intra-attribute (Intra) dependence value of each attribute value. The following formula gives the intra-attribute dependence computation for values of the same attribute, where δ(·) denotes the dependence measure of a single attribute value within its attribute, m denotes the mode of the attribute, p(m) denotes the probability that m occurs, p(v_i) denotes the probability of the single attribute value v_i, base(m) denotes the baseline degree of deviation, and dev(v_i) denotes the degree of deviation of the single attribute value v_i from the mode m. The calculation is as follows:
base(m) = 1 - p(m)
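Only base(m) = 1 - p(m) is stated in the text; the expressions for dev(v_i) and δ(v_i) do not appear. The sketch below therefore assumes a form that is common in the value-coupling outlier detection literature, dev(v) = (p(m) - p(v)) / p(m) and δ(v) = (dev(v) + base(m)) / 2, purely for illustration. The function name intra_dependence and the sample column are hypothetical.

```python
from collections import Counter


def intra_dependence(column):
    """Return {value: delta(value)} for one categorical column.

    Stated in the source:  base(m) = 1 - p(m)
    Assumed for this sketch (not from the source):
        dev(v)   = (p(m) - p(v)) / p(m)
        delta(v) = (dev(v) + base(m)) / 2
    """
    counts = Counter(column)
    n = len(column)
    probs = {v: c / n for v, c in counts.items()}
    mode = max(probs, key=probs.get)   # m: the most frequent value
    base = 1 - probs[mode]             # base(m) = 1 - p(m)
    return {
        v: ((probs[mode] - p) / probs[mode] + base) / 2
        for v, p in probs.items()
    }


column = ["a", "a", "a", "b"]
delta = intra_dependence(column)
```

Under these assumptions the mode receives the smallest value and rarer values receive larger ones, which matches the description of dev(v_i) as a deviation from the mode.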
Step 3: from the merged and preprocessed relational attributes, compute the inter-attribute (Inter) dependence value between values of different attributes, where v_i and v_j denote two attribute values and the inter-attribute measure expresses the dependence between different attribute values. The specific formula is as follows:
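The formula for the Inter dependence does not appear in this text. One plausible reading, used here only as an assumption, is the conditional probability p(v_i | v_j) that value v_i of one attribute co-occurs with value v_j of another. The sketch below computes that quantity; the function name inter_dependence and the sample rows are hypothetical.

```python
from collections import Counter


def inter_dependence(rows, attr_i, attr_j):
    """Return {(v_i, v_j): p(v_i | v_j)} for values of two different attributes.

    Assumption: the Inter dependence is modelled as a conditional
    co-occurrence probability; the source does not give the formula.
    """
    joint = Counter((r[attr_i], r[attr_j]) for r in rows)
    marginal_j = Counter(r[attr_j] for r in rows)
    return {(vi, vj): c / marginal_j[vj] for (vi, vj), c in joint.items()}


rows = [
    {"A2": "x", "A3": "p"},
    {"A2": "x", "A3": "p"},
    {"A2": "y", "A3": "p"},
    {"A2": "y", "A3": "q"},
]
eta = inter_dependence(rows, "A2", "A3")
```

Only value pairs that actually co-occur get an entry; absent pairs can be treated as having dependence 0.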
Step 4: integrate the intra-attribute (Intra) dependence and the inter-attribute (Inter) dependence computed in steps 2 and 3. The result can be expressed as follows, where A(v_i, v_j) denotes the degree of dependence between two attribute values:
The dependency graph structure between attributes can then be computed, where A_n and A_m denote two attributes, δ*(A_n) denotes the measure of the degree of dependence of the single attribute A_n, the inter-attribute term denotes the measure of the degree of dependence between attributes A_n and A_m, and A*(A_n, A_m) denotes the combined overall attribute dependence. The specific formula is as follows:
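Neither combination formula appears in this text, so the following is a sketch under two explicit assumptions: the value-level dependence is taken as the product A(v_i, v_j) = δ(v_i) · η(v_i, v_j), and the attribute-level edge weight is taken as the mean of the value-level dependences between the two attributes. All names and numbers are hypothetical.

```python
def value_dependence(delta, eta):
    """Assumed rule: A(v_i, v_j) = delta(v_i) * eta(v_i, v_j)."""
    return {(vi, vj): delta[vi] * w for (vi, vj), w in eta.items()}


def attribute_graph(value_dep, owner):
    """Average value-level dependences into attribute-level edge weights.

    owner maps each attribute value to the attribute it belongs to.
    The averaging rule is an assumption made for this sketch.
    """
    sums, counts = {}, {}
    for (vi, vj), w in value_dep.items():
        edge = (owner[vi], owner[vj])
        sums[edge] = sums.get(edge, 0.0) + w
        counts[edge] = counts.get(edge, 0) + 1
    return {edge: sums[edge] / counts[edge] for edge in sums}


delta = {"x": 0.2, "y": 0.6}                  # hypothetical Intra values
eta = {("x", "p"): 0.5, ("y", "p"): 0.5}      # hypothetical Inter values
owner = {"x": "A2", "y": "A2", "p": "A3"}

graph = attribute_graph(value_dependence(delta, eta), owner)
```

The result is a dictionary of weighted directed edges between attributes, which is one simple way to hold a dependency graph before the context attributes are selected.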
Step 5: compute the context attribute set on the basis of the classical heuristic recursive backward elimination algorithm RBE (Recursive Backward Elimination); that is, following a sequential search strategy, the algorithm iteratively removes attributes to arrive at an approximately optimal attribute subset. Here A_n and A_m denote two attributes, J(T_C) denotes the objective function optimized when learning the context attributes, and T_C denotes the learned context attribute set. The specific learning formula is as follows:
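The objective J(T_C) is not given in this text. The sketch below shows only the general shape of a greedy backward elimination loop, with the objective passed in as a function; the toy objective J, the usefulness scores, and the stopping rule are assumptions made for illustration.

```python
def backward_elimination(attributes, score, min_size=1):
    """Greedy backward elimination.

    Each round drops the attribute whose removal gives the best score;
    the loop stops when no removal improves the score or when only
    min_size attributes remain.
    """
    current = list(attributes)
    best = score(current)
    while len(current) > min_size:
        candidates = [[a for a in current if a != drop] for drop in current]
        top = max(candidates, key=score)
        if score(top) <= best:
            break
        current, best = top, score(top)
    return current


# Hypothetical objective: reward useful attributes, penalise subset size.
useful = {"A2": 3.0, "A3": 2.0, "A4": 0.1, "A5": 0.2}


def J(subset):
    return sum(useful[a] for a in subset) - 0.5 * len(subset)


context = backward_elimination(["A2", "A3", "A4", "A5"], J)
```

With this toy objective the loop first drops A4, then A5, and then stops because removing either remaining attribute lowers the score. Greedy elimination gives an approximately optimal subset rather than a guaranteed optimum, which is consistent with the wording of the step.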
Step 6: improve the conventional discrete attribute value distance learning algorithm DILCA (DIstance Learning for Categorical Attributes) by combining the automatically detected context attributes with the behavior attributes, and compute the context-attribute-based similarity matrix. The algorithm mainly uses the learned context attributes and behavior attributes and computes the similarity and distance between two attribute values from conditional probabilities. Here y_i and y_j denote the behavior attribute values of two relational data samples, X denotes a context attribute set, x_k is a context attribute value, P(y_i|x_k) denotes the probability that y_i occurs given the context attribute value x_k, and P(y_j|x_k) denotes the probability that y_j occurs given the context attribute value x_k. The similarity distance between the two attribute values is calculated as follows:
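The improved formula is not present in this text. For orientation, the sketch below implements the standard DILCA distance, d(y_i, y_j) = sqrt( Σ_X Σ_{x_k ∈ X} (P(y_i|x_k) - P(y_j|x_k))² / Σ_X |X| ), which uses exactly the quantities defined above; any modification introduced by the improved variant is not reproduced. The function name and sample rows are hypothetical.

```python
from collections import Counter
from math import sqrt


def dilca_distance(rows, target, context, yi, yj):
    """Standard DILCA distance between two values of the target attribute.

    rows    : list of dicts, one per sample
    target  : name of the behavior attribute that yi and yj belong to
    context : names of the context attributes
    """
    total, n_values = 0.0, 0
    for attr in context:
        joint = Counter((r[target], r[attr]) for r in rows)
        marginal = Counter(r[attr] for r in rows)
        for xk, count in marginal.items():
            p_i = joint[(yi, xk)] / count   # P(yi | xk)
            p_j = joint[(yj, xk)] / count   # P(yj | xk)
            total += (p_i - p_j) ** 2
        n_values += len(marginal)           # |X| for this context attribute
    return sqrt(total / n_values)


rows = [
    {"Y": "a", "X1": "p"},
    {"Y": "a", "X1": "p"},
    {"Y": "b", "X1": "q"},
    {"Y": "b", "X1": "q"},
]
d_ab = dilca_distance(rows, "Y", ["X1"], "a", "b")
d_aa = dilca_distance(rows, "Y", ["X1"], "a", "a")
```

Evaluating the function for every pair of values of an attribute fills one block of the value similarity distance matrix; two values that always appear in the same contexts get distance 0, and values that never share a context get the maximum distance.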
Step 7: select a reference subset with the reference sample selector, which mainly comprises two algorithms: a random k-sample selection algorithm and a center k-sample selection algorithm. The random k-sample selection algorithm selects k samples at random from the data set as the reference candidate set. The center k-sample selection algorithm clusters the data into k clusters and selects, respectively, the k sample points nearest the cluster centers as the reference candidate set.
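The two selectors can be sketched as follows. The clustering algorithm is not named in the text, so this sketch assumes a minimal k-modes procedure with Hamming distance, which suits categorical data; the function names, the fixed seed, and the iteration count are likewise assumptions.

```python
import random
from collections import Counter


def random_k(samples, k, seed=0):
    """Random k-sample selection: draw k samples without replacement."""
    return random.Random(seed).sample(samples, k)


def hamming(a, b):
    """Number of attributes on which two samples differ."""
    return sum(x != y for x, y in zip(a, b))


def center_k(samples, k, iterations=10, seed=0):
    """Center k-sample selection via a minimal k-modes clustering (assumed)."""
    centers = random.Random(seed).sample(samples, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k), key=lambda c: hamming(s, centers[c]))
            clusters[nearest].append(s)
        # New center: per-attribute mode of each non-empty cluster.
        centers = [
            tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
            if members else centers[c]
            for c, members in enumerate(clusters)
        ]
    # For each center, return the actual sample closest to it.
    return [min(samples, key=lambda s: hamming(s, c)) for c in centers]


samples = [("a", "p"), ("a", "p"), ("a", "q"), ("b", "r"), ("b", "r"), ("c", "r")]
random_candidates = random_k(samples, 3)
center_candidates = center_k(samples, 2)
```

Random selection is cheap but may miss small groups; center selection costs a clustering pass but tends to return representative samples. In this simple version two centers can map to the same sample, so a production version would need to deduplicate.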
Step 8: compute the final reference sample from the reference candidate set produced in step 7. Using a frequency-based method, compute the mean of the attribute value frequencies over the candidate set, and then select the candidate whose attribute value frequencies are closest to this mean as the final reference sample;
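Step 8 is described only in words, so the sketch below is one possible reading rather than the definitive procedure: each candidate is scored by the mean relative frequency of its attribute values within the candidate set, and the candidate whose score is closest to the average score is chosen. The function name and the candidates are hypothetical.

```python
from collections import Counter


def final_reference(candidates):
    """Pick the candidate whose mean attribute-value frequency is closest
    to the average over all candidates (an assumed reading of step 8)."""
    n_attrs = len(candidates[0])
    n = len(candidates)
    # Count of each value, per attribute, within the candidate set.
    freq = [Counter(c[i] for c in candidates) for i in range(n_attrs)]

    def mean_freq(sample):
        return sum(freq[i][v] / n for i, v in enumerate(sample)) / n_attrs

    scores = [mean_freq(c) for c in candidates]
    target = sum(scores) / len(scores)
    best = min(range(n), key=lambda i: abs(scores[i] - target))
    return candidates[best]


candidates = [("a", "p"), ("a", "p"), ("a", "q"), ("b", "r")]
ref = final_reference(candidates)
```

Here the four candidates score 0.625, 0.625, 0.5 and 0.25, the average is 0.5, and the third candidate is selected.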
Step 9: from the computed context-attribute-based similarity matrix and the final reference sample, compute the context-based outlier factor sequence, that is, the outlier factors of all samples in the data set. The outlier factor is computed with the function OF(·), where d denotes the sample under test, ref denotes the reference sample, one term denotes the entry of the step 6 distance matrix to which the value of the sample under test in attribute A_i maps, and the other term denotes the entry of the step 6 distance matrix to which the value of the reference sample in attribute A_i maps. The specific formula is as follows:
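The definition of OF(·) does not appear in this text. As an illustration only, the sketch below assumes the outlier factor is the sum, over attributes, of the learned distance between the sample's value and the reference sample's value. The per-attribute distance tables stand in for the step 6 matrix and contain made-up numbers.

```python
def outlier_factor(sample, ref, distance):
    """Assumed form of OF(d): sum over attributes of the value distance
    between the sample under test and the reference sample.

    distance[i] maps an unordered pair of values of attribute i to their
    distance; identical values contribute 0.
    """
    total = 0.0
    for i, (v, r) in enumerate(zip(sample, ref)):
        if v != r:
            total += distance[i][frozenset((v, r))]
    return total


# Hypothetical per-attribute distance tables.
distance = [
    {frozenset(("a", "b")): 0.9},
    {
        frozenset(("p", "q")): 0.2,
        frozenset(("p", "r")): 0.7,
        frozenset(("q", "r")): 0.6,
    },
]
ref = ("a", "q")
data = {"d1": ("a", "q"), "d2": ("a", "p"), "d3": ("b", "r")}
factors = {name: outlier_factor(s, ref, distance) for name, s in data.items()}
```

A sample identical to the reference gets factor 0, and the factor grows as more of its values lie far from the reference values in the learned distance.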
Step 10: rank the samples by the outlier factors given in step 9, determine the candidate set of anomalous relational data, and then determine the anomalous data.
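The ranking in step 10 can be sketched as a sort followed by a cut-off. The text does not say how the cut-off is chosen, so taking the top n samples is an assumption, as are the function name and the scores.

```python
def top_anomalies(factors, n):
    """Return the names of the n samples with the largest outlier factors."""
    ranked = sorted(factors, key=factors.get, reverse=True)
    return ranked[:n]


# Hypothetical outlier factors for four samples.
factors = {"d1": 0.0, "d2": 0.2, "d3": 1.5, "d4": 1.1}
anomalies = top_anomalies(factors, 2)
```

A threshold on the factor value would work equally well in place of a fixed n; which is preferable depends on whether the expected number of anomalies is known in advance.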
The example illustrated in Fig. 5 proceeds briefly as follows:
(1) the input is two relational data tables, where table 1 has 5 attributes and table 2 has 7 attributes;
(2) in the preprocessing stage, the two tables are associated and merged through the primary and foreign key (attribute A_1), and preprocessing operations such as removing null values are applied to form the base table;
(3) for the merged base table, the Intra attribute value dependence matrix and the Inter attribute value dependence matrix are computed by the methods of steps 2 and 3;
(4) according to step 4, the computed Intra and Inter attribute value dependence matrices are combined to compute the attribute dependence representation matrix of the base table;
(5) according to step 5, the context attributes are computed; in this example they are {A_2, A_3, A_4, A_5};
(6) according to step 6, a 32 × 32 attribute value similarity distance matrix is then computed; that is, the base table in this example has 32 categorical attribute values in total;
(7) according to the algorithm in step 7, the reference sample candidate set is selected; 20 candidate reference samples are selected in this example;
(8) according to the algorithm in step 8, the final reference sample is computed;
(9) according to step 9, the outlier factor sequence of the base table is computed; that is, the outlier factor sequence is determined from the context-based distance between each sample in the base table and the reference sample. The outlier factor sequence in this example is as follows:
(10) according to step 10, the anomalous samples in the base table of this example are determined to be {d_76, d_249}.
Effect of the demonstration: this example shows that, compared with traditional anomaly detection, the method can, without the guidance of prior domain knowledge, use unsupervised learning to automatically mine and exploit the latent structure and relationships among relational data attributes, that is, obtain the context attributes, effectively measure the distance and similarity between samples, and achieve good practical results.
Claims (5)
1. A context-based unsupervised anomaly detection method for relational data, characterized in that the method comprises the following steps:
(1) merging and preprocessing multi-source relational data tables;
(2) computing the Intra dependence value of each attribute value from the resulting relational attributes;
(3) computing the Inter dependence value of each attribute value from the resulting relational attributes;
(4) computing the attribute dependency structure from the Intra and Inter dependence values obtained in steps (2) and (3);
(5) computing the context attribute set with a heuristic recursive backward elimination algorithm;
(6) using an improved categorical attribute distance learning algorithm to combine context attribute values and behavior attribute values and compute a similarity matrix based on the context attributes;
(7) selecting a reference sample candidate set with a reference sample selector;
(8) computing the reference sample from the selected reference sample candidate set;
(9) computing the context-based outlier factor sequence of the data from the computed context-attribute-based similarity matrix and the reference sample;
(10) computing the candidate set of anomalous relational data and determining the anomalous data.
2. The context-based unsupervised anomaly detection method for relational data according to claim 1, characterized in that, in step (5), the heuristic recursive backward elimination algorithm follows a sequential search strategy and iteratively removes redundant attributes, thereby computing an approximately optimal attribute subset.
3. The context-based unsupervised anomaly detection method for relational data according to claim 1, characterized in that, in step (6), the improved categorical attribute distance learning algorithm uses the context attribute values and behavior attribute values and computes the similarity and distance between the context attribute values and behavior attribute values from conditional probabilities.
4. The context-based unsupervised anomaly detection method for relational data according to claim 1, characterized in that step (7) comprises two algorithms: a random k-sample selection algorithm and a center k-sample selection algorithm.
5. The context-based unsupervised anomaly detection method for relational data according to claim 4, characterized in that the random k-sample selection algorithm selects k samples at random from the data set as the reference sample candidate set; and the center k-sample selection algorithm clusters the data into k clusters and selects, respectively, the k sample points nearest the cluster centers as the reference sample candidate set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711379664.3A CN108038211A (en) | 2017-12-13 | 2017-12-13 | A kind of unsupervised relation data method for detecting abnormality based on context |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711379664.3A CN108038211A (en) | 2017-12-13 | 2017-12-13 | A kind of unsupervised relation data method for detecting abnormality based on context |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108038211A (en) | 2018-05-15 |
Family
ID=62100168
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711379664.3A Pending CN108038211A (en) | 2017-12-13 | 2017-12-13 | A kind of unsupervised relation data method for detecting abnormality based on context |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108038211A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101561878A (en) * | 2009-05-31 | 2009-10-21 | 河海大学 | Unsupervised anomaly detection method and system based on improved CURE clustering algorithm |
CN103559420A (en) * | 2013-11-20 | 2014-02-05 | 苏州大学 | Building method and device of anomaly detection training set |
Non-Patent Citations (3)
Title |
---|
DINO IENCO et al.: "A Semisupervised Approach to the Detection and Characterization of Outliers in Categorical Data", IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS * |
GUANSONG PANG et al.: "Unsupervised Feature Selection for Outlier Detection by Modelling Hierarchical Value-Feature Couplings", 2016 IEEE 16TH INTERNATIONAL CONFERENCE ON DATA MINING * |
YEN-CHENG LU et al.: "An Unsupervised Approach to Anomaly Detection in Music Datasets", 2016 ACM * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020078059A1 (en) * | 2018-10-17 | 2020-04-23 | 阿里巴巴集团控股有限公司 | Interpretation feature determination method and device for anomaly detection |
CN109583358A (en) * | 2018-11-26 | 2019-04-05 | 广东智源信息技术有限公司 | A kind of Medical Surveillance fast accurate enforcement approach |
CN111415167A (en) * | 2020-02-19 | 2020-07-14 | 同济大学 | Network fraud transaction detection method and device, computer storage medium and terminal |
CN113239024A (en) * | 2021-04-22 | 2021-08-10 | 辽宁工程技术大学 | Bank abnormal data detection method based on outlier detection |
CN113239024B (en) * | 2021-04-22 | 2023-11-07 | 辽宁工程技术大学 | Bank abnormal data detection method based on outlier detection |
CN116089504A (en) * | 2023-04-10 | 2023-05-09 | 北京宽客进化科技有限公司 | Relational form data generation method and system |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180515 |