CN108038211A - Context-based unsupervised anomaly detection method for relational data - Google Patents

Context-based unsupervised anomaly detection method for relational data

Info

Publication number
CN108038211A
Authority
CN
China
Prior art keywords
context
sample
data
attribute
calculated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711379664.3A
Other languages
Chinese (zh)
Inventor
孟凡
葛笑天
王皓
陈烜松
高阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Audit Office Of Jiangsu Province
Nanjing University
Original Assignee
Audit Office Of Jiangsu Province
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Audit Office Of Jiangsu Province and Nanjing University
Priority to CN201711379664.3A
Publication of CN108038211A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24564 Applying rules; Deductive queries
    • G06F16/24566 Recursive queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Abstract

The invention discloses a context-based unsupervised anomaly detection method for relational data, comprising the following steps: merging and preprocessing multi-source relational data tables; computing the intra-attribute (Intra) dependence of attribute values; computing the inter-attribute (Inter) dependence of attribute values; computing the attribute dependency graph structure from the Intra and Inter dependences; computing the context attribute set with a heuristic recursive backward elimination algorithm; computing a context-attribute-based similarity matrix with an improved distance learning algorithm for categorical attributes; selecting a reference sample candidate set with a reference sample selector and computing the reference sample from it; computing the context-based outlier factor ranking of the data; and computing the anomaly candidate set of the relational data and determining the anomalous data. The invention can automatically mine and exploit the latent structure and relationships among the attributes of relational data, and can therefore perform unsupervised anomaly detection without the guidance of prior domain knowledge.

Description

Context-based unsupervised anomaly detection method for relational data
Technical field
The present invention relates to an unsupervised anomaly detection method for relational data, and specifically to a context-based method for performing unsupervised anomaly detection on relational data.
Background
Unsupervised anomaly detection methods are widely used in practical application scenarios such as government data anomaly detection, auditing, commercial fraud detection, and anomaly analysis of medical records. Compared with anomaly detection methods based on supervised learning, unsupervised methods can learn from a given data set in a data-driven manner when domain knowledge is weak or prior guidance is insufficient, and can then use the learned knowledge for subsequent anomaly detection.
Current unsupervised anomaly detection methods can be broadly divided into distance-based, clustering-based, probabilistic-model-based, and information-entropy-based algorithms. They mainly use the distance or similarity between samples as the principal outlier factor, from which a measure of how anomalous each original sample is can be derived. In practical applications, however, samples and attributes are not simply independent and identically distributed; latent dependence and correlation relationships exist among them. Effectively mining these latent attribute relationships (the context structure) therefore helps to further improve the precision of anomaly detection.
Summary of the invention
Object of the invention: In view of the deficiencies of the prior art, the object of the present invention is to provide an unsupervised anomaly detection method based on context structure.
Technical solution: The present invention processes input samples from different relational sources, uses a distribution-dependent attribute dependence detection method to mine the attribute structure and dependences in a given data set, and integrates several different unsupervised learning submodules to form a novel context-based unsupervised anomaly detection method for relational data. The method comprises the following steps:
(1) merging and preprocessing multi-source relational data tables;
(2) computing the Intra dependence of attribute values from the resulting relational attributes;
(3) computing the Inter dependence of attribute values from the resulting relational attributes;
(4) computing the attribute dependency graph structure from the Intra and Inter dependences obtained in steps (2) and (3);
(5) computing the context attribute set with a heuristic recursive backward elimination algorithm, RBE (Recursive Backward Elimination); that is, the algorithm uses a sequential search strategy to iteratively remove redundant attributes and thereby obtain a near-optimal attribute subset;
(6) using an improved distance learning algorithm for categorical attributes, DILCA (DIstance Learning for Categorical Attributes), to combine context attribute values and behavioral attribute values and compute a context-attribute-based similarity matrix; that is, the algorithm mainly uses the conditional probabilities of context attributes and behavioral attributes to compute the similarity and distance between two attribute values;
(7) selecting a reference sample candidate set with a reference sample selector, which mainly includes two algorithms: a random k-sample selection algorithm and a center k-sample selection algorithm;
(8) computing the final reference sample from the selected reference sample candidate set;
(9) computing the context-based outlier factor ranking of the data from the computed context-attribute-based similarity matrix and the reference sample;
(10) computing the anomaly candidate set of the relational data and determining the anomalous data.
Beneficial effects: Compared with the prior art, the notable advantage of the present invention is that it proposes a novel method that can automatically mine and exploit the latent structure and relationships among the attributes of relational data without the guidance of prior domain knowledge, and can thereby perform unsupervised anomaly detection.
Brief description of the drawings
Fig. 1 is a structural diagram of the components of the method.
Fig. 2 is a structural diagram of the key submodules of the method.
Fig. 3 is a flow chart of the method.
Fig. 4 is a structural diagram of the multi-source merging and preprocessing of the method.
Fig. 5 is an illustrative example of the method.
Detailed description
As shown in Fig. 1, the method comprises modules for automatic context attribute detection, a context-attribute-based distance metric algorithm, a context-attribute-value-based similarity matrix, and a core library of classical unsupervised anomaly detection algorithms.
As shown in Fig. 2, the key submodules of the method include components such as automatic context attribute detection and the context-attribute-based distance metric algorithm.
The flow of the method is shown in Fig. 3, and the method is described in detail as follows:
Step 1: multiple relational tables from different relational databases are merged and preprocessed. This includes joining the relational tables through primary key and foreign key relationships and removing or backfilling null values within each table. The detailed merging and preprocessing procedure for multi-relational data is shown in Fig. 4;
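The merging and preprocessing of Step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each table is a list of dictionaries, uses an inner join on a single shared key, and the function names `merge_tables` and `drop_or_fill_nulls` are hypothetical.

```python
def merge_tables(left, right, key):
    """Inner-join two tables (lists of dicts) on a shared key column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    merged = []
    for row in left:
        for match in index.get(row[key], []):
            merged.append({**row, **match})
    return merged


def drop_or_fill_nulls(rows, fill=None):
    """Drop rows containing None, or backfill None from fill[column] when given."""
    cleaned = []
    for row in rows:
        if fill is None:
            if all(value is not None for value in row.values()):
                cleaned.append(row)
        else:
            cleaned.append(
                {k: (fill.get(k) if v is None else v) for k, v in row.items()}
            )
    return cleaned


# Hypothetical toy tables joined on attribute A1.
table1 = [{"A1": 1, "A2": "x"}, {"A1": 2, "A2": None}]
table2 = [{"A1": 1, "B1": "p"}, {"A1": 2, "B1": "q"}, {"A1": 3, "B1": "r"}]
base = merge_tables(table1, table2, "A1")
```

Whether null values are removed or backfilled is left open by the text, so the sketch supports both through the optional `fill` argument.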
Step 2: from the merged and preprocessed relational attributes, the intra-attribute (Intra) dependence of each attribute value is computed. Here δ(·) denotes the dependence measure of a single attribute value within its own attribute, m denotes the mode of the attribute, p(m) denotes the probability of m, p(v_i) denotes the probability of a single attribute value v_i, base(m) denotes the baseline degree of deviation of a single attribute value, and dev(v_i) denotes the deviation of v_i relative to the mode m. The calculation is as follows:
base(m) = 1 - p(m)
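A small sketch may help make the Intra measure concrete. Only base(m) = 1 - p(m) is given in the text above; the definitions dev(v) = (p(m) - p(v)) / p(m) and δ(v) = (dev(v) + base(m)) / 2 used here are assumptions chosen for illustration, and the function name `intra_dependence` is hypothetical.

```python
from collections import Counter


def intra_dependence(values):
    """Return {value: delta(value)} for one categorical attribute.

    base(m) = 1 - p(m) follows the text; dev and delta are ASSUMED forms.
    """
    counts = Counter(values)
    n = len(values)
    probs = {v: c / n for v, c in counts.items()}
    p_mode = max(probs.values())          # p(m), probability of the mode
    base = 1.0 - p_mode                   # base(m) = 1 - p(m)
    delta = {}
    for v, p in probs.items():
        dev = (p_mode - p) / p_mode       # assumed deviation from the mode
        delta[v] = 0.5 * (dev + base)     # assumed combination
    return delta


delta = intra_dependence(["a", "a", "a", "b"])
```

Under these assumptions the mode receives the lowest score and rarer values receive higher scores, which matches the intent of measuring deviation from the mode.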
Step 3: from the merged and preprocessed relational attributes, the inter-attribute (Inter) dependence between values of different attributes is computed, where v_i and v_j denote two attribute values and the Inter measure quantifies the dependence between values of different attributes. The specific formula is as follows:
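The Inter formula is not reproduced in this text, so the following sketch only illustrates one plausible choice: it assumes the dependence of v_i on v_j is the conditional probability p(v_i | v_j) estimated from co-occurrence counts. The function name `inter_dependence` and the toy rows are hypothetical.

```python
from collections import Counter


def inter_dependence(rows, attr_i, attr_j):
    """Return {(v_i, v_j): p(v_i | v_j)} for values of two different attributes.

    Using the conditional probability as the Inter measure is an ASSUMPTION.
    """
    joint = Counter((r[attr_i], r[attr_j]) for r in rows)
    marginal_j = Counter(r[attr_j] for r in rows)
    return {(vi, vj): c / marginal_j[vj] for (vi, vj), c in joint.items()}


rows = [
    {"A": "x", "B": "p"},
    {"A": "x", "B": "p"},
    {"A": "y", "B": "p"},
    {"A": "y", "B": "q"},
]
eta = inter_dependence(rows, "A", "B")
```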
Step 4: the intra-attribute (Intra) dependence and inter-attribute (Inter) dependence obtained in Steps 2 and 3 are integrated, which can be expressed as follows, where A(v_i, v_j) denotes the degree of dependence between two attribute values:
The dependency graph structure between attributes can then be computed, where A_n and A_m denote two attributes, δ*(A_n) denotes the measure of the degree of dependence of a single attribute A_n, the attribute-level Inter measure quantifies the degree of dependence between attributes A_n and A_m, and A*(A_n, A_m) denotes the joint overall attribute dependence. The specific formula is as follows:
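As a rough illustration of the integration in Step 4, the sketch below builds a weighted value-level graph from Intra and Inter scores. The combination rule A(u, v) = δ(v) · η(u, v) is an assumption made for this example, since the formula itself is not reproduced here; the names `value_dependency_graph`, `delta`, and `eta` are hypothetical.

```python
def value_dependency_graph(delta, eta):
    """Combine Intra scores delta[v] and Inter scores eta[(u, v)] into edge weights.

    The product rule used here is an ASSUMPTION, not the patent's formula.
    """
    return {(u, v): delta[v] * weight for (u, v), weight in eta.items()}


# Hypothetical scores for two attribute values "x" and "p".
delta = {"x": 0.2, "p": 0.5}
eta = {("x", "p"): 0.4, ("p", "x"): 1.0}
graph = value_dependency_graph(delta, eta)
```

An attribute-level graph could be obtained in the same spirit by aggregating these value-level weights over the values of each attribute pair, again as an assumption.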
Step 5: the context attribute set is computed on the basis of the classical heuristic recursive backward elimination algorithm RBE (Recursive Backward Elimination); that is, the algorithm uses a sequential search strategy and iterative removal to compute a near-optimal attribute subset. Here A_n and A_m denote two attributes, J(T_C) denotes the objective function optimized when learning the context attributes, and T_C denotes the learned set of context attributes. The specific learning formula is as follows:
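The backward elimination search of Step 5 can be sketched generically. The objective J is not reproduced in this text, so the example takes it as a caller-supplied function; the greedy stopping rule, the function name, and the toy objective are all assumptions for illustration.

```python
def recursive_backward_elimination(attributes, objective, min_size=1):
    """Greedy backward search over attribute subsets.

    Repeatedly drops the attribute whose removal gives the highest
    objective value, and stops when no removal improves the objective.
    """
    current = list(attributes)
    best = objective(current)
    while len(current) > min_size:
        trials = [
            (objective([a for a in current if a != drop]), drop)
            for drop in current
        ]
        score, drop = max(trials)
        if score <= best:
            break
        best = score
        current = [a for a in current if a != drop]
    return current, best


# Hypothetical objective: per-attribute gain minus a fixed cost per attribute.
gain = {"A2": 2.0, "A3": 1.0, "A4": 0.1, "A5": 0.2}


def toy_objective(subset):
    return sum(gain[a] for a in subset) - 0.5 * len(subset)


context, score = recursive_backward_elimination(gain.keys(), toy_objective)
```

With this toy objective the low-gain attributes A4 and A5 are removed in turn and the search stops at the subset containing A2 and A3.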
Step 6: the conventional distance learning algorithm for discrete attribute values, DILCA (DIstance Learning for Categorical Attributes), is improved by combining the automatically detected context attributes with the behavioral attributes to compute a context-attribute-based similarity matrix. The algorithm mainly uses the learned context attributes and behavioral attributes and computes conditional probabilities to obtain the similarity and distance between two attribute values. Here y_i and y_j denote the behavioral attribute values of two relational data sample points, X denotes the set of context attributes, x_k is a context attribute value, P(y_i|x_k) denotes the probability that y_i occurs given the context attribute value x_k, and P(y_j|x_k) denotes the probability that y_j occurs given x_k. The specific similarity distance between two attribute values is computed as follows:
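To illustrate how conditional probabilities yield a distance between two behavioral attribute values, the sketch below uses the form of the published DILCA distance: the root of the mean squared difference between P(y_i|x_k) and P(y_j|x_k) over all context values x_k. The patent's improved formula is not reproduced here, so this form is an assumption, and `dilca_distance` and the toy rows are hypothetical.

```python
from collections import Counter
from math import sqrt


def dilca_distance(rows, target, context, yi, yj):
    """Distance between values yi and yj of attribute `target`,
    computed from their conditional probabilities given context values."""
    total, n_values = 0.0, 0
    for attr in context:
        joint = Counter((r[attr], r[target]) for r in rows)
        marginal = Counter(r[attr] for r in rows)
        for xk, count in marginal.items():
            p_i = joint[(xk, yi)] / count   # P(yi | xk)
            p_j = joint[(xk, yj)] / count   # P(yj | xk)
            total += (p_i - p_j) ** 2
            n_values += 1
    return sqrt(total / n_values) if n_values else 0.0


rows = [
    {"X": "a", "Y": "u"},
    {"X": "a", "Y": "u"},
    {"X": "b", "Y": "v"},
    {"X": "b", "Y": "w"},
]
d_uv = dilca_distance(rows, "Y", ["X"], "u", "v")
d_vw = dilca_distance(rows, "Y", ["X"], "v", "w")
```

Evaluating the function for every pair of values of every behavioral attribute would fill the similarity distance matrix that the later steps look up.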
Step 7: a reference subset is selected with the reference sample selector, which mainly includes two algorithms: a random k-sample selection algorithm and a center k-sample selection algorithm. The random k-sample selection algorithm selects k samples at random from the data set as the reference candidate set. The center k-sample selection algorithm clusters the data into k clusters and selects the k sample points nearest to the respective cluster centers as the reference candidate set.
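The two selectors of Step 7 might look as follows. The sketch assumes samples are tuples of categorical values, that cluster labels are already available from some clustering step, that a cluster center is the per-attribute mode, and that nearness is measured by Hamming distance; none of these choices are stated in the text, and all names are hypothetical.

```python
import random
from collections import Counter


def random_k_candidates(samples, k, seed=0):
    """Pick k samples uniformly at random (seeded for reproducibility)."""
    return random.Random(seed).sample(samples, k)


def hamming(a, b):
    """Number of attributes on which two samples differ."""
    return sum(x != y for x, y in zip(a, b))


def center_k_candidates(samples, labels):
    """For each cluster, pick the member nearest to the cluster's mode vector."""
    clusters = {}
    for sample, label in zip(samples, labels):
        clusters.setdefault(label, []).append(sample)
    picks = []
    for members in clusters.values():
        center = tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
        picks.append(min(members, key=lambda s: hamming(s, center)))
    return picks


samples = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("c", "z")]
labels = [0, 0, 0, 1, 1, 1]          # assumed output of a prior clustering step
random_picks = random_k_candidates(samples, 3)
center_picks = center_k_candidates(samples, labels)
```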
Step 8: the final reference sample is computed from the reference candidate set produced in Step 7 using a frequency-based method: the mean of the attribute value frequencies over the candidate set is computed, and the candidate whose attribute value frequencies are closest to this mean is selected as the final reference sample;
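One possible reading of the frequency-based selection in Step 8 is sketched below: each candidate is turned into a vector of the relative frequencies of its attribute values within the candidate set, and the candidate closest to the mean vector is returned. The squared Euclidean distance and the function name `final_reference` are assumptions.

```python
from collections import Counter


def final_reference(candidates):
    """Return the candidate whose attribute-value frequencies are closest
    to the mean frequency vector of the candidate set."""
    n = len(candidates)
    freq_tables = [Counter(col) for col in zip(*candidates)]
    vectors = [
        [freq_tables[j][value] / n for j, value in enumerate(candidate)]
        for candidate in candidates
    ]
    mean = [sum(col) / n for col in zip(*vectors)]

    def squared_distance(vec):
        return sum((a - b) ** 2 for a, b in zip(vec, mean))

    best = min(range(n), key=lambda i: squared_distance(vectors[i]))
    return candidates[best]


# Hypothetical candidate set of four samples with two attributes.
candidates = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
reference = final_reference(candidates)
```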
Step 9: from the computed context-attribute-based similarity matrix and the final reference sample, the context-based outlier factor ranking of the data is computed; that is, an outlier factor is produced for every sample in the data set. The outlier factor is computed with the function OF(·), where d denotes the sample under test and ref denotes the reference sample. The value taken by the sample under test on attribute A_i, and the value taken by the reference sample on attribute A_i, are each mapped to the corresponding entry of the distance matrix from Step 6. The specific formula is as follows:
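The outlier factor of Step 9 can be illustrated with a simple lookup-and-sum. The OF(·) formula is not reproduced in this text, so summing the per-attribute distances between the sample's value and the reference's value is an assumption; the per-attribute distance tables and all names are hypothetical.

```python
def outlier_factor(sample, reference, dist):
    """Sum, over attributes, the learned distance between the sample's value
    and the reference sample's value (summation is an ASSUMED aggregation)."""
    total = 0.0
    for i, (u, v) in enumerate(zip(sample, reference)):
        if u == v:
            continue                      # identical values are at distance zero
        table = dist[i]                   # distance table for attribute A_i
        total += table[(u, v)] if (u, v) in table else table[(v, u)]
    return total


def outlier_factors(samples, reference, dist):
    """Outlier factor of every sample with respect to one reference sample."""
    return [outlier_factor(s, reference, dist) for s in samples]


# Hypothetical distance tables, one per attribute, as produced by Step 6.
dist = [{("a", "b"): 0.5}, {("x", "y"): 0.2, ("x", "z"): 0.9}]
reference = ("a", "x")
scores = outlier_factors([("a", "x"), ("b", "y"), ("b", "z")], reference, dist)
```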
Step 10: the samples are sorted by the outlier factors given in Step 9, the anomaly candidate set of the relational data is determined, and the anomalous data are then determined.
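The ranking in Step 10 amounts to a sort on the outlier factor. The text does not state how the candidate set is cut off, so the top-n rule below is an assumption, and the sample identifiers and scores are made up for illustration.

```python
def anomaly_candidates(ids, scores, top_n):
    """Rank samples by outlier factor, highest first, and keep the top n."""
    ranked = sorted(zip(ids, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Hypothetical identifiers and outlier factors.
ids = ["d1", "d2", "d3", "d4"]
scores = [0.1, 2.3, 0.4, 1.9]
top = anomaly_candidates(ids, scores, top_n=2)
```

A threshold on the outlier factor could replace the top-n cut without changing the rest of the pipeline.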
The example illustrated in Fig. 5 proceeds, in brief, as follows:
(1) the input is two relational data tables, where Table 1 has 5 attributes and Table 2 has 7 attributes;
(2) in the preprocessing stage, the two tables are joined and merged through the primary/foreign key (attribute A_1), and preprocessing operations such as removing null values are applied to form the base table;
(3) for the merged base table, the Intra attribute-value dependence matrix and the Inter attribute-value dependence matrix are computed with the methods of Steps 2 and 3;
(4) following Step 4, the computed Intra and Inter attribute-value dependence matrices are combined to compute the matrix representing the attribute dependence relationships of the base table;
(5) following Step 5, the context attributes are computed; in this example they are {A_2, A_3, A_4, A_5};
(6) following Step 6, the attribute-value similarity distance matrix, of size 32 × 32, is computed; that is, the base table in this example has 32 categorical attribute values in total;
(7) following the algorithm of Step 7, the reference sample candidate set is selected; 20 candidate reference samples are selected in this example;
(8) following the algorithm of Step 8, the final reference sample is computed;
(9) following Step 9, the outlier factor ranking of the base table is computed; that is, the context-based distance between each sample in the base table and the reference sample is computed to determine the outlier factor ranking. The outlier factor ranking in this example is as follows:
(10) following Step 10, the anomalous samples in the base table of this example are determined to be {d_76, d_249}.
Demonstration of effect: this example shows that, compared with traditional anomaly detection, the method can, without the guidance of prior domain knowledge, use unsupervised learning to automatically mine and exploit the latent structure and relationships among relational data attributes, that is, obtain the context attributes, so as to effectively measure the distance and similarity between samples and achieve good practical results.

Claims (5)

1. A context-based unsupervised anomaly detection method for relational data, characterized in that the method comprises the following steps:
(1) merging and preprocessing multi-source relational data tables;
(2) computing the Intra dependence of attribute values from the resulting relational attributes;
(3) computing the Inter dependence of attribute values from the resulting relational attributes;
(4) computing the attribute dependency structure from the Intra and Inter dependences obtained in steps (2) and (3);
(5) computing the context attribute set with a heuristic recursive backward elimination algorithm;
(6) using an improved distance learning algorithm for categorical attributes to combine context attribute values and behavioral attribute values and compute a context-attribute-based similarity matrix;
(7) selecting a reference sample candidate set with a reference sample selector;
(8) computing the reference sample from the selected reference sample candidate set;
(9) computing the context-based outlier factor ranking of the data from the computed context-attribute-based similarity matrix and the reference sample;
(10) computing the anomaly candidate set of the relational data and determining the anomalous data.
2. The context-based unsupervised anomaly detection method for relational data according to claim 1, characterized in that, in step (5), the heuristic recursive backward elimination algorithm uses a sequential search strategy to iteratively remove redundant attributes, thereby computing a near-optimal attribute subset.
3. The context-based unsupervised anomaly detection method for relational data according to claim 1, characterized in that, in step (6), the improved distance learning algorithm for categorical attributes uses the context attribute values and behavioral attribute values and computes conditional probabilities to obtain the similarity and distance between the context attribute values and behavioral attribute values.
4. The context-based unsupervised anomaly detection method for relational data according to claim 1, characterized in that step (7) includes two algorithms: a random k-sample selection algorithm and a center k-sample selection algorithm.
5. The context-based unsupervised anomaly detection method for relational data according to claim 4, characterized in that the random k-sample selection algorithm selects k samples at random from the data set as the reference sample candidate set, and the center k-sample selection algorithm clusters the data into k clusters and selects the k sample points nearest to the respective cluster centers as the reference sample candidate set.
CN201711379664.3A 2017-12-13 2017-12-13 Context-based unsupervised anomaly detection method for relational data Pending CN108038211A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711379664.3A CN108038211A (en) 2017-12-13 2017-12-13 Context-based unsupervised anomaly detection method for relational data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711379664.3A CN108038211A (en) 2017-12-13 2017-12-13 Context-based unsupervised anomaly detection method for relational data

Publications (1)

Publication Number Publication Date
CN108038211A (en) 2018-05-15

Family

ID=62100168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711379664.3A Context-based unsupervised anomaly detection method for relational data Pending CN108038211A (en) 2017-12-13 2017-12-13

Country Status (1)

Country Link
CN (1) CN108038211A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101561878A (en) * 2009-05-31 2009-10-21 河海大学 Unsupervised anomaly detection method and system based on improved CURE clustering algorithm
CN103559420A (en) * 2013-11-20 2014-02-05 苏州大学 Building method and device of anomaly detection training set

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DINO IENCO et al.: "A Semisupervised Approach to the Detection and Characterization of Outliers in Categorical Data", IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS *
GUANSONG PANG et al.: "Unsupervised Feature Selection for Outlier Detection by Modelling Hierarchical Value-Feature Couplings", 2016 IEEE 16TH INTERNATIONAL CONFERENCE ON DATA MINING *
YEN-CHENG LU et al.: "An Unsupervised Approach to Anomaly Detection in Music Datasets", 2016 ACM *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020078059A1 (en) * 2018-10-17 2020-04-23 阿里巴巴集团控股有限公司 Interpretation feature determination method and device for anomaly detection
CN109583358A (en) * 2018-11-26 2019-04-05 广东智源信息技术有限公司 A kind of Medical Surveillance fast accurate enforcement approach
CN111415167A (en) * 2020-02-19 2020-07-14 同济大学 Network fraud transaction detection method and device, computer storage medium and terminal
CN113239024A (en) * 2021-04-22 2021-08-10 辽宁工程技术大学 Bank abnormal data detection method based on outlier detection
CN113239024B (en) * 2021-04-22 2023-11-07 辽宁工程技术大学 Bank abnormal data detection method based on outlier detection
CN116089504A (en) * 2023-04-10 2023-05-09 北京宽客进化科技有限公司 Relational form data generation method and system

Similar Documents

Publication Publication Date Title
CN108038211A (en) Context-based unsupervised anomaly detection method for relational data
CN111931868B (en) Time series data abnormity detection method and device
CN107577605A (en) A kind of feature clustering system of selection of software-oriented failure prediction
CN107844414A (en) A kind of spanned item mesh based on defect report analysis, parallelization defect positioning method
CN112508105A (en) Method for detecting and retrieving faults of oil extraction machine
CN107273234A (en) A kind of time series data rejecting outliers and bearing calibration based on EEMD
CN109543693A (en) Weak labeling data noise reduction method based on regularization label propagation
CN116467674A (en) Intelligent fault processing fusion updating system and method for power distribution network
CN111126865B (en) Technology maturity judging method and system based on technology big data
CN115358481A (en) Early warning and identification method, system and device for enterprise ex-situ migration
CN115345458A (en) Business process compliance checking method, computer equipment and readable storage medium
Nagwani et al. A data mining model to predict software bug complexity using bug estimation and clustering
Subali et al. A new model for measuring the complexity of SQL commands
CN106296401A (en) A kind of Strong association rule method for digging understood for stock market's operation logic
CN104111887A (en) Software fault prediction system and method based on Logistic model
CN109614074A (en) Approximate adder reliability degree calculation method based on probability transfer matrix model
CN113570437A (en) Product recommendation method and device
Hao et al. The research and analysis in decision tree algorithm based on C4. 5 algorithm
CN113641733B (en) Real-time intelligent estimation method for river cross section flow
Malik et al. A comprehensive approach towards data preprocessing techniques & association rules
CN116910526A (en) Model training method, device, communication equipment and readable storage medium
CN115587333A (en) Failure analysis fault point prediction method and system based on multi-classification model
CN113554079B (en) Power load abnormal data detection method and system based on secondary detection method
Mahammad et al. Machine Learning Approach to Predict Asthma Prevalence with Decision Trees
CN106294092B (en) Semi-automatic log analysis method and system based on ontology knowledge base

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180515