CN108549669A - Outlier detection method for big data - Google Patents

Outlier detection method for big data

Info

Publication number
CN108549669A
Authority
CN
China
Prior art keywords
data
data set
attribute
tuple
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810249198.5A
Other languages
Chinese (zh)
Inventor
徐小龙
崇卫之
段卫华
贾佳
刘大勇
胥备
王俊昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201810249198.5A
Publication of CN108549669A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Abstract

The invention discloses an outlier detection method for big data. The method takes, as the feature of each data tuple, the numbers of occurrences in the entire data set of the combinations of that tuple's attribute values. Because this feature is computed from the frequency of every attribute-value combination of the tuple across the whole data set, it comprehensively and accurately reflects the degree of difference between the tuple and the data set as a whole, so tuples whose features differ markedly from the whole data set can be detected and treated as outliers. In addition, the present invention applies attribute-importance knowledge from rough set theory to reduce the dimensionality of high-dimensional data, so that high-dimensional data sets can be handled effectively. The method of the present invention achieves high outlier detection accuracy and is simple and practicable: it requires no assumptions about the distribution of the data in the data set, no domain knowledge, and no training of an estimation model on the data set, which saves a large amount of time in outlier detection.

Description

Outlier detection method for big data
Technical field
The present invention relates to an outlier detection method for big data, and belongs to the field of data preprocessing technology.
Background technology
In recent years, with the rapid development of information technology, the volume of data worldwide has kept growing at an astonishing rate, and the world has entered the big data era. How to obtain valuable data or information from complex data has become a focus of attention. Outlier detection is an important direction in data mining. It focuses on the small fraction of objects in a data set that, compared with the rest of the data, do not conform to the general model of the data set; the data in this small fraction are called outliers. Outlier detection is the data mining technique of finding unconventional patterns in massive data. Its purpose is to eliminate noise from the raw data set or to discover potentially valuable information hidden in it. It is widely applied in fields such as quality control, fault detection, financial fraud, web log analysis, medical treatment, environmental science, and smart cities. In many scientific domains, outlier data may bring new inspiration, leading to the discovery of new knowledge and the development of new applications. The detection of outliers therefore has highly important theoretical significance and practical application value, and outlier detection and analysis has become a vital task in data mining and data management. However, traditional outlier detection methods generally suffer from low detection accuracy and cannot handle large-scale, high-dimensional data sets.
Therefore, there is an urgent need for an algorithm with high outlier detection accuracy that is well suited to the environment of large-scale data sets.
Summary of the invention
The technical problem to be solved by the present invention is to provide an outlier detection method for big data, which defines objects whose data features differ markedly from the data set as outliers. The method achieves high detection accuracy, and its distributed implementation adapts to large-scale, high-dimensional data sets.
The present invention adopts the following technical scheme to solve the above technical problem:
An outlier detection method for big data, comprising the following steps:
Step 1: Let data set D contain m rows and n columns, where each row is a data tuple and each column is an attribute. Scan each data tuple Dj of data set D and number it sequentially with j, obtaining a new data set D1 = (j, Dj), j = 1, ..., m;
Step 2: According to the definition of a "partition" in rough set theory, scan data set D1 and group identical data tuples into the same class, U/IND(A) = {C1, C2, ..., Ct}, where U denotes the universe, A denotes the set of all attributes, Ck denotes the k-th class, k = 1, ..., t, and t denotes the total number of classes; two data tuples are identical when each attribute value of one equals the corresponding attribute value of the other. Count the number of data tuples in each class Ck and compute the knowledge entropy E(A) of all attributes with respect to the universe U;
Step 3: Select each attribute Ai in turn and remove the column of attribute values corresponding to Ai from data set D1. For the remaining data set, group identical data tuples into the same class, U/IND(A - {Ai}) = {C1, C2, ..., Ct}, count the number of data tuples in each class, and compute the knowledge entropy E(A - {Ai}) of the remaining attributes with respect to the universe U, i = 1, ..., n, while also computing the attribute importance of Ai. Sort the attribute importances of all attributes in descending order and select from data set D1 the attributes corresponding to the top p importances, forming a new data set D2, p < n;
Step 4: Scan data set D2, form the combinations of all attribute values in D2, and count the number of occurrences of each attribute-value combination in the entire data set D2;
Step 5: Scan data set D2 and form the combinations of the attribute values of each data tuple of D2, obtaining the attribute-value combination set of each data tuple;
Step 6: Based on steps 4 and 5, take the counts in the entire data set D2 of the combinations in each data tuple's attribute-value combination set as the feature vector of that data tuple, and take the sum of the components of the feature vector as the characteristic value of the data tuple;
Step 7: Sort the characteristic values of all data tuples in ascending order; the data tuples corresponding to the smallest q characteristic values are the outliers of data set D, q < m.
As a preferred embodiment of the present invention, the knowledge entropy E(A) of all attributes with respect to the universe U in step 2 is calculated as follows:
E(A) = -Σ_{k=1..t} (|Ck|/|U|) log2(|Ck|/|U|)
where Ck denotes the k-th class, k = 1, ..., t, t denotes the total number of classes, |Ck| denotes the number of data tuples in Ck, U denotes the universe, and |U| denotes the total number of data tuples.
As a preferred embodiment of the present invention, the attribute importance of attribute Ai in step 3 is calculated as follows:
Sig(Ai) = E(A) - E(A - {Ai})
where Sig(Ai) denotes the attribute importance of Ai, E(A) denotes the knowledge entropy of all attributes with respect to the universe U, and E(A - {Ai}) denotes the knowledge entropy of the remaining attributes with respect to U after removing Ai.
As a preferred embodiment of the present invention, p and q are preset positive integers.
As a preferred embodiment of the present invention, steps 4 and 5 are not limited to any particular order.
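For illustration only, the following single-machine Python sketch shows how steps 1 through 7 fit together. It is not the claimed distributed implementation; the function names, the toy data set, and the parameter choices p = 2 and q = 1 are assumptions made for this example.

from collections import Counter
from itertools import combinations
from math import log2

def knowledge_entropy(rows):
    # E(A) over the partition U/IND(A): identical tuples fall into one class Ck.
    m = len(rows)
    return -sum(c / m * log2(c / m) for c in Counter(rows).values())

def reduce_attributes(rows, p):
    # Step 3: keep the p attributes with the largest Sig(Ai) = E(A) - E(A - {Ai}).
    e_all = knowledge_entropy(rows)
    sig = {i: e_all - knowledge_entropy([t[:i] + t[i + 1:] for t in rows])
           for i in range(len(rows[0]))}
    keep = sorted(sorted(sig, key=sig.get, reverse=True)[:p])
    return [tuple(t[i] for i in keep) for t in rows]

def detect_outliers(rows, p, q):
    rows = reduce_attributes([tuple(t) for t in rows], p)
    # Step 4: count every attribute-value combination over the whole data set D2.
    counts = Counter(c for t in rows
                     for u in range(1, len(t) + 1)
                     for c in combinations(t, u))
    # Steps 5-7: the feature vector of a tuple is the counts of its own
    # combinations; the characteristic value Zj is their sum; the q tuples
    # with the smallest Zj are reported as outliers.
    z = {j: sum(counts[c] for u in range(1, len(t) + 1)
                for c in combinations(t, u))
         for j, t in enumerate(rows, start=1)}
    return sorted(z, key=z.get)[:q]

data = [("red", "round", "small")] * 5 + \
       [("green", "long", "small")] * 4 + \
       [("red", "long", "small")]
print(detect_outliers(data, p=2, q=1))   # -> [10]

On this toy data, the attribute whose removal leaves the partition unchanged has zero importance and is dropped in step 3, and the single tuple whose attribute-value combinations are rare receives the smallest characteristic value.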
Compared with the prior art, the above technical scheme of the present invention has the following technical effects:
1. The present invention is suitable for big data. At present most methods process small data sets on a single machine; however, with the development of informatization, data volume has surged and data dimensionality has become huge, so it is clearly inappropriate to process large-scale data sets on a single machine. The present invention can be implemented on a distributed data processing platform based on the Map-Reduce programming model to detect outliers in large-scale data sets.
2, the method for the present invention is simple and practicable, is not required to the distribution of data in master data set, domain knowledge, also need not be Training estimation model on data set, a large amount of time is saved for outlier detection.
3, the method for the present invention carries out importance calculating according to Importance of attribute sex knowledge in rough set to each attribute.
4, the method for the present invention uses UCI machine learning data, and the outlier inspection of different dimensions is carried out in multiple data sets It surveys, the results showed that the present invention has higher outlier detection accuracy rate.
Description of the drawings
Fig. 1 is the algorithm sequence diagram of the outlier detection method for big data of the present invention.
Specific embodiments
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting the claims.
The present invention is an improved and comprehensive method that gives outliers a new definition: objects whose data features differ markedly from the data set are taken as outliers. The idea of the algorithm is to identify the outliers in a data set by means of the features of the data: each data tuple in the data set has its own feature, which depends on the entire data set. This feature comes from the combinations of all attribute values of the tuple; the algorithm counts, for each data tuple, the occurrences in the entire data set of each of its attribute-value combinations, and uses these counts as the tuple's feature vector. The components of the feature vector are summed to obtain the characteristic value of each data tuple, and finally the characteristic values of all tuples are sorted: the smaller the characteristic value, the more the tuple differs from the data set as a whole, i.e., the more it qualifies as an outlier. In addition, the present invention applies attribute-importance knowledge from rough set theory to reduce the dimensionality of high-dimensional data, so that high-dimensional data sets can be handled effectively.
To facilitate understanding of the technical scheme of the present invention, the data tuple feature model, attribute importance, and the Map-Reduce-based parallelization of the algorithm are briefly introduced first.
1. Data tuple feature model
Let the raw data set be D with m rows and n columns, i.e., data set D has m data tuples and each data tuple has n attributes. Data set D can then be defined as:
D = {A1, A2, ..., An}
where Ai (1 ≤ i ≤ n) denotes the i-th column (attribute) of data set D.
A data tuple in the data set is denoted:
Dj = {Vj(Ai) | 1 ≤ j ≤ m, 1 ≤ i ≤ n}
where Dj denotes the j-th data tuple of data set D and Vj(Ai) denotes the value in row j, column i of D, i.e., the value of the i-th attribute of the j-th tuple.
The attribute-value combination set of data tuple Dj is defined as:
Sj = {C(y, u) | 1 ≤ y ≤ n, 1 ≤ u ≤ y}
where C(y, u) is an attribute-value combination of data tuple Dj, i.e., a combination of u unordered attribute values chosen from the y attribute values of the tuple.
The feature vector and characteristic value of data tuple Dj are:
Fj = (N(c1), N(c2), ..., N(cL)), Zj = Σ_{l=1..L} N(cl)
where N(cl) denotes the number of occurrences in the entire data set of the l-th attribute-value combination cl of the set Sj, and L denotes the number of attribute-value combinations in Sj. Fj is the feature vector of data tuple Dj and Zj is its characteristic value.
The main goal of the algorithm is to detect outliers through the characteristic value Zj.
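As a concrete illustration of these definitions, the short Python sketch below enumerates Sj for one tuple and evaluates Fj and Zj against given combination counts. The variable names and the example values (taken from the worked example later in this description) are assumptions for illustration, not part of the invention.

from itertools import combinations

tuple_j = ("half-hearted", "excellent")   # the attribute values of one tuple Dj

# Sj: all unordered combinations C(y, u) of the tuple's y attribute values.
S_j = [c for u in range(1, len(tuple_j) + 1)
       for c in combinations(tuple_j, u)]

# N(cl): occurrences of each combination in the entire data set.
N = {("half-hearted",): 5, ("excellent",): 5, ("half-hearted", "excellent"): 1}

F_j = [N[c] for c in S_j]   # feature vector Fj = [5, 5, 1]
Z_j = sum(F_j)              # characteristic value Zj = 11
print(F_j, Z_j)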
2. Attribute importance
Since the dimensionality of the data set may be very high, a dimensionality reduction operation on the data set is needed to reduce computational complexity. The important rough set concept of "attribute importance" is used to perform this reduction, selecting the important attributes for outlier detection.
The partition of the universe U by attribute set A is defined as:
U/IND(A) = {C1, C2, ..., Ct}
The knowledge entropy is defined as:
E(A) = -Σ_{k=1..t} (|Ck|/|U|) log2(|Ck|/|U|)
where Ck denotes the k-th class, |Ck| denotes the number of data tuples in Ck, and t denotes the total number of classes.
For any Ai ∈ A, the attribute importance of Ai is defined as:
Sig(Ai) = E(A) - E(A - {Ai})
Using attribute-importance knowledge from rough set theory, the importance of each attribute in the data set can be calculated; the importances are then sorted to obtain an appropriate attribute sequence for outlier detection.
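A minimal Python sketch of these definitions (with assumed names and toy values, not the distributed implementation) shows how the partition U/IND(A) and the knowledge entropy are obtained:

from collections import defaultdict
from math import log2

# Toy data: tuple number -> attribute values (assumed for illustration).
rows = {1: ("male", "excellent"), 2: ("male", "excellent"), 3: ("male", "poor")}

# U/IND(A): identical tuples fall into the same class Ck.
classes = defaultdict(list)
for j, values in rows.items():
    classes[values].append(j)
partition = list(classes.values())        # [[1, 2], [3]]

# E(A) = -sum(|Ck|/|U| * log2(|Ck|/|U|)) over the classes of the partition.
m = len(rows)
E_A = -sum(len(Ck) / m * log2(len(Ck) / m) for Ck in partition)
print(partition, round(E_A, 3))           # [[1, 2], [3]] 0.918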
3. Map-Reduce-based parallelization of the algorithm
Map-Reduce is a parallel programming framework and is currently the most popular computing model on cloud computing platforms. Its basic idea is a divide-and-conquer strategy for large-scale data sets. Map-Reduce operates on data in key/value format. The core of its parallelization lies in the two operations Map and Reduce: the Map-Reduce framework first splits the data set into many small files of the same size and distributes them to different nodes; each node performs the Map computation, the results are sorted and merged, and the values with the same key are placed in the same set for the Reduce computation.
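The following toy Python sketch mimics these three phases (map, shuffle/sort, reduce) on a single machine; it is an assumption for illustration, not the Hadoop API, and uses the word-count pattern that stage 3 module 2 below applies to attribute-value combinations.

from itertools import groupby

def map_fn(record):
    yield (record, 1)                 # emit one key/value pair per input record

def reduce_fn(key, values):
    return (key, sum(values))         # combine all values that share a key

records = ["excellent", "poor", "excellent", "excellent"]

# Map phase: each input record produces key/value pairs.
pairs = [kv for r in records for kv in map_fn(r)]

# Shuffle/sort phase: pairs are sorted and grouped by key.
pairs.sort(key=lambda kv: kv[0])
grouped = [(k, [v for _, v in g]) for k, g in groupby(pairs, key=lambda kv: kv[0])]

# Reduce phase: one call per key.
print([reduce_fn(k, vs) for k, vs in grouped])   # [('excellent', 3), ('poor', 1)]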
The present invention provides an algorithm based on the Map-Reduce programming framework to realize distributed execution of the method. As shown in Fig. 1, the algorithm is divided into 4 stages: first each data tuple in the data set is numbered, then the algorithm judges whether the dimensionality of the data set is too high; if the dimensionality is high, a dimensionality reduction operation is performed, and if the dimensionality is acceptable, outlier detection is carried out directly.
Stage 1: the algorithm scans data set D and labels each data tuple with a unique number, outputting the number and the data tuple of each record; these records constitute the result file of this stage.
Stage 2: the algorithm performs dimensionality reduction on the data set. This stage is divided into two modules that carry out the attribute importance calculation on the data set.
Module 1 scans the result file of stage 1 and finds the partition U/IND(A) = {C1, C2, ..., Ct} of attribute set A over the data set. It counts the number of data tuples in each class Ck (1 ≤ k ≤ t) so that the knowledge entropy E(A) can be computed, and outputs E(A) to obtain the result file.
Module 2 scans the result file of stage 1 and the result file of module 1 of stage 2. Each attribute Ai (Ai ∈ A) is selected in turn for calculation: the partition U/IND(A - {Ai}) = {C1, C2, ..., Ct} of the remaining attribute set over the data set is found, the number of data tuples in each class Ck (1 ≤ k ≤ t) is counted so that the knowledge entropy E(A - {Ai}) can be computed, and the attribute importance Sig(Ai) is then calculated. All attributes are ranked by importance, the top p important attributes are selected, and finally only these p attributes of the original data set are retained, producing the result file.
Stage 3: this stage is divided into 3 modules that compute the features of the data tuples; modules 1 and 2 can run at the same time.
Module 1 scans the result file of module 2 of stage 2, computes the attribute-value combinations C(y, u) of each data tuple to form the tuple's attribute-value combination set Sj, and outputs the data tuple number and Sj; these records form its result file.
Module 2 scans the result file of module 2 of stage 2 and counts the occurrences of every attribute-value combination C(y, u) of all data tuples in that file, outputting each C(y, u) and its count; these records constitute the result file of this module.
Module 3 scans the result files of module 1 and module 2 of stage 3. Using the statistics in the result file of module 2, it determines, for each data tuple in the result file of module 1, the count of each of the tuple's attribute-value combinations C(y, u), forming the feature vector Fj (1 ≤ j ≤ m) of the tuple. The components of the feature vector are summed to give the final characteristic value Zj (1 ≤ j ≤ m) of the tuple. The data tuple number and its characteristic value are output; these records constitute the result file of this module.
Stage 4: the algorithm scans the result file of module 3 of stage 3 and sorts its data tuples statistically, finding the q smallest characteristic values; the features of these tuples differ from the overall features.
Stage 1: labeling data tuples.
Input: data set D
Output: data set D1, consisting of the number j of each data tuple of D and the data tuple itself
Map<Object, Text, Text, Text>
Input: key = offset, value = Dj
1. FOR each <key, value> DO
2.   Outkey: key
     Outvalue: value
Reduce<Text, Text, Text, Text>
1. FOR each value in valuelist DO
2.   ADD j into tuple
3.   Outkey: j
     Outvalue: value
In this stage the algorithm scans data set D, numbers each data tuple in D with j (1 ≤ j ≤ m), and outputs the number j and Dj, forming result data set D1.
Stage 2: this stage is divided into two modules that perform the attribute importance calculation on the data set.
Module 1: compute the knowledge entropy E(A).
Input: data set D1
Output: data set D2
Map<Object, Text, Text, Text>
Input: key = offset, value = (j, Dj)
1. FOR each <key, value> DO
2.   Outkey: j
     Outvalue: value
Reduce<Text, Text, Text, Text>
1. FOR each value in valuelist DO
     IF value in Map<value, i>
       i = i + 1                  // another tuple identical to value: enlarge its class Ck
     ELSE
       ADD <value, 1> into Map    // first tuple of a new class
2. FOR each <value, i> in Map DO
     E(A) -= (i/m) * log2(i/m)
3. Outkey: null
   Outvalue: E(A)
This module finds the partition U/IND(A) = {C1, C2, ..., Ct} of attribute set A over the data set, computes the knowledge entropy E(A), and outputs result data set D2.
Module 2: compute the attribute importance Sig(Ai) and perform dimensionality reduction.
Input: data sets D1 and D2 and parameter p
Output: data set D3, retaining only the important attributes
Map<Object, Text, Text, Text>
Input: key = offset, value = E(A)
1. FOR each <key, value> DO
     SearchSet(E(A))              // load E(A) from D2 into memory
Map<Object, Text, Text, Text>
Input: key = offset, value = (j, Dj)
1. FOR each <key, value> DO
2.   Outkey: j
     Outvalue: value - {Ai}       // remove the column of the attribute Ai under test
Reduce<Text, Text, Text, Text>
1. FOR each value in valuelist DO
     IF value in Map<value, i>
       i = i + 1
     ELSE
       ADD <value, 1> into Map
   FOR each <value, i> in Map DO
     E(A - {Ai}) -= (i/m) * log2(i/m)
   Sig(Ai) = E(A) - E(A - {Ai})
   ADD Sig(Ai) into Map<Ai, Sig(Ai)>
2. Outkey: null
   Outvalue: Map<Ai, Sig(Ai)>
Map<Object, Text, Text, Text>
Input: key = offset, value = (j, Dj)
1. FOR each <key, value> DO
     FOR each of the n - p least important attributes Ai in Map<Ai, Sig(Ai)> DO
       value = value - {Ai}
2. Outkey: j
   Outvalue: value
In this module, one Map-Reduce program selects each attribute Ai (Ai ∈ A) in turn for calculation: it finds the partition U/IND(A - {Ai}) = {C1, C2, ..., Ct} of the remaining attribute set over the data set and counts the number of data tuples in each class Ck (1 ≤ k ≤ t), so that the knowledge entropy E(A - {Ai}) can be computed, and then calculates the attribute importance Sig(Ai). The attributes are sorted and the top p important attributes are put into a set. A second Map-Reduce program then retains only these p attributes of data set D1, obtaining data set D3.
Stage 3 is divided into 3 modules that compute the features of the data tuples; modules 1 and 2 can run at the same time.
Module 1: compute the attribute-value combination set Sj of each data tuple from data set D3.
Input: data set D3
Output: data set D4, formed of each data tuple number j and the tuple's attribute-value combination set Sj
Map<Object, Text, Text, Text>
Input: key = offset, value = (j, Dj)
1. FOR each <key, value> DO
2.   FOR each value DO
       Combine(value)             // generate the attribute-value combinations C(y, u)
3.   Outkey: j
     Outvalue: C(y, u)
Reduce<Text, Text, Text, Text>
1. FOR each C(y, u) in valuelist DO
     ADD C(y, u) into Sj
2. Outkey: j
   Outvalue: Sj
The algorithm of this module scans D3, splits each data tuple of D3 into the tuple number and the attribute values, computes the attribute-value combinations C(y, u) through the function Combine(), gathers all C(y, u) into the attribute-value combination set Sj and outputs it, forming the new data set D4.
Module 2: count the occurrences countc of each attribute-value combination in the entire data set D3.
Input: data set D3
Output: data set D5, formed of each attribute-value combination C(y, u) in D3 and its count countc
Map<Object, Text, Text, Text>
Input: key = offset, value = (j, Dj)
1. FOR each <key, value> DO
2.   FOR each value DO
       ADD Combine(value) into List vallist
3.   FOR each C(y, u) in vallist DO
       Outkey: C(y, u)
       Outvalue: 1
Reduce<Text, Text, Text, Text>
1. FOR each in valuelist DO
2.   countc = number of occurrences of C(y, u)   // sum the 1s emitted for this key
3. Outkey: C(y, u)
   Outvalue: countc
The algorithm of this module scans data set D3 and, through the function Combine(), counts the occurrences countc of the attribute-value combinations of each data tuple Dj in D3, outputting result set D5.
Module 3: compute the feature vector and characteristic value of each data tuple of data set D3 from data sets D4 and D5.
Map<Object, Text, Text, Text>
Input: key = offset, value = (C(y, u), countc)
1. FOR each <key, value> DO
2.   SearchSet(C(y, u), countc)   // load the combination counts from D5 into memory
Map<Object, Text, Text, Text>
Input: key = offset, value = (j, Sj)
1. FOR each <key, value> DO
2.   IF SearchSet.containsKey(C(y, u))
       ADD countc into Fj
     Zj = sum of Fj
3. Outkey: j
   Outvalue: Zj
The algorithm of this module scans data set D5 and puts the attribute-value combinations C(y, u) of data set D3 and their counts countc into SearchSet as key-value pairs; it then scans the attribute-value combinations C(y, u) of each data tuple in data set D4 and looks up the count of each C(y, u) in SearchSet, forming the feature vector Fj of each data tuple of D3. The feature vector is summed to obtain the characteristic value Zj, which is output, giving the new result data set D6.
Stage 4: sort the data tuples of D6; the q smallest characteristic values identify the outliers.
Map<Object, Text, Text, Text>
Input: key = offset, value = (j, Zj)
1. FOR each <key, value> DO
2.   Outkey: j
     Outvalue: Zj
Reduce<Text, Text, Text, Text>
1. FOR each in valuelist DO
     FOR the first q values DO
       Outkey: j
       Outvalue: Zj
The algorithm of this stage scans data set D6; after the Map and Reduce operations, the Map-Reduce framework automatically performs a global sort on Zj, yielding the first q outliers.
The method of the present invention is illustrated below with a specific embodiment.
Data set:

Tuple number   Gender   Attitude toward study   School grade
1              Male     Conscientious           Excellent
2              Male     Conscientious           Excellent
3              Male     Conscientious           Excellent
4              Male     Conscientious           Excellent
5              Male     Half-hearted            Poor
6              Male     Half-hearted            Poor
7              Male     Half-hearted            Poor
8              Male     Half-hearted            Poor
9              Male     Half-hearted            Excellent
1. Dimensionality reduction
The partition of the universe by attribute set A:
U/IND(gender, attitude toward study, school grade) = {{1,2,3,4}, {5,6,7,8}, {9}}
E(A) = -(4/9·log(4/9) + 4/9·log(4/9) + 1/9·log(1/9)) = -8/9·log(4/9) - 1/9·log(1/9)
(1) Compute the attribute importance of gender:
The partition by A - {gender}:
U/IND(A - {gender}) = U/IND(attitude toward study, school grade) = {{1,2,3,4}, {5,6,7,8}, {9}}
E(A - {gender}) = -(4/9·log(4/9) + 4/9·log(4/9) + 1/9·log(1/9)) = -8/9·log(4/9) - 1/9·log(1/9)
Sig(gender) = E(A) - E(A - {gender}) = 0
(2) Importance of Attributes of attitude towards study is calculated:
A- attitudes towards study divide domain:
U/IND (A- attitudes towards study)=U/IND (gender, school grade)={ { 1,2,3,4,9 }, { 5,6,7,8 } }
E (A- attitudes towards study)=- (5/9log (5/9)+4/9log (4/9))=- 5/9log (5/9) -4/9log (4/9)
Sig (attitude towards study)=E (A)-E (A- attitudes towards study)=- 8/9log (4/9) -1/9log1/9+5/9log (5/9) +4/9log(4/9)
(3) Importance of Attributes of school grade is calculated:
A- school grades divide domain:
U/IND (A- school grades)==U/IND (gender, attitude towards study)={ { 1,2,3,4 }, { 5,6,7,8,9 } }
E (A- school grades)=- (5/9log (5/9)+4/9log (4/9))=- 5/9log (5/9) -4/9log (4/9)
Sig (school grade)=E (A)-E (A- school grades)=- 8/9log (4/9) -1/9log1/9+5/9log (5/9) +4/9log(4/9)
Thus: Sig(gender) < Sig(attitude toward study) = Sig(school grade)
Therefore the gender attribute is deleted.
2. Outlier detection
Tuple number   Attitude toward study   School grade
1              Conscientious           Excellent
2              Conscientious           Excellent
3              Conscientious           Excellent
4              Conscientious           Excellent
5              Half-hearted            Poor
6              Half-hearted            Poor
7              Half-hearted            Poor
8              Half-hearted            Poor
9              Half-hearted            Excellent
(1) Count the occurrences of each attribute-value combination in the entire data set:
(Conscientious): 4
(Half-hearted): 5
(Excellent): 5
(Poor): 4
(Conscientious, Excellent): 4
(Half-hearted, Excellent): 1
(Half-hearted, Poor): 4
(2) Compute the attribute-value combination set of each data tuple; for example, the combination set of tuple 1 is {(Conscientious), (Excellent), (Conscientious, Excellent)}.
(3) Compute the feature vector and characteristic value of each data tuple:
Tuple number   Feature vector   Characteristic value
1              {4, 5, 4}        13
2              {4, 5, 4}        13
3              {4, 5, 4}        13
4              {4, 5, 4}        13
5              {5, 4, 4}        13
6              {5, 4, 4}        13
7              {5, 4, 4}        13
8              {5, 4, 4}        13
9              {5, 5, 1}        11
(4) Sort the data tuples by characteristic value:
Tuple number   Characteristic value
9              11
1              13
2              13
3              13
4              13
5              13
6              13
7              13
8              13
Tuple 9 has the smallest characteristic value and can therefore be regarded as an outlier.
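This worked example can be reproduced with a short single-machine Python script (names assumed; this is a check of the arithmetic, not the Map-Reduce implementation described above):

from collections import Counter
from itertools import combinations

# The reduced data set after deleting the gender attribute (tuples 1-9).
rows = [("Conscientious", "Excellent")] * 4 + \
       [("Half-hearted", "Poor")] * 4 + \
       [("Half-hearted", "Excellent")]

# Step (1): occurrences of every attribute-value combination in the data set.
counts = Counter(c for t in rows
                 for u in range(1, len(t) + 1)
                 for c in combinations(t, u))

# Steps (2)-(4): characteristic value of each tuple and the resulting order.
for j, t in enumerate(rows, start=1):
    Zj = sum(counts[c] for u in range(1, len(t) + 1) for c in combinations(t, u))
    print(j, Zj)   # tuples 1-8 print 13; tuple 9 prints 11 and is the outlier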
The above embodiment merely illustrates the technical idea of the present invention and cannot be used to limit the protection scope of the present invention. Any change made on the basis of the technical scheme in accordance with the technical idea proposed by the present invention falls within the protection scope of the present invention.

Claims (5)

1. An outlier detection method for big data, characterized by comprising the following steps:
Step 1: Let data set D contain m rows and n columns, where each row is a data tuple and each column is an attribute. Scan each data tuple Dj of data set D and number it sequentially with j, obtaining a new data set D1 = (j, Dj), j = 1, ..., m;
Step 2: According to the definition of a "partition" in rough set theory, scan data set D1 and group identical data tuples into the same class, U/IND(A) = {C1, C2, ..., Ct}, where U denotes the universe, A denotes the set of all attributes, Ck denotes the k-th class, k = 1, ..., t, and t denotes the total number of classes; two data tuples are identical when each attribute value of one equals the corresponding attribute value of the other. Count the number of data tuples in each class Ck and compute the knowledge entropy E(A) of all attributes with respect to the universe U;
Step 3: Select each attribute Ai in turn and remove the column of attribute values corresponding to Ai from data set D1. For the remaining data set, group identical data tuples into the same class, U/IND(A - {Ai}) = {C1, C2, ..., Ct}, count the number of data tuples in each class, and compute the knowledge entropy E(A - {Ai}) of the remaining attributes with respect to the universe U, i = 1, ..., n, while also computing the attribute importance of Ai. Sort the attribute importances of all attributes in descending order and select from data set D1 the attributes corresponding to the top p importances, forming a new data set D2, p < n;
Step 4: Scan data set D2, form the combinations of all attribute values in D2, and count the number of occurrences of each attribute-value combination in the entire data set D2;
Step 5: Scan data set D2 and form the combinations of the attribute values of each data tuple of D2, obtaining the attribute-value combination set of each data tuple;
Step 6: Based on steps 4 and 5, take the counts in the entire data set D2 of the combinations in each data tuple's attribute-value combination set as the feature vector of that data tuple, and take the sum of the components of the feature vector as the characteristic value of the data tuple;
Step 7: Sort the characteristic values of all data tuples in ascending order; the data tuples corresponding to the smallest q characteristic values are the outliers of data set D, q < m.
2. The outlier detection method for big data according to claim 1, characterized in that the knowledge entropy E(A) of all attributes with respect to the universe U in step 2 is calculated as follows:
E(A) = -Σ_{k=1..t} (|Ck|/|U|) log2(|Ck|/|U|)
where Ck denotes the k-th class, k = 1, ..., t, t denotes the total number of classes, |Ck| denotes the number of data tuples in Ck, U denotes the universe, and |U| denotes the total number of data tuples.
3. The outlier detection method for big data according to claim 1, characterized in that the attribute importance of attribute Ai in step 3 is calculated as follows:
Sig(Ai) = E(A) - E(A - {Ai})
where Sig(Ai) denotes the attribute importance of Ai, E(A) denotes the knowledge entropy of all attributes with respect to the universe U, and E(A - {Ai}) denotes the knowledge entropy of the remaining attributes with respect to U after removing Ai.
4. The outlier detection method for big data according to claim 1, characterized in that p and q are preset positive integers.
5. The outlier detection method for big data according to claim 1, characterized in that steps 4 and 5 are not limited to any particular order.
CN201810249198.5A 2018-03-21 2018-03-21 Outlier detection method for big data Pending CN108549669A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810249198.5A CN108549669A (en) 2018-03-21 2018-03-21 Outlier detection method for big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810249198.5A CN108549669A (en) 2018-03-21 2018-03-21 Outlier detection method for big data

Publications (1)

Publication Number Publication Date
CN108549669A (en) 2018-09-18

Family

ID=63516955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810249198.5A Pending CN108549669A (en) Outlier detection method for big data

Country Status (1)

Country Link
CN (1) CN108549669A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288014A (en) * 2019-06-21 2019-09-27 南京信息工程大学 A kind of local Outliers Detection method based on comentropy weighting



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180918