CN108549669A - An outlier detection method for big data - Google Patents
An outlier detection method for big data
- Publication number
- CN108549669A CN108549669A CN201810249198.5A CN201810249198A CN108549669A CN 108549669 A CN108549669 A CN 108549669A CN 201810249198 A CN201810249198 A CN 201810249198A CN 108549669 A CN108549669 A CN 108549669A
- Authority
- CN
- China
- Prior art keywords
- data
- data set
- attribute
- tuple
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Abstract
The invention discloses an outlier detection method for big data. For each data tuple, the method takes the occurrence counts, over the entire data set, of the combinations of the tuple's attribute values as the feature of that tuple. Because this feature is computed from the counts of all attribute-value combinations of the tuple across the whole data set, it comprehensively and accurately reflects the degree to which the tuple differs from the data set as a whole; tuples whose features differ markedly from the rest of the data set are thereby detected as outliers. In addition, the invention uses rough-set attribute significance to reduce the dimensionality of high-dimensional data, so that high-dimensional data sets can be handled effectively. The method achieves high outlier detection accuracy and is simple and practical: it requires no assumptions about the distribution of the data, no domain knowledge, and no estimation model trained on the data set, which saves a great deal of time during outlier detection.
Description
Technical field
The present invention relates to an outlier detection method for big data, and belongs to the technical field of data preprocessing.
Background technology
With the rapid development of information technology, the volume of data worldwide keeps growing at an astonishing rate, and the world has entered the big data era. How to obtain valuable data or information from complex data has become a focus of attention. Outlier detection is an important direction in data mining. It concerns the small fraction of objects in a data set that, compared with the remaining data, do not conform to the general pattern of the data set; such objects are called outliers. Outlier detection is thus the data mining technique of finding unconventional patterns in massive data. Its purpose is to eliminate noise from the raw data set or to discover potentially valuable information hidden in it. It is widely applied in quality control, fault detection, financial fraud, Web log analysis, medical care, environmental science, smart cities, and other fields. In many scientific domains, outlier data may bring new inspiration, leading to the discovery of new knowledge and the development of new applications. The detection of outliers therefore has significant theoretical and practical value, and outlier detection and analysis has become a vital task in data mining and data management. However, traditional outlier detection methods generally suffer from low detection accuracy and cannot handle large-scale, high-dimensional data sets.
Hence there is an urgent need for an algorithm with high outlier detection accuracy that is well suited to large-scale data sets.
Invention content
The technical problem to be solved by the present invention is to provide an outlier detection method for big data that defines outliers as objects whose data features differ markedly from the data set, achieves high detection accuracy, and, through a distributed implementation of the detection method, adapts to large-scale high-dimensional data sets.
The present invention adopts the following technical solution to solve the above technical problem:
An outlier detection method for big data, comprising the following steps:
Step 1: Let data set D contain m rows and n columns, where each row is a data tuple and each column is an attribute. Scan each data tuple D_j of data set D and number it with j in turn, obtaining a new data set D1 = {(j, D_j)}, j = 1, ..., m.
Step 2: Following the rough-set definition of a "partition", scan data set D1 and group identical data tuples into classes, U/IND(A) = {C_1, C_2, ..., C_t}, where U denotes the universe, A denotes the set of all attributes, C_k denotes the k-th class (k = 1, ..., t), and t is the total number of classes. Two data tuples are identical when every attribute value of one tuple equals the corresponding attribute value of the other. Count the number of data tuples in each class C_k, and compute the knowledge entropy E(A) of all attributes with respect to the universe U.
Step 3: Choose each attribute A_i in turn and remove the column of attribute values corresponding to A_i from data set D1. For the remaining data set, group identical data tuples into classes, U/IND(A - {A_i}) = {C_1, C_2, ..., C_t}, count the number of data tuples in each class, and compute the knowledge entropy E(A - {A_i}) of the remaining attributes with respect to the universe U, i = 1, ..., n; at the same time compute the attribute significance of A_i. Sort the significances of all attributes in descending order, and select from data set D1 the attributes corresponding to the p largest significances to form a new data set D2, p < n.
Step 4: Scan data set D2, form the combinations of attribute values in D2, and count the number of occurrences of each attribute-value combination in the entire data set D2.
Step 5: Scan data set D2 and, for each data tuple of D2, form the combinations of its attribute values, obtaining the attribute-value combination set of each data tuple.
Step 6: Using steps 4 and 5, take the occurrence counts, in the entire data set D2, of the combinations in each data tuple's attribute-value combination set as that tuple's feature vector, and take the sum of the feature vector's components as the tuple's characteristic value.
Step 7: Sort the characteristic values of the data tuples in ascending order; the data tuples corresponding to the q smallest characteristic values are the outliers of data set D, q < m.
As a preferred embodiment of the present invention, the knowledge entropy E(A) of all attributes with respect to the universe U in step 2 is computed as follows:
E(A) = - Σ_{k=1}^{t} (|C_k| / |U|) · log2(|C_k| / |U|)
where C_k denotes the k-th class, k = 1, ..., t, t is the total number of classes, and U denotes the universe.
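As a minimal illustration (not part of the patent; the function name and data layout are my own), the knowledge entropy above can be computed by partitioning the tuples into classes of identical rows:

```python
from collections import Counter
from math import log2

def knowledge_entropy(rows):
    """E(A): group identical tuples into classes C_k (the partition
    U/IND(A)) and return -sum((|C_k|/|U|) * log2(|C_k|/|U|))."""
    m = len(rows)
    class_sizes = Counter(tuple(r) for r in rows)  # one entry per class C_k
    return -sum((c / m) * log2(c / m) for c in class_sizes.values())
```

On the worked example later in the document (class sizes 4, 4 and 1 over 9 tuples) this yields about 1.392.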
As a preferred embodiment of the present invention, the significance of attribute A_i in step 3 is computed as follows:
Sig(A_i) = E(A) - E(A - {A_i})
where Sig(A_i) denotes the significance of attribute A_i, E(A) denotes the knowledge entropy of all attributes with respect to the universe U, and E(A - {A_i}) denotes the knowledge entropy, with respect to U, of the attributes remaining after A_i is removed.
As a preferred embodiment of the present invention, p and q are preset positive integers.
As a preferred embodiment of the present invention, steps 4 and 5 may be performed in either order.
Compared with the prior art, the above technical solution provides the following technical effects:
1. The invention is suitable for big data. Most existing methods process small data sets on a single machine; with the development of information technology, however, data volumes and dimensionalities have increased sharply, and processing large-scale data sets on a single machine is clearly inappropriate. The present invention can be implemented on a distributed data processing platform based on the Map-Reduce programming model to detect outliers in large-scale data sets.
2. The method is simple and practical: it requires no assumptions about the distribution of the data, no domain knowledge, and no estimation model trained on the data set, which saves a great deal of time during outlier detection.
3. The method computes the significance of each attribute according to the rough-set notion of attribute significance.
4. The method was evaluated on UCI machine learning data sets of different dimensionalities; the results show that the invention achieves high outlier detection accuracy.
Description of the drawings
Fig. 1 is the flow diagram of the algorithm of the outlier detection method for big data of the present invention.
Specific implementation mode
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings. The embodiments described with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting the claims.
The present invention is an improved and comprehensive method that adopts a new definition of an outlier: an object whose data feature differs markedly from the data set is an outlier. The idea of the algorithm is to identify outliers in a data set through the features of the data: each data tuple has its own feature, and this feature depends on the entire data set. The feature is derived from the combinations of all attribute values in the tuple; the algorithm counts the occurrences of each of these combinations in the entire data set and uses the counts as the tuple's feature vector. The components of the feature vector are then summed to obtain the tuple's characteristic value, and finally the characteristic values of all tuples are sorted; the smaller the characteristic value, the more the tuple differs from the data set as a whole, i.e., the more likely it is an outlier. In addition, the invention performs dimensionality reduction on high-dimensional data based on rough-set attribute significance, so that high-dimensional data sets can be handled effectively.
To facilitate understanding of the technical solution of the present invention, the data tuple feature model, attribute significance, and the Map-Reduce-based parallelization of the algorithm are briefly introduced first.
Part One: Data tuple feature model
Let the raw data set be D, with m rows and n columns; that is, D has m data tuples, each with n attributes. Then data set D can be defined as:
D = {A_1, A_2, ..., A_n}
where A_i (1 ≤ i ≤ n) denotes the i-th column attribute of data set D.
A data tuple of the data set is denoted as:
D_j = {V_j(A_i) | 1 ≤ i ≤ n}
where D_j denotes the j-th data tuple of D (1 ≤ j ≤ m), and V_j(A_i) denotes the value in the j-th row and i-th column of D, i.e., the value of the i-th attribute in the j-th tuple.
The attribute-value combination set of data tuple D_j is defined as:
S_j = {C(y, u) | 1 ≤ y ≤ n, 1 ≤ u ≤ y}
where C(y, u) is an attribute-value combination of data tuple D_j, i.e., a combination of u unordered attribute values chosen from the tuple's y attribute values.
The feature vector and characteristic value of data tuple D_j are:
F_j = (c_1, c_2, ..., c_L),  Z_j = Σ_{l=1}^{L} c_l
where c_l is the number of occurrences in the entire data set of the l-th attribute-value combination of the combination set S_j, and L is the number of attribute-value combinations in S_j. F_j is the feature vector of data tuple D_j, and Z_j is its characteristic value.
The main goal of the algorithm is to detect outliers through the characteristic value Z_j.
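Under the assumptions of this model, the feature vector F_j and characteristic value Z_j can be sketched in Python (an illustration only; function names are mine, and combinations are keyed by column position so that equal values in different columns stay distinct):

```python
from collections import Counter
from itertools import combinations

def value_combos(row):
    """All non-empty combinations C(y, u) of a tuple's attribute values,
    for u = 1 .. y, keyed by column position."""
    indexed = list(enumerate(row))
    return [frozenset(c)
            for u in range(1, len(row) + 1)
            for c in combinations(indexed, u)]

def characteristic_values(rows):
    """Z_j for every tuple: F_j holds the occurrence count of each of the
    tuple's combinations over the whole data set; Z_j = sum(F_j)."""
    counts = Counter()
    for row in rows:
        counts.update(value_combos(row))
    return [sum(counts[c] for c in value_combos(row)) for row in rows]
```

On the reduced worked example in the embodiment (attitude and grade columns), tuples 1-8 obtain Z = 13 and tuple 9 obtains Z = 11, matching the tables given there.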
Part Two: Attribute significance
Since the dimensionality of the data set may be very high, the data set must be reduced to simplify the computation. Dimensionality reduction is carried out using the rough-set concept of "attribute significance", and the important attributes are selected for outlier detection.
The partition of the universe U by attribute set A is defined as:
U/IND(A) = {C_1, C_2, ..., C_t}
The knowledge entropy is defined as:
E(A) = - Σ_{k=1}^{t} (|C_k| / |U|) · log2(|C_k| / |U|)
where C_k denotes the k-th class and t is the total number of classes.
For any A_i ∈ A, the significance of A_i is defined as:
Sig(A_i) = E(A) - E(A - {A_i})
Using the rough-set notion of attribute significance, the significance of every attribute in the data set can be computed; the significances are then sorted to obtain an appropriate attribute ranking for outlier detection.
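A short Python sketch of this significance-based reduction (again illustrative only; names are my own):

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Knowledge entropy of the partition into classes of identical rows."""
    m = len(rows)
    return -sum((c / m) * log2(c / m)
                for c in Counter(tuple(r) for r in rows).values())

def attribute_significance(rows, names):
    """Sig(A_i) = E(A) - E(A - {A_i}): the entropy drop caused by
    removing column i from every tuple."""
    base = entropy(rows)
    return {name: base - entropy([r[:i] + r[i + 1:] for r in rows])
            for i, name in enumerate(names)}
```

On the worked example in the embodiment, Sig(gender) = 0 while attitude and grade share the largest significance, so gender is the attribute removed.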
Part Three: Map-Reduce-based parallelization of the algorithm
Map-Reduce is a parallel programming framework and currently the most popular computing model on cloud platforms. Its basic idea is to apply a divide-and-conquer strategy to large-scale data sets, operating on data in key/value format. The core of its parallelization lies in the two operations Map and Reduce: the framework first splits the data set into many small files of equal size and distributes them to different nodes; each node performs the Map computation, the results are sorted and merged, and the values sharing the same key are placed in the same set for the Reduce computation.
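The map/shuffle/reduce flow described above can be imitated in a few lines of plain Python (a toy model for illustration, not Hadoop's API):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Apply mapper to every record, group the emitted values by key
    (the shuffle/merge step), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Counting attribute-value combinations (as in stage 3, module 2 below)
# is the classic word-count pattern: emit (combination, 1), then sum.
counts = map_reduce(
    [("Diligent", "Excellent"), ("Careless", "Poor"), ("Diligent", "Excellent")],
    mapper=lambda row: [((v,), 1) for v in row],
    reducer=lambda key, values: sum(values),
)
```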
The present invention provides an algorithm based on the Map-Reduce programming framework to realize distributed execution. As shown in Fig. 1, the algorithm is divided into four stages: it first numbers every data tuple in the data set, then judges whether the dimensionality of the data set is too high; if so, dimensionality reduction is performed, otherwise outlier detection proceeds directly.
Stage 1: the algorithm scans data set D and labels each data tuple with a unique number, outputting the number and the data tuple; these records constitute the result file of this stage.
Stage 2: the algorithm performs dimensionality reduction on the data set. This stage is divided into two modules that compute attribute significances.
Module 1 scans the result file of stage 1, finds the partition U/IND(A) = {C_1, C_2, ..., C_t} of attribute set A over the data set, counts the number of data tuples in each class C_k (1 ≤ k ≤ t) so as to compute the knowledge entropy E(A), and outputs E(A) to its result file.
Module 2 scans the result file of stage 1 and the result file of module 1 of stage 2. It chooses each attribute A_i (A_i ∈ A) in turn, finds the partition U/IND(A - {A_i}) = {C_1, C_2, ..., C_t} of the remaining attributes over the data set, counts the number of data tuples in each class C_k (1 ≤ k ≤ t) so as to compute the knowledge entropy E(A - {A_i}), and then computes the attribute significance Sig(A_i). The attributes are ranked by significance, the p most significant attributes are selected, and only these p attributes of the original data set are retained in the result file.
Stage 3: this stage is divided into three modules that compute the features of the data tuples; modules 1 and 2 can run simultaneously.
Module 1 scans the result file of stage 2 module 2, computes the combinations C(y, u) of each data tuple's attribute values to form the tuple's attribute-value combination set S_j, and outputs the tuple number and S_j; these records form its result file.
Module 2 scans the result file of stage 2 module 2 and counts, over that whole file, the number of occurrences of every attribute-value combination C(y, u); it outputs each C(y, u) together with its count, and these records constitute the result file of this module.
Module 3 scans the result files of stage 3 modules 1 and 2. Using the counts from module 2, it determines, for each data tuple in the module 1 result file, the number of occurrences of each of its attribute-value combinations C(y, u); these counts form the tuple's feature vector F_j (1 ≤ j ≤ m). The components of the feature vector are summed to obtain the tuple's final characteristic value Z_j (1 ≤ j ≤ m). The tuple number and characteristic value are output, and these records constitute the result file of the module.
Stage 4: the algorithm scans the result file of stage 3 module 3 and sorts the data tuples by characteristic value; the q smallest characteristic values are found, and the corresponding tuples differ most from the data set as a whole.
Stage 1: label the data tuples.
Input: data set D
Output: data set D1, consisting of the numbers j of the data tuples of D and the tuples themselves
Map<Object,Text,Text,Text>
Input: key = offset, value = D_j
1. FOR each <key, value> DO
2.   Outkey: key
     Outvalue: value
Reduce<Text,Text,Text,Text>
1. FOR each value in valuelist DO
2.   ADD number j to the tuple
3.   Outkey: j
     Outvalue: value
In this stage the algorithm scans data set D, numbers each data tuple with j (1 ≤ j ≤ m), and outputs the number j together with D_j, forming result data set D1.
Stage 2: this stage is divided into two modules that compute attribute significances for the data set.
Module 1: compute the knowledge entropy E(A).
Input: data set D1
Output: data set D2
Map<Object,Text,Text,Text>
Input: key = offset, value = (j, D_j)
1. FOR each <key, value> DO
2.   Outkey: j
     Outvalue: value
Reduce<Text,Text,Text,Text>
1. FOR each value in valuelist DO
     IF value in Map<value, i>
       i = i + 1
     ELSE
       ADD value into Map<value, i> with i = 1
   FOR each <value, i> in Map DO
     E(A) += -(i/m) * log2(i/m)
2. Outkey: null
   Outvalue: E(A)
This module finds the partition U/IND(A) = {C_1, C_2, ..., C_t} of attribute set A over the data set, computes the knowledge entropy E(A), and outputs result data set D2.
Module 2: compute the attribute significances Sig(A_i) and perform dimensionality reduction.
Input: data sets D1 and D2 and parameter p
Output: data set D3, retaining only the significant attributes
Map<Object,Text,Text,Text>
Input: key = offset, value = E(A)
1. FOR each <key, value> DO
     SearchSet(E(A))
Map<Object,Text,Text,Text>
Input: key = offset, value = (j, D_j)
1. FOR each <key, value> DO
2.   Outkey: j
     Outvalue: value - {A_i}
Reduce<Text,Text,Text,Text>
1. FOR each value in valuelist DO
     IF value in Map<value, i>
       i = i + 1
     ELSE
       ADD value into Map<value, i> with i = 1
   FOR each <value, i> in Map DO
     E(A - {A_i}) += -(i/m) * log2(i/m)
   Sig(A_i) = E(A) - E(A - {A_i})
   ADD Sig(A_i) into Map<A_i, Sig(A_i)>
2. Outkey: null
   Outvalue: Map<A_i, Sig(A_i)>
Map<Object,Text,Text,Text>
Input: key = offset, value = (j, D_j)
1. FOR each <key, value> DO
     FOR the n - p least significant attributes in Map<A_i, Sig(A_i)> DO
       value = value - {A_i}
2. Outkey: j
   Outvalue: value
In this module a Map-Reduce program chooses each attribute A_i (A_i ∈ A) in turn, finds the partition U/IND(A - {A_i}) = {C_1, C_2, ..., C_t} of the remaining attributes over the data set, counts the number of data tuples in each class C_k (1 ≤ k ≤ t) so as to compute the knowledge entropy E(A - {A_i}), and then computes the attribute significance Sig(A_i). The attributes are ranked by significance and the p most significant are put into a set. A second Map-Reduce program then retains only these p attributes of data set D1, producing data set D3.
Stage 3: this stage is divided into three modules that compute the features of the data tuples; modules 1 and 2 can run simultaneously.
Module 1: compute the attribute-value combination set S_j of each data tuple from data set D3.
Input: data set D3
Output: data set D4, formed of the tuple numbers j and the attribute-value combination sets S_j
Map<Object,Text,Text,Text>
Input: key = offset, value = (j, D_j)
1. FOR each <key, value> DO
2.   FOR each value DO
       Combine(value)
3.   Outkey: j
     Outvalue: C(y, u)
Reduce<Text,Text,Text,Text>
1. FOR each value in valuelist DO
     ADD C(y, u) into S_j
2. Outkey: j
   Outvalue: S_j
This module scans D3, splits each record into the tuple number and the attribute values, computes the attribute-value combinations C(y, u) with the function Combine(), gathers all C(y, u) into the attribute-value combination set S_j, and outputs it, forming the new data set D4.
Module 2: count the occurrences countc of each attribute-value combination in the entire data set D3.
Input: data set D3
Output: data set D5, formed of the attribute-value combinations C(y, u) of the tuples of D3 and their counts countc
Map<Object,Text,Text,Text>
Input: key = offset, value = (j, D_j)
1. FOR each <key, value> DO
2.   FOR each value DO
       ADD Combine(value) into List vallist
     FOR each element of vallist DO
3.     Outkey: C(y, u)
       Outvalue: 1
Reduce<Text,Text,Text,Text>
1. FOR each value in valuelist DO
2.   compute the number countc of occurrences of C(y, u)
3.   Outkey: C(y, u)
     Outvalue: countc
This module scans data set D3, counts with the function Combine() the occurrences countc of every attribute-value combination of the tuples of D3, and outputs result set D5.
Module 3: compute the feature vector and characteristic value of each tuple of data set D3 from data sets D4 and D5.
Map<Object,Text,Text,Text>
Input: key = offset, value = (C(y, u), countc)
1. FOR each <key, value> DO
2.   SearchSet(C(y, u), countc)
Map<Object,Text,Text,Text>
Input: key = offset, value = (j, S_j)
1. FOR each <key, value> DO
2.   IF SearchSet.containsKey(C(y, u))
       ADD countc into F_j
     Z_j = sum of F_j
3.   Outkey: j
     Outvalue: Z_j
This module scans data set D5 and puts the attribute-value combinations C(y, u) of data set D3 and their counts countc into SearchSet as key-value pairs; it then scans the combinations C(y, u) of each tuple in data set D4, looks up their counts in SearchSet to form the feature vector F_j of each tuple of D3, sums the vector to obtain the characteristic value Z_j, and outputs the characteristic values, producing the new result data set D6.
Stage 4: sort the tuples of D6 and report the tuples with the q smallest characteristic values as outliers.
Map<Object,Text,Text,Text>
Input: key = offset, value = (j, Z_j)
1. FOR each <key, value> DO
2.   Outkey: j
     Outvalue: Z_j
Reduce<Text,Text,Text,Text>
1. FOR each value in valuelist DO
     FOR the first q values DO
       Outkey: j
       Outvalue: Z_j
In this stage the algorithm scans data set D6; after the Map and Reduce operations, the Map-Reduce framework automatically produces a global ordering of the Z_j, from which the first q outliers are obtained.
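Outside the Map-Reduce framework, the selection in stage 4 amounts to taking the q smallest characteristic values; a minimal Python equivalent (illustrative, names my own):

```python
import heapq

def smallest_q(z_by_tuple, q):
    """Return the numbers of the q tuples with the smallest
    characteristic values Z_j (the stage 4 selection)."""
    return heapq.nsmallest(q, z_by_tuple, key=z_by_tuple.get)
```

With the characteristic values of the worked example below, q = 1 returns tuple 9.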
The method of the present invention is illustrated below with a specific example.
Data set:
Tuple number | Gender | Attitude towards study | School grade
---|---|---|---
1 | Male | Diligent | Excellent
2 | Male | Diligent | Excellent
3 | Male | Diligent | Excellent
4 | Male | Diligent | Excellent
5 | Male | Careless | Poor
6 | Male | Careless | Poor
7 | Male | Careless | Poor
8 | Male | Careless | Poor
9 | Male | Careless | Excellent
1. Dimensionality reduction
The partition of the universe by the full attribute set A:
U/IND(gender, attitude towards study, school grade) = {{1,2,3,4}, {5,6,7,8}, {9}}
E(A) = -(4/9·log(4/9) + 4/9·log(4/9) + 1/9·log(1/9)) = -8/9·log(4/9) - 1/9·log(1/9)
(1) Compute the significance of the gender attribute.
The partition by A - {gender}:
U/IND(A - {gender}) = U/IND(attitude towards study, school grade) = {{1,2,3,4}, {5,6,7,8}, {9}}
E(A - {gender}) = -(4/9·log(4/9) + 4/9·log(4/9) + 1/9·log(1/9)) = -8/9·log(4/9) - 1/9·log(1/9)
Sig(gender) = E(A) - E(A - {gender}) = 0
(2) Compute the significance of the attitude-towards-study attribute.
The partition by A - {attitude towards study}:
U/IND(A - {attitude towards study}) = U/IND(gender, school grade) = {{1,2,3,4,9}, {5,6,7,8}}
E(A - {attitude towards study}) = -(5/9·log(5/9) + 4/9·log(4/9))
Sig(attitude towards study) = E(A) - E(A - {attitude towards study}) = -8/9·log(4/9) - 1/9·log(1/9) + 5/9·log(5/9) + 4/9·log(4/9)
(3) Compute the significance of the school-grade attribute.
The partition by A - {school grade}:
U/IND(A - {school grade}) = U/IND(gender, attitude towards study) = {{1,2,3,4}, {5,6,7,8,9}}
E(A - {school grade}) = -(5/9·log(5/9) + 4/9·log(4/9))
Sig(school grade) = E(A) - E(A - {school grade}) = -8/9·log(4/9) - 1/9·log(1/9) + 5/9·log(5/9) + 4/9·log(4/9)
Hence Sig(gender) < Sig(attitude towards study) = Sig(school grade), so the gender attribute is deleted.
2. Outlier detection
Tuple number | Attitude towards study | School grade
---|---|---
1 | Diligent | Excellent
2 | Diligent | Excellent
3 | Diligent | Excellent
4 | Diligent | Excellent
5 | Careless | Poor
6 | Careless | Poor
7 | Careless | Poor
8 | Careless | Poor
9 | Careless | Excellent
(1) Count the occurrences of every attribute-value combination in the entire data set:
(Diligent): 4
(Careless): 5
(Excellent): 5
(Poor): 4
(Diligent, Excellent): 4
(Careless, Excellent): 1
(Careless, Poor): 4
(2) Compute the attribute-value combination set of each data tuple.
(3) Compute the feature vector and characteristic value of each data tuple:
Tuple number | Feature vector | Characteristic value
---|---|---
1 | {4, 5, 4} | 13
2 | {4, 5, 4} | 13
3 | {4, 5, 4} | 13
4 | {4, 5, 4} | 13
5 | {5, 4, 4} | 13
6 | {5, 4, 4} | 13
7 | {5, 4, 4} | 13
8 | {5, 4, 4} | 13
9 | {5, 5, 1} | 11
(4) Sort the data tuples by characteristic value:
Tuple number | Characteristic value
---|---
9 | 11
1 | 13
2 | 13
3 | 13
4 | 13
5 | 13
6 | 13
7 | 13
8 | 13
Tuple 9 has the smallest characteristic value and can therefore be regarded as an outlier.
The above example merely illustrates the technical idea of the present invention and cannot limit its scope of protection: any change made, on the basis of the technical solution, according to the technical idea proposed by the present invention falls within the scope of protection of the present invention.
Claims (5)
1. An outlier detection method for big data, characterized by comprising the following steps:
Step 1: Let data set D contain m rows and n columns, where each row is a data tuple and each column is an attribute. Scan each data tuple D_j of data set D and number it with j in turn, obtaining a new data set D1 = {(j, D_j)}, j = 1, ..., m.
Step 2: Following the rough-set definition of a "partition", scan data set D1 and group identical data tuples into classes, U/IND(A) = {C_1, C_2, ..., C_t}, where U denotes the universe, A denotes the set of all attributes, C_k denotes the k-th class (k = 1, ..., t), and t is the total number of classes; two data tuples are identical when every attribute value of one tuple equals the corresponding attribute value of the other. Count the number of data tuples in each class C_k, and compute the knowledge entropy E(A) of all attributes with respect to the universe U.
Step 3: Choose each attribute A_i in turn and remove the column of attribute values corresponding to A_i from data set D1. For the remaining data set, group identical data tuples into classes, U/IND(A - {A_i}) = {C_1, C_2, ..., C_t}, count the number of data tuples in each class, and compute the knowledge entropy E(A - {A_i}) of the remaining attributes with respect to the universe U, i = 1, ..., n; at the same time compute the attribute significance of A_i. Sort the significances of all attributes in descending order, and select from data set D1 the attributes corresponding to the p largest significances to form a new data set D2, p < n.
Step 4: Scan data set D2, form the combinations of attribute values in D2, and count the number of occurrences of each attribute-value combination in the entire data set D2.
Step 5: Scan data set D2 and, for each data tuple of D2, form the combinations of its attribute values, obtaining the attribute-value combination set of each data tuple.
Step 6: Using steps 4 and 5, take the occurrence counts, in the entire data set D2, of the combinations in each data tuple's attribute-value combination set as that tuple's feature vector, and take the sum of the feature vector's components as the tuple's characteristic value.
Step 7: Sort the characteristic values of the data tuples in ascending order; the data tuples corresponding to the q smallest characteristic values are the outliers of data set D, q < m.
2. The outlier detection method for big data according to claim 1, characterized in that the knowledge entropy E(A) of all attributes with respect to the universe U in step 2 is computed as follows:
E(A) = - Σ_{k=1}^{t} (|C_k| / |U|) · log2(|C_k| / |U|)
where C_k denotes the k-th class, k = 1, ..., t, t is the total number of classes, and U denotes the universe.
3. The outlier detection method for big data according to claim 1, characterized in that the significance of attribute A_i in step 3 is computed as follows:
Sig(A_i) = E(A) - E(A - {A_i})
where Sig(A_i) denotes the significance of attribute A_i, E(A) denotes the knowledge entropy of all attributes with respect to the universe U, and E(A - {A_i}) denotes the knowledge entropy, with respect to U, of the attributes remaining after A_i is removed.
4. The outlier detection method for big data according to claim 1, characterized in that p and q are preset positive integers.
5. The outlier detection method for big data according to claim 1, characterized in that steps 4 and 5 may be performed in either order.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810249198.5A CN108549669A (en) | 2018-03-21 | 2018-03-21 | An outlier detection method for big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108549669A true CN108549669A (en) | 2018-09-18 |
Family
ID=63516955
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110288014A (en) * | 2019-06-21 | 2019-09-27 | 南京信息工程大学 | A kind of local Outliers Detection method based on comentropy weighting |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2018-09-18 |