CN108549669A - Outlier detection method for big data - Google Patents

Outlier detection method for big data

Info

Publication number
CN108549669A
Authority
CN
China
Prior art keywords
data
data set
attribute
tuple
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810249198.5A
Other languages
Chinese (zh)
Inventor
徐小龙
崇卫之
段卫华
贾佳
刘大勇
胥备
王俊昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201810249198.5A
Publication of CN108549669A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Abstract

The invention discloses an outlier detection method for big data. The method takes, as the feature of each data tuple, the numbers of occurrences in the entire data set of the combinations of that tuple's attribute values. Because this feature is computed from the frequency of every attribute-value combination of the tuple across the whole data set, it comprehensively and accurately reflects the degree of difference between the tuple and the data set as a whole, so tuples whose features differ markedly from the whole data set can be detected and treated as outliers. In addition, the present invention applies attribute-importance knowledge from rough set theory to reduce the dimensionality of high-dimensional data, so that high-dimensional data sets can be handled effectively. The method of the present invention achieves high outlier detection accuracy and is simple and practicable: it requires no assumptions about the distribution of the data in the data set, no domain knowledge, and no training of an estimation model on the data set, which saves a large amount of time in outlier detection.

Description

Outlier detection method for big data
Technical field
The present invention relates to an outlier detection method for big data, and belongs to the field of data preprocessing technology.
Background technology
In recent years, with the rapid development of information technology, the volume of data worldwide has kept growing at an astonishing rate, and the world has entered the big data era. How to obtain valuable data or information from complex data has become a focus of attention. Outlier detection is an important direction in data mining. It focuses on the small fraction of objects in a data set that, compared with the rest of the data, do not conform to the general model of the data set; the data in this small fraction are called outliers. Outlier detection is the data mining technique of finding unconventional patterns in massive data. Its purpose is to eliminate noise from the raw data set or to discover potentially valuable information hidden in it. It is widely applied in fields such as quality control, fault detection, financial fraud, web log analysis, medical treatment, environmental science, and smart cities. In many scientific domains, outlier data may bring new inspiration, leading to the discovery of new knowledge and the development of new applications. The detection of outliers therefore has highly important theoretical significance and practical application value, and outlier detection and analysis has become a vital task in data mining and data management. However, traditional outlier detection methods generally suffer from low detection accuracy and cannot handle large-scale, high-dimensional data sets.
Therefore, there is an urgent need for an algorithm with high outlier detection accuracy that is well suited to the environment of large-scale data sets.
Summary of the invention
The technical problem to be solved by the present invention is to provide an outlier detection method for big data, which defines objects whose data features differ markedly from the data set as outliers. The method achieves high detection accuracy, and its distributed implementation adapts to large-scale, high-dimensional data sets.
The present invention adopts the following technical scheme to solve the above technical problem:
An outlier detection method for big data, comprising the following steps:
Step 1: Let data set D contain m rows and n columns, where each row is a data tuple and each column is an attribute. Scan each data tuple Dj of data set D and number it sequentially with j, obtaining a new data set D1 = (j, Dj), j = 1, ..., m;
Step 2: According to the definition of a "partition" in rough set theory, scan data set D1 and group identical data tuples into the same class, U/IND(A) = {C1, C2, ..., Ct}, where U denotes the universe, A denotes the set of all attributes, Ck denotes the k-th class, k = 1, ..., t, and t denotes the total number of classes; two data tuples are identical when each attribute value of one equals the corresponding attribute value of the other. Count the number of data tuples in each class Ck and compute the knowledge entropy E(A) of all attributes with respect to the universe U;
Step 3: Select each attribute Ai in turn and remove the column of attribute values corresponding to Ai from data set D1. For the remaining data set, group identical data tuples into the same class, U/IND(A - {Ai}) = {C1, C2, ..., Ct}, count the number of data tuples in each class, and compute the knowledge entropy E(A - {Ai}) of the remaining attributes with respect to the universe U, i = 1, ..., n, while also computing the attribute importance of Ai. Sort the attribute importances of all attributes in descending order and select from data set D1 the attributes corresponding to the top p importances, forming a new data set D2, p < n;
Step 4: Scan data set D2, form the combinations of all attribute values in D2, and count the number of occurrences of each attribute-value combination in the entire data set D2;
Step 5: Scan data set D2 and form the combinations of the attribute values of each data tuple of D2, obtaining the attribute-value combination set of each data tuple;
Step 6: Based on steps 4 and 5, take the counts in the entire data set D2 of the combinations in each data tuple's attribute-value combination set as the feature vector of that data tuple, and take the sum of the components of the feature vector as the characteristic value of the data tuple;
Step 7: Sort the characteristic values of all data tuples in ascending order; the data tuples corresponding to the smallest q characteristic values are the outliers of data set D, q < m.
As a preferred embodiment of the present invention, the knowledge entropy E(A) of all attributes with respect to the universe U in step 2 is calculated as follows:
E(A) = -Σ_{k=1..t} (|Ck|/|U|) log2(|Ck|/|U|)
where Ck denotes the k-th class, k = 1, ..., t, t denotes the total number of classes, |Ck| denotes the number of data tuples in Ck, U denotes the universe, and |U| denotes the total number of data tuples.
As a preferred embodiment of the present invention, the attribute importance of attribute Ai in step 3 is calculated as follows:
Sig(Ai) = E(A) - E(A - {Ai})
where Sig(Ai) denotes the attribute importance of Ai, E(A) denotes the knowledge entropy of all attributes with respect to the universe U, and E(A - {Ai}) denotes the knowledge entropy of the remaining attributes with respect to U after removing Ai.
As a preferred embodiment of the present invention, p and q are preset positive integers.
As a preferred embodiment of the present invention, steps 4 and 5 are not limited to any particular order.
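For illustration only, the following single-machine Python sketch shows how steps 1 through 7 fit together. It is not the claimed distributed implementation; the function names, the toy data set, and the parameter choices p = 2 and q = 1 are assumptions made for this example.

from collections import Counter
from itertools import combinations
from math import log2

def knowledge_entropy(rows):
    # E(A) over the partition U/IND(A): identical tuples fall into one class Ck.
    m = len(rows)
    return -sum(c / m * log2(c / m) for c in Counter(rows).values())

def reduce_attributes(rows, p):
    # Step 3: keep the p attributes with the largest Sig(Ai) = E(A) - E(A - {Ai}).
    e_all = knowledge_entropy(rows)
    sig = {i: e_all - knowledge_entropy([t[:i] + t[i + 1:] for t in rows])
           for i in range(len(rows[0]))}
    keep = sorted(sorted(sig, key=sig.get, reverse=True)[:p])
    return [tuple(t[i] for i in keep) for t in rows]

def detect_outliers(rows, p, q):
    rows = reduce_attributes([tuple(t) for t in rows], p)
    # Step 4: count every attribute-value combination over the whole data set D2.
    counts = Counter(c for t in rows
                     for u in range(1, len(t) + 1)
                     for c in combinations(t, u))
    # Steps 5-7: the feature vector of a tuple is the counts of its own
    # combinations; the characteristic value Zj is their sum; the q tuples
    # with the smallest Zj are reported as outliers.
    z = {j: sum(counts[c] for u in range(1, len(t) + 1)
                for c in combinations(t, u))
         for j, t in enumerate(rows, start=1)}
    return sorted(z, key=z.get)[:q]

data = [("red", "round", "small")] * 5 + \
       [("green", "long", "small")] * 4 + \
       [("red", "long", "small")]
print(detect_outliers(data, p=2, q=1))   # -> [10]

On this toy data, the attribute whose removal leaves the partition unchanged has zero importance and is dropped in step 3, and the single tuple whose attribute-value combinations are rare receives the smallest characteristic value.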
Compared with the prior art, the above technical scheme of the present invention has the following technical effects:
1. The present invention is suitable for big data. At present most methods process small data sets on a single machine; however, with the development of informatization, data volume has surged and data dimensionality has become huge, so it is clearly inappropriate to process large-scale data sets on a single machine. The present invention can be implemented on a distributed data processing platform based on the Map-Reduce programming model to detect outliers in large-scale data sets.
2, the method for the present invention is simple and practicable, is not required to the distribution of data in master data set, domain knowledge, also need not be Training estimation model on data set, a large amount of time is saved for outlier detection.
3, the method for the present invention carries out importance calculating according to Importance of attribute sex knowledge in rough set to each attribute.
4, the method for the present invention uses UCI machine learning data, and the outlier inspection of different dimensions is carried out in multiple data sets It surveys, the results showed that the present invention has higher outlier detection accuracy rate.
Description of the drawings
Fig. 1 is the algorithm sequence diagram of the outlier detection method for big data of the present invention.
Specific embodiments
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting the claims.
The present invention is an improved and comprehensive method that gives outliers a new definition: objects whose data features differ markedly from the data set are taken as outliers. The idea of the algorithm is to identify the outliers in a data set by means of the features of the data: each data tuple in the data set has its own feature, which depends on the entire data set. This feature comes from the combinations of all attribute values of the tuple; the algorithm counts, for each data tuple, the occurrences in the entire data set of each of its attribute-value combinations, and uses these counts as the tuple's feature vector. The components of the feature vector are summed to obtain the characteristic value of each data tuple, and finally the characteristic values of all tuples are sorted: the smaller the characteristic value, the more the tuple differs from the data set as a whole, i.e., the more it qualifies as an outlier. In addition, the present invention applies attribute-importance knowledge from rough set theory to reduce the dimensionality of high-dimensional data, so that high-dimensional data sets can be handled effectively.
To facilitate understanding of the technical scheme of the present invention, the data tuple feature model, attribute importance, and the Map-Reduce-based parallelization of the algorithm are briefly introduced first.
1. Data tuple feature model
Let the raw data set be D with m rows and n columns, i.e., data set D has m data tuples and each data tuple has n attributes. Data set D can then be defined as:
D = {A1, A2, ..., An}
where Ai (1 ≤ i ≤ n) denotes the i-th column (attribute) of data set D.
A data tuple in the data set is denoted:
Dj = {Vj(Ai) | 1 ≤ j ≤ m, 1 ≤ i ≤ n}
where Dj denotes the j-th data tuple of data set D and Vj(Ai) denotes the value in row j, column i of D, i.e., the value of the i-th attribute of the j-th tuple.
The attribute-value combination set of data tuple Dj is defined as:
Sj = {C(y, u) | 1 ≤ y ≤ n, 1 ≤ u ≤ y}
where C(y, u) is an attribute-value combination of data tuple Dj, i.e., a combination of u unordered attribute values chosen from the y attribute values of the tuple.
The feature vector and characteristic value of data tuple Dj are:
Fj = (N(c1), N(c2), ..., N(cL)), Zj = Σ_{l=1..L} N(cl)
where N(cl) denotes the number of occurrences in the entire data set of the l-th attribute-value combination cl of the set Sj, and L denotes the number of attribute-value combinations in Sj. Fj is the feature vector of data tuple Dj and Zj is its characteristic value.
The main goal of the algorithm is to detect outliers through the characteristic value Zj.
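As a concrete illustration of these definitions, the short Python sketch below enumerates Sj for one tuple and evaluates Fj and Zj against given combination counts. The variable names and the example values (taken from the worked example later in this description) are assumptions for illustration, not part of the invention.

from itertools import combinations

tuple_j = ("half-hearted", "excellent")   # the attribute values of one tuple Dj

# Sj: all unordered combinations C(y, u) of the tuple's y attribute values.
S_j = [c for u in range(1, len(tuple_j) + 1)
       for c in combinations(tuple_j, u)]

# N(cl): occurrences of each combination in the entire data set.
N = {("half-hearted",): 5, ("excellent",): 5, ("half-hearted", "excellent"): 1}

F_j = [N[c] for c in S_j]   # feature vector Fj = [5, 5, 1]
Z_j = sum(F_j)              # characteristic value Zj = 11
print(F_j, Z_j)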
2. Attribute importance
Since the dimensionality of the data set may be very high, a dimensionality reduction operation on the data set is needed to reduce computational complexity. The important rough set concept of "attribute importance" is used to perform this reduction, selecting the important attributes for outlier detection.
The partition of the universe U by attribute set A is defined as:
U/IND(A) = {C1, C2, ..., Ct}
The knowledge entropy is defined as:
E(A) = -Σ_{k=1..t} (|Ck|/|U|) log2(|Ck|/|U|)
where Ck denotes the k-th class, |Ck| denotes the number of data tuples in Ck, and t denotes the total number of classes.
For any Ai ∈ A, the attribute importance of Ai is defined as:
Sig(Ai) = E(A) - E(A - {Ai})
Using attribute-importance knowledge from rough set theory, the importance of each attribute in the data set can be calculated; the importances are then sorted to obtain an appropriate attribute sequence for outlier detection.
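A minimal Python sketch of these definitions (with assumed names and toy values, not the distributed implementation) shows how the partition U/IND(A) and the knowledge entropy are obtained:

from collections import defaultdict
from math import log2

# Toy data: tuple number -> attribute values (assumed for illustration).
rows = {1: ("male", "excellent"), 2: ("male", "excellent"), 3: ("male", "poor")}

# U/IND(A): identical tuples fall into the same class Ck.
classes = defaultdict(list)
for j, values in rows.items():
    classes[values].append(j)
partition = list(classes.values())        # [[1, 2], [3]]

# E(A) = -sum(|Ck|/|U| * log2(|Ck|/|U|)) over the classes of the partition.
m = len(rows)
E_A = -sum(len(Ck) / m * log2(len(Ck) / m) for Ck in partition)
print(partition, round(E_A, 3))           # [[1, 2], [3]] 0.918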
3. Map-Reduce-based parallelization of the algorithm
Map-Reduce is a parallel programming framework and is currently the most popular computing model on cloud computing platforms. Its basic idea is a divide-and-conquer strategy for large-scale data sets. Map-Reduce operates on data in key/value format. The core of its parallelization lies in the two operations Map and Reduce: the Map-Reduce framework first splits the data set into many small files of the same size and distributes them to different nodes; each node performs the Map computation, the results are sorted and merged, and the values with the same key are placed in the same set for the Reduce computation.
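The following toy Python sketch mimics these three phases (map, shuffle/sort, reduce) on a single machine; it is an assumption for illustration, not the Hadoop API, and uses the word-count pattern that stage 3 module 2 below applies to attribute-value combinations.

from itertools import groupby

def map_fn(record):
    yield (record, 1)                 # emit one key/value pair per input record

def reduce_fn(key, values):
    return (key, sum(values))         # combine all values that share a key

records = ["excellent", "poor", "excellent", "excellent"]

# Map phase: each input record produces key/value pairs.
pairs = [kv for r in records for kv in map_fn(r)]

# Shuffle/sort phase: pairs are sorted and grouped by key.
pairs.sort(key=lambda kv: kv[0])
grouped = [(k, [v for _, v in g]) for k, g in groupby(pairs, key=lambda kv: kv[0])]

# Reduce phase: one call per key.
print([reduce_fn(k, vs) for k, vs in grouped])   # [('excellent', 3), ('poor', 1)]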
The present invention provides an algorithm based on the Map-Reduce programming framework to realize distributed execution of the method. As shown in Fig. 1, the algorithm is divided into 4 stages: first each data tuple in the data set is numbered, then the algorithm judges whether the dimensionality of the data set is too high; if the dimensionality is high, a dimensionality reduction operation is performed, and if the dimensionality is acceptable, outlier detection is carried out directly.
Stage 1: the algorithm scans data set D and labels each data tuple with a unique number, outputting the number and the data tuple of each record; these records constitute the result file of this stage.
Stage 2: the algorithm performs dimensionality reduction on the data set. This stage is divided into two modules that carry out the attribute importance calculation on the data set.
Module 1 scans the result file of stage 1 and finds the partition U/IND(A) = {C1, C2, ..., Ct} of attribute set A over the data set. It counts the number of data tuples in each class Ck (1 ≤ k ≤ t) so that the knowledge entropy E(A) can be computed, and outputs E(A) to obtain the result file.
Module 2 scans the result file of stage 1 and the result file of module 1 of stage 2. Each attribute Ai (Ai ∈ A) is selected in turn for calculation: the partition U/IND(A - {Ai}) = {C1, C2, ..., Ct} of the remaining attribute set over the data set is found, the number of data tuples in each class Ck (1 ≤ k ≤ t) is counted so that the knowledge entropy E(A - {Ai}) can be computed, and the attribute importance Sig(Ai) is then calculated. All attributes are ranked by importance, the top p important attributes are selected, and finally only these p attributes of the original data set are retained, producing the result file.
Stage 3: this stage is divided into 3 modules that compute the features of the data tuples; modules 1 and 2 can run at the same time.
Module 1 scans the result file of module 2 of stage 2, computes the attribute-value combinations C(y, u) of each data tuple to form the tuple's attribute-value combination set Sj, and outputs the data tuple number and Sj; these records form its result file.
Module 2 scans the result file of module 2 of stage 2 and counts the occurrences of every attribute-value combination C(y, u) of all data tuples in that file, outputting each C(y, u) and its count; these records constitute the result file of this module.
Module 3 scans the result files of module 1 and module 2 of stage 3. Using the statistics in the result file of module 2, it determines, for each data tuple in the result file of module 1, the count of each of the tuple's attribute-value combinations C(y, u), forming the feature vector Fj (1 ≤ j ≤ m) of the tuple. The components of the feature vector are summed to give the final characteristic value Zj (1 ≤ j ≤ m) of the tuple. The data tuple number and its characteristic value are output; these records constitute the result file of this module.
Stage 4: the algorithm scans the result file of module 3 of stage 3 and sorts its data tuples statistically, finding the q smallest characteristic values; the features of these tuples differ from the overall features.
Stage 1: labeling data tuples.
Input: data set D
Output: data set D1, consisting of the number j of each data tuple of D and the data tuple itself
Map<Object, Text, Text, Text>
Input: key = offset, value = Dj
1. FOR each <key, value> DO
2.   Outkey: key
     Outvalue: value
Reduce<Text, Text, Text, Text>
1. FOR each value in valuelist DO
2.   ADD j into tuple
3.   Outkey: j
     Outvalue: value
In this stage the algorithm scans data set D, numbers each data tuple in D with j (1 ≤ j ≤ m), and outputs the number j and Dj, forming result data set D1.
Stage 2: this stage is divided into two modules that perform the attribute importance calculation on the data set.
Module 1: compute the knowledge entropy E(A).
Input: data set D1
Output: data set D2
Map<Object, Text, Text, Text>
Input: key = offset, value = (j, Dj)
1. FOR each <key, value> DO
2.   Outkey: j
     Outvalue: value
Reduce<Text, Text, Text, Text>
1. FOR each value in valuelist DO
     IF value in Map<value, i>
       i = i + 1                  // another tuple identical to value: enlarge its class Ck
     ELSE
       ADD <value, 1> into Map    // first tuple of a new class
2. FOR each <value, i> in Map DO
     E(A) -= (i/m) * log2(i/m)
3. Outkey: null
   Outvalue: E(A)
This module finds the partition U/IND(A) = {C1, C2, ..., Ct} of attribute set A over the data set, computes the knowledge entropy E(A), and outputs result data set D2.
Module 2: compute the attribute importance Sig(Ai) and perform dimensionality reduction.
Input: data sets D1 and D2 and parameter p
Output: data set D3, retaining only the important attributes
Map<Object, Text, Text, Text>
Input: key = offset, value = E(A)
1. FOR each <key, value> DO
     SearchSet(E(A))              // load E(A) from D2 into memory
Map<Object, Text, Text, Text>
Input: key = offset, value = (j, Dj)
1. FOR each <key, value> DO
2.   Outkey: j
     Outvalue: value - {Ai}       // remove the column of the attribute Ai under test
Reduce<Text, Text, Text, Text>
1. FOR each value in valuelist DO
     IF value in Map<value, i>
       i = i + 1
     ELSE
       ADD <value, 1> into Map
   FOR each <value, i> in Map DO
     E(A - {Ai}) -= (i/m) * log2(i/m)
   Sig(Ai) = E(A) - E(A - {Ai})
   ADD Sig(Ai) into Map<Ai, Sig(Ai)>
2. Outkey: null
   Outvalue: Map<Ai, Sig(Ai)>
Map<Object, Text, Text, Text>
Input: key = offset, value = (j, Dj)
1. FOR each <key, value> DO
     FOR each of the n - p least important attributes Ai in Map<Ai, Sig(Ai)> DO
       value = value - {Ai}
2. Outkey: j
   Outvalue: value
In this module, one Map-Reduce program selects each attribute Ai (Ai ∈ A) in turn for calculation: it finds the partition U/IND(A - {Ai}) = {C1, C2, ..., Ct} of the remaining attribute set over the data set and counts the number of data tuples in each class Ck (1 ≤ k ≤ t), so that the knowledge entropy E(A - {Ai}) can be computed, and then calculates the attribute importance Sig(Ai). The attributes are sorted and the top p important attributes are put into a set. A second Map-Reduce program then retains only these p attributes of data set D1, obtaining data set D3.
Stage 3 is divided into 3 modules that compute the features of the data tuples; modules 1 and 2 can run at the same time.
Module 1: compute the attribute-value combination set Sj of each data tuple from data set D3.
Input: data set D3
Output: data set D4, formed of each data tuple number j and the tuple's attribute-value combination set Sj
Map<Object, Text, Text, Text>
Input: key = offset, value = (j, Dj)
1. FOR each <key, value> DO
2.   FOR each value DO
       Combine(value)             // generate the attribute-value combinations C(y, u)
3.   Outkey: j
     Outvalue: C(y, u)
Reduce<Text, Text, Text, Text>
1. FOR each C(y, u) in valuelist DO
     ADD C(y, u) into Sj
2. Outkey: j
   Outvalue: Sj
The algorithm of this module scans D3, splits each data tuple of D3 into the tuple number and the attribute values, computes the attribute-value combinations C(y, u) through the function Combine(), gathers all C(y, u) into the attribute-value combination set Sj and outputs it, forming the new data set D4.
Module 2: count the occurrences countc of each attribute-value combination in the entire data set D3.
Input: data set D3
Output: data set D5, formed of each attribute-value combination C(y, u) in D3 and its count countc
Map<Object, Text, Text, Text>
Input: key = offset, value = (j, Dj)
1. FOR each <key, value> DO
2.   FOR each value DO
       ADD Combine(value) into List vallist
3.   FOR each C(y, u) in vallist DO
       Outkey: C(y, u)
       Outvalue: 1
Reduce<Text, Text, Text, Text>
1. FOR each in valuelist DO
2.   countc = number of occurrences of C(y, u)   // sum the 1s emitted for this key
3. Outkey: C(y, u)
   Outvalue: countc
The algorithm of this module scans data set D3 and, through the function Combine(), counts the occurrences countc of the attribute-value combinations of each data tuple Dj in D3, outputting result set D5.
Module 3: compute the feature vector and characteristic value of each data tuple of data set D3 from data sets D4 and D5.
Map<Object, Text, Text, Text>
Input: key = offset, value = (C(y, u), countc)
1. FOR each <key, value> DO
2.   SearchSet(C(y, u), countc)   // load the combination counts from D5 into memory
Map<Object, Text, Text, Text>
Input: key = offset, value = (j, Sj)
1. FOR each <key, value> DO
2.   IF SearchSet.containsKey(C(y, u))
       ADD countc into Fj
     Zj = sum of Fj
3. Outkey: j
   Outvalue: Zj
The algorithm of this module scans data set D5 and puts the attribute-value combinations C(y, u) of data set D3 and their counts countc into SearchSet as key-value pairs; it then scans the attribute-value combinations C(y, u) of each data tuple in data set D4 and looks up the count of each C(y, u) in SearchSet, forming the feature vector Fj of each data tuple of D3. The feature vector is summed to obtain the characteristic value Zj, which is output, giving the new result data set D6.
Stage 4: sort the data tuples of D6; the q smallest characteristic values identify the outliers.
Map<Object, Text, Text, Text>
Input: key = offset, value = (j, Zj)
1. FOR each <key, value> DO
2.   Outkey: j
     Outvalue: Zj
Reduce<Text, Text, Text, Text>
1. FOR each in valuelist DO
     FOR the first q values DO
       Outkey: j
       Outvalue: Zj
The algorithm of this stage scans data set D6; after the Map and Reduce operations, the Map-Reduce framework automatically performs a global sort on Zj, yielding the first q outliers.
The method of the present invention is illustrated below with a specific embodiment.
Data set:

Tuple number   Gender   Attitude toward study   School grade
1              Male     Conscientious           Excellent
2              Male     Conscientious           Excellent
3              Male     Conscientious           Excellent
4              Male     Conscientious           Excellent
5              Male     Half-hearted            Poor
6              Male     Half-hearted            Poor
7              Male     Half-hearted            Poor
8              Male     Half-hearted            Poor
9              Male     Half-hearted            Excellent
1. Dimensionality reduction
The partition of the universe by attribute set A:
U/IND(gender, attitude toward study, school grade) = {{1,2,3,4}, {5,6,7,8}, {9}}
E(A) = -(4/9·log(4/9) + 4/9·log(4/9) + 1/9·log(1/9)) = -8/9·log(4/9) - 1/9·log(1/9)
(1) Compute the attribute importance of gender:
The partition by A - {gender}:
U/IND(A - {gender}) = U/IND(attitude toward study, school grade) = {{1,2,3,4}, {5,6,7,8}, {9}}
E(A - {gender}) = -(4/9·log(4/9) + 4/9·log(4/9) + 1/9·log(1/9)) = -8/9·log(4/9) - 1/9·log(1/9)
Sig(gender) = E(A) - E(A - {gender}) = 0
(2) Importance of Attributes of attitude towards study is calculated:
A- attitudes towards study divide domain:
U/IND (A- attitudes towards study)=U/IND (gender, school grade)={ { 1,2,3,4,9 }, { 5,6,7,8 } }
E (A- attitudes towards study)=- (5/9log (5/9)+4/9log (4/9))=- 5/9log (5/9) -4/9log (4/9)
Sig (attitude towards study)=E (A)-E (A- attitudes towards study)=- 8/9log (4/9) -1/9log1/9+5/9log (5/9) +4/9log(4/9)
(3) Importance of Attributes of school grade is calculated:
A- school grades divide domain:
U/IND (A- school grades)==U/IND (gender, attitude towards study)={ { 1,2,3,4 }, { 5,6,7,8,9 } }
E (A- school grades)=- (5/9log (5/9)+4/9log (4/9))=- 5/9log (5/9) -4/9log (4/9)
Sig (school grade)=E (A)-E (A- school grades)=- 8/9log (4/9) -1/9log1/9+5/9log (5/9) +4/9log(4/9)
Thus: Sig(gender) < Sig(attitude toward study) = Sig(school grade)
Therefore the gender attribute is deleted.
2. Outlier detection
Tuple number   Attitude toward study   School grade
1              Conscientious           Excellent
2              Conscientious           Excellent
3              Conscientious           Excellent
4              Conscientious           Excellent
5              Half-hearted            Poor
6              Half-hearted            Poor
7              Half-hearted            Poor
8              Half-hearted            Poor
9              Half-hearted            Excellent
(1) Count the occurrences of each attribute-value combination in the entire data set:
(Conscientious): 4
(Half-hearted): 5
(Excellent): 5
(Poor): 4
(Conscientious, Excellent): 4
(Half-hearted, Excellent): 1
(Half-hearted, Poor): 4
(2) Compute the attribute-value combination set of each data tuple; for example, the combination set of tuple 1 is {(Conscientious), (Excellent), (Conscientious, Excellent)}.
(3) Compute the feature vector and characteristic value of each data tuple:
Tuple number   Feature vector   Characteristic value
1              {4, 5, 4}        13
2              {4, 5, 4}        13
3              {4, 5, 4}        13
4              {4, 5, 4}        13
5              {5, 4, 4}        13
6              {5, 4, 4}        13
7              {5, 4, 4}        13
8              {5, 4, 4}        13
9              {5, 5, 1}        11
(4) Sort the data tuples by characteristic value:
Tuple number   Characteristic value
9              11
1              13
2              13
3              13
4              13
5              13
6              13
7              13
8              13
Tuple 9 has the smallest characteristic value and can therefore be regarded as an outlier.
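This worked example can be reproduced with a short single-machine Python script (names assumed; this is a check of the arithmetic, not the Map-Reduce implementation described above):

from collections import Counter
from itertools import combinations

# The reduced data set after deleting the gender attribute (tuples 1-9).
rows = [("Conscientious", "Excellent")] * 4 + \
       [("Half-hearted", "Poor")] * 4 + \
       [("Half-hearted", "Excellent")]

# Step (1): occurrences of every attribute-value combination in the data set.
counts = Counter(c for t in rows
                 for u in range(1, len(t) + 1)
                 for c in combinations(t, u))

# Steps (2)-(4): characteristic value of each tuple and the resulting order.
for j, t in enumerate(rows, start=1):
    Zj = sum(counts[c] for u in range(1, len(t) + 1) for c in combinations(t, u))
    print(j, Zj)   # tuples 1-8 print 13; tuple 9 prints 11 and is the outlier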
The above embodiment merely illustrates the technical idea of the present invention and cannot be used to limit the protection scope of the present invention. Any change made on the basis of the technical scheme in accordance with the technical idea proposed by the present invention falls within the protection scope of the present invention.

Claims (5)

1. An outlier detection method for big data, characterized by comprising the following steps:
Step 1: Let data set D contain m rows and n columns, where each row is a data tuple and each column is an attribute. Scan each data tuple Dj of data set D and number it sequentially with j, obtaining a new data set D1 = (j, Dj), j = 1, ..., m;
Step 2: According to the definition of a "partition" in rough set theory, scan data set D1 and group identical data tuples into the same class, U/IND(A) = {C1, C2, ..., Ct}, where U denotes the universe, A denotes the set of all attributes, Ck denotes the k-th class, k = 1, ..., t, and t denotes the total number of classes; two data tuples are identical when each attribute value of one equals the corresponding attribute value of the other. Count the number of data tuples in each class Ck and compute the knowledge entropy E(A) of all attributes with respect to the universe U;
Step 3: Select each attribute Ai in turn and remove the column of attribute values corresponding to Ai from data set D1. For the remaining data set, group identical data tuples into the same class, U/IND(A - {Ai}) = {C1, C2, ..., Ct}, count the number of data tuples in each class, and compute the knowledge entropy E(A - {Ai}) of the remaining attributes with respect to the universe U, i = 1, ..., n, while also computing the attribute importance of Ai. Sort the attribute importances of all attributes in descending order and select from data set D1 the attributes corresponding to the top p importances, forming a new data set D2, p < n;
Step 4: Scan data set D2, form the combinations of all attribute values in D2, and count the number of occurrences of each attribute-value combination in the entire data set D2;
Step 5: Scan data set D2 and form the combinations of the attribute values of each data tuple of D2, obtaining the attribute-value combination set of each data tuple;
Step 6: Based on steps 4 and 5, take the counts in the entire data set D2 of the combinations in each data tuple's attribute-value combination set as the feature vector of that data tuple, and take the sum of the components of the feature vector as the characteristic value of the data tuple;
Step 7: Sort the characteristic values of all data tuples in ascending order; the data tuples corresponding to the smallest q characteristic values are the outliers of data set D, q < m.
2. The outlier detection method for big data according to claim 1, characterized in that the knowledge entropy E(A) of all attributes with respect to the universe U in step 2 is calculated as follows:
E(A) = -Σ_{k=1..t} (|Ck|/|U|) log2(|Ck|/|U|)
where Ck denotes the k-th class, k = 1, ..., t, t denotes the total number of classes, |Ck| denotes the number of data tuples in Ck, U denotes the universe, and |U| denotes the total number of data tuples.
3. The outlier detection method for big data according to claim 1, characterized in that the attribute importance of attribute Ai in step 3 is calculated as follows:
Sig(Ai) = E(A) - E(A - {Ai})
where Sig(Ai) denotes the attribute importance of Ai, E(A) denotes the knowledge entropy of all attributes with respect to the universe U, and E(A - {Ai}) denotes the knowledge entropy of the remaining attributes with respect to U after removing Ai.
4. The outlier detection method for big data according to claim 1, characterized in that p and q are preset positive integers.
5. The outlier detection method for big data according to claim 1, characterized in that steps 4 and 5 are not limited to any particular order.
CN201810249198.5A 2018-03-21 2018-03-21 Outlier detection method for big data Pending CN108549669A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810249198.5A CN108549669A (en) 2018-03-21 2018-03-21 Outlier detection method for big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810249198.5A CN108549669A (en) 2018-03-21 2018-03-21 Outlier detection method for big data

Publications (1)

Publication Number Publication Date
CN108549669A (en) 2018-09-18

Family

ID=63516955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810249198.5A Pending CN108549669A (en) Outlier detection method for big data

Country Status (1)

Country Link
CN (1) CN108549669A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288014A (en) * 2019-06-21 2019-09-27 南京信息工程大学 A kind of local Outliers Detection method based on comentropy weighting



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180918